{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "7fb27b941602401d91542211134fc71a",
      "metadata": {
        "id": "intro-quick-start",
        "language": "markdown"
      },
      "source": [
        "# Quick Start\n",
        "\n",
        "Notebook-style examples, using real `nmdc_client` modules.\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "acae54e37e7d407bbb7b55eff062a284",
      "metadata": {
        "id": "ex1-overview",
        "language": "markdown"
      },
      "source": [
        "## Example 1: Study -> Biosamples -> Data Objects\n",
        "\n",
        "**Goal:** Start from an NMDC study name, retrieve matching study metadata records.  Then collect all linked biosamples and data objects' metadata records that are related to that study.\n",
        "\n",
        "### Step 1: Find matching study records"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "id": "9a63283cbaf04dbcab1f6479b197f3a8",
      "metadata": {
        "id": "ex1-step1",
        "language": "python"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Studies found: 1\n",
            "Study ID: nmdc:sty-11-8ws97026\n"
          ]
        }
      ],
      "source": [
        "from nmdc_client import StudySearch\n",
        "\n",
        "study_client = StudySearch()\n",
        "\n",
        "study_name = (\n",
        "    \"Molecular mechanisms underlying changes in the temperature sensitive \"\n",
        "    \"respiration response of forest soils to long-term experimental warming\"\n",
        ")\n",
        "\n",
        "studies = study_client.get_record_by_attribute(\n",
        "    attribute_name=\"name\",\n",
        "    attribute_value=study_name,\n",
        "    exact_match=True,\n",
        ")\n",
        "\n",
        "print(f\"Studies found: {len(studies)}\")\n",
        "if studies:\n",
        "    print(f\"Study ID: {studies[0]['id']}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8dd0d8092fe74a7c96281538738b07e2",
      "metadata": {
        "id": "ex1-step2-text",
        "language": "markdown"
      },
      "source": [
        "### Step 2: Get linked biosamples\n",
        "\n",
        "If at least one study is found, use the first study ID to request linked biosample records. The `get_linked_instances` method called like this will return a list of fully hydrated biosample metadata records. Note that the `get_linked_instances` method can be used to retrieve linked records of any type, not just biosamples, that are associated with any list of IDs (not just study IDs).  It is available on both the `StudySearch` and `NMDCSearch` clients, as well as all the additional search clients that are available in `nmdc_client`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "id": "72eea5119410473aa328ad9291626812",
      "metadata": {
        "id": "ex1-step2-code",
        "language": "python"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Biosamples found: 42\n",
            "Biosample IDs collected: 42\n",
            "Example biosample record: \n",
            "{'id': 'nmdc:bsm-11-7twwzs96', 'type': 'nmdc:Biosample', '_downstream_of': ['nmdc:sty-11-8ws97026'], 'name': 'Inc-BW-C-30-O', 'associated_studies': ['nmdc:sty-11-8ws97026'], 'env_broad_scale': {'type': 'nmdc:ControlledIdentifiedTermValue', 'has_raw_value': 'forest biome [ENVO:01000174]', 'term': {'id': 'ENVO:01000174', 'type': 'nmdc:OntologyClass', 'name': 'forest biome'}}, 'env_local_scale': {'type': 'nmdc:ControlledIdentifiedTermValue', 'has_raw_value': 'organic horizon [ENVO:03600018]', 'term': {'id': 'ENVO:03600018', 'type': 'nmdc:OntologyClass', 'name': 'organic horizon'}}, 'env_medium': {'type': 'nmdc:ControlledIdentifiedTermValue', 'has_raw_value': 'forest soil [ENVO:00002261]', 'term': {'id': 'ENVO:00002261', 'type': 'nmdc:OntologyClass', 'name': 'forest soil'}}, 'samp_name': 'Inc-BW-C-30-O', 'collection_date': {'type': 'nmdc:TimestampValue', 'has_raw_value': '2017-05-24'}, 'depth': {'type': 'nmdc:QuantityValue', 'has_raw_value': '0 - .02', 'has_maximum_numeric_value': 0.02, 'has_minimum_numeric_value': 0.0, 'has_unit': 'm'}, 'ecosystem': 'Environmental', 'ecosystem_category': 'Terrestrial', 'ecosystem_subtype': 'Temperate forest', 'ecosystem_type': 'Soil', 'elev': 302.0, 'env_package': {'type': 'nmdc:TextValue', 'has_raw_value': 'soil'}, 'experimental_factor': {'type': 'nmdc:ControlledTermValue', 'has_raw_value': 'dideuterium oxide [CHEBI:41981]', 'term': {'id': 'CHEBI:41981', 'type': 'nmdc:OntologyClass', 'name': 'dideuterium oxide'}}, 'geo_loc_name': {'type': 'nmdc:TextValue', 'has_raw_value': 'USA: Massachusetts, Petersham'}, 'growth_facil': {'type': 'nmdc:ControlledTermValue', 'has_raw_value': 'lab_incubation'}, 'lat_lon': {'type': 'nmdc:GeolocationValue', 'has_raw_value': '42.481016 -72.178343', 'latitude': 42.481016, 'longitude': -72.178343}, 'samp_store_temp': {'type': 'nmdc:QuantityValue', 'has_raw_value': '-80 Celsius', 'has_numeric_value': -80.0, 'has_unit': 'Cel'}, 'specific_ecosystem': 'O horizon/Organic', 'store_cond': {'type': 'nmdc:TextValue', 'has_raw_value': 'frozen'}, 'analysis_type': ['metagenomics', 'metatranscriptomics'], 'sample_link': ['nmdc:bsm-11-3n3k2m62'], 'gold_biosample_identifiers': ['gold:Gb0157183']}\n"
          ]
        }
      ],
      "source": [
        "biosamples = []\n",
        "biosample_ids = []\n",
        "\n",
        "if studies:\n",
        "    study_id = studies[0][\"id\"]\n",
        "    biosamples = study_client.get_linked_instances(\n",
        "        ids=[study_id],\n",
        "        types=[\"nmdc:Biosample\"],\n",
        "        hydrate=True,\n",
        "    )\n",
        "    biosample_ids = [record[\"id\"] for record in biosamples if \"id\" in record]\n",
        "\n",
        "print(f\"Biosamples found: {len(biosamples)}\")\n",
        "print(f\"Biosample IDs collected: {len(biosample_ids)}\")\n",
        "print(f\"Example biosample record: \\n{biosamples[0] if biosamples else 'No biosamples found'}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8edb47106e1a46a883d545849b8ab81b",
      "metadata": {
        "id": "ex1-step3-text",
        "language": "markdown"
      },
      "source": [
        "### Step 3: Get linked data objects\n",
        "\n",
        "Use biosample IDs as seeds for another linked-instances query targeting `nmdc:DataObject`.  For this example, we'll just pull from a subset (5) of the biosample IDs to avoid pulling too many records, but in practice you could pull all linked data objects if desired."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "id": "10185d26023b46108eb7d9f57d49d2b3",
      "metadata": {
        "id": "ex1-step3-code",
        "language": "python"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Data objects found: 507\n",
            "\n",
            "Example data object \n",
            "ID: nmdc:dobj-14-1e9qeq49\n",
            "Name: Hawkes_neg.ref\n",
            "Description: Natural organic matter negative electrospray mode reference calibration file for Calibration object nmdc:calib-14-hhn3qb47\n",
            "Link for downloading: https://nmdcdemo.emsl.pnnl.gov/nom/reference_calibration_files/Hawkes_neg.ref\n"
          ]
        }
      ],
      "source": [
        "data_objects = []\n",
        "\n",
        "if biosample_ids:\n",
        "    data_objects = study_client.get_linked_instances(\n",
        "        ids=biosample_ids[0:5],  # Just using the first 5 biosample IDs for this example\n",
        "        types=[\"nmdc:DataObject\"],\n",
        "        hydrate=True,\n",
        "    )\n",
        "\n",
        "print(f\"Data objects found: {len(data_objects)}\")\n",
        "if data_objects:\n",
        "    print(f\"\\nExample data object \\nID: {data_objects[0].get('id')}\")\n",
        "    print(f\"Name: {data_objects[0].get('name')}\")\n",
        "    print(f\"Description: {data_objects[0].get('description')}\")\n",
        "    print(f\"Link for downloading: {data_objects[0].get('url')}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8763a12b2bbd4a93a75aff182afb95dc",
      "metadata": {
        "id": "ex2-overview",
        "language": "markdown"
      },
      "source": [
        "---\n",
        "\n",
        "## Example 2: Data Object Type -> Data Objects -> Biosample Metadata\n",
        "\n",
        "**Goal:** Start from a `data_object_type` value, retrieve matching data objects, then resolve associated biosample metadata. See schema documentation for more details on the `data_object_type` property and its allowed values: https://microbiomedata.github.io/nmdc-schema/FileTypeEnum/.\n",
        "\n",
        "### Step 1: Retrieve data objects by type"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "id": "7623eae2785240b9bd12b16a66d81610",
      "metadata": {
        "id": "ex2-step1",
        "language": "python"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Data objects found: 25\n",
            "Data object IDs collected: 25\n"
          ]
        },
        {
          "data": {
            "application/vnd.microsoft.datawrangler.viewer.v0+json": {
              "columns": [
                {
                  "name": "index",
                  "rawType": "int64",
                  "type": "integer"
                },
                {
                  "name": "id",
                  "rawType": "str",
                  "type": "string"
                },
                {
                  "name": "type",
                  "rawType": "str",
                  "type": "string"
                },
                {
                  "name": "name",
                  "rawType": "str",
                  "type": "string"
                },
                {
                  "name": "file_size_bytes",
                  "rawType": "float64",
                  "type": "float"
                },
                {
                  "name": "md5_checksum",
                  "rawType": "str",
                  "type": "string"
                },
                {
                  "name": "data_object_type",
                  "rawType": "str",
                  "type": "string"
                },
                {
                  "name": "was_generated_by",
                  "rawType": "str",
                  "type": "string"
                },
                {
                  "name": "url",
                  "rawType": "str",
                  "type": "string"
                },
                {
                  "name": "description",
                  "rawType": "str",
                  "type": "string"
                },
                {
                  "name": "data_category",
                  "rawType": "str",
                  "type": "string"
                },
                {
                  "name": "alternative_identifiers",
                  "rawType": "object",
                  "type": "unknown"
                },
                {
                  "name": "in_manifest",
                  "rawType": "object",
                  "type": "unknown"
                }
              ],
              "ref": "54b290b5-35b9-4723-981a-daaf0ef96e96",
              "rows": [
                [
                  "0",
                  "nmdc:dobj-11-00pns528",
                  "nmdc:DataObject",
                  "52441.2.335479.CACGTTGT-ACAACGTG.fastq.gz",
                  "10028966274.0",
                  "0c70f5574024426432ea03eb3f130e01",
                  "Metagenome Raw Reads",
                  "nmdc:omprc-11-rdvzce03",
                  "https://data.microbiomedata.org/data/nmdc:omprc-11-rdvzce03/nmdc:omprc-11-rdvzce03/52441.2.335479.CACGTTGT-ACAACGTG.fastq.gz",
                  "Metagenome Raw Reads for nmdc:omprc-11-rdvzce03",
                  "instrument_data",
                  null,
                  null
                ],
                [
                  "1",
                  "nmdc:dobj-11-00xqnn30",
                  "nmdc:DataObject",
                  "12844.2.289969.GTTCAACC-GGTTGAAC.fastq.gz",
                  "8449439373.0",
                  "c8b53dad43beabb80d768f81ec1f0f9b",
                  "Metagenome Raw Reads",
                  "nmdc:omprc-11-g8x3ed38",
                  "https://data.microbiomedata.org/data/nmdc:omprc-11-g8x3ed38/nmdc:omprc-11-g8x3ed38/12844.2.289969.GTTCAACC-GGTTGAAC.fastq.gz",
                  "Metagenome Raw Reads for nmdc:omprc-11-g8x3ed38",
                  "instrument_data",
                  null,
                  null
                ],
                [
                  "2",
                  "nmdc:dobj-11-00y67656",
                  "nmdc:DataObject",
                  "52437.1.333590.GAACGCTT-AAGCGTTC.fastq.gz",
                  "10974063112.0",
                  "d56df876ceb5006d0d0546c8e67500ee",
                  "Metagenome Raw Reads",
                  "nmdc:omprc-11-3pn7ex35",
                  "https://data.microbiomedata.org/data/nmdc:omprc-11-3pn7ex35/nmdc:omprc-11-3pn7ex35/52437.1.333590.GAACGCTT-AAGCGTTC.fastq.gz",
                  "Metagenome Raw Reads for nmdc:omprc-11-3pn7ex35",
                  "instrument_data",
                  null,
                  null
                ],
                [
                  "3",
                  "nmdc:dobj-11-02dj2e39",
                  "nmdc:DataObject",
                  "52561.2.384837.AGTACCGT-CTAGACTG.fastq.gz",
                  "8848502361.0",
                  "9dd6311f11abe3a960796d2c227d9d19",
                  "Metagenome Raw Reads",
                  "nmdc:omprc-11-dsv4yv97",
                  "https://data.microbiomedata.org/data/nmdc:omprc-11-dsv4yv97/nmdc:omprc-11-dsv4yv97/52561.2.384837.AGTACCGT-CTAGACTG.fastq.gz",
                  "Metagenome Raw Reads for nmdc:omprc-11-dsv4yv97",
                  "instrument_data",
                  null,
                  null
                ],
                [
                  "4",
                  "nmdc:dobj-11-02n14844",
                  "nmdc:DataObject",
                  "52437.3.333700.ACCTCTGT-ACAGAGGT.fastq.gz",
                  "11009293694.0",
                  "1978fc04e0ce845651f0bdd4e0be1eb3",
                  "Metagenome Raw Reads",
                  "nmdc:omprc-11-0w5g0a55",
                  "https://data.microbiomedata.org/data/nmdc:omprc-11-0w5g0a55/nmdc:omprc-11-0w5g0a55/52437.3.333700.ACCTCTGT-ACAGAGGT.fastq.gz",
                  "Metagenome Raw Reads for nmdc:omprc-11-0w5g0a55",
                  "instrument_data",
                  null,
                  null
                ]
              ],
              "shape": {
                "columns": 12,
                "rows": 5
              }
            },
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>id</th>\n",
              "      <th>type</th>\n",
              "      <th>name</th>\n",
              "      <th>file_size_bytes</th>\n",
              "      <th>md5_checksum</th>\n",
              "      <th>data_object_type</th>\n",
              "      <th>was_generated_by</th>\n",
              "      <th>url</th>\n",
              "      <th>description</th>\n",
              "      <th>data_category</th>\n",
              "      <th>alternative_identifiers</th>\n",
              "      <th>in_manifest</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>nmdc:dobj-11-00pns528</td>\n",
              "      <td>nmdc:DataObject</td>\n",
              "      <td>52441.2.335479.CACGTTGT-ACAACGTG.fastq.gz</td>\n",
              "      <td>1.002897e+10</td>\n",
              "      <td>0c70f5574024426432ea03eb3f130e01</td>\n",
              "      <td>Metagenome Raw Reads</td>\n",
              "      <td>nmdc:omprc-11-rdvzce03</td>\n",
              "      <td>https://data.microbiomedata.org/data/nmdc:ompr...</td>\n",
              "      <td>Metagenome Raw Reads for nmdc:omprc-11-rdvzce03</td>\n",
              "      <td>instrument_data</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>nmdc:dobj-11-00xqnn30</td>\n",
              "      <td>nmdc:DataObject</td>\n",
              "      <td>12844.2.289969.GTTCAACC-GGTTGAAC.fastq.gz</td>\n",
              "      <td>8.449439e+09</td>\n",
              "      <td>c8b53dad43beabb80d768f81ec1f0f9b</td>\n",
              "      <td>Metagenome Raw Reads</td>\n",
              "      <td>nmdc:omprc-11-g8x3ed38</td>\n",
              "      <td>https://data.microbiomedata.org/data/nmdc:ompr...</td>\n",
              "      <td>Metagenome Raw Reads for nmdc:omprc-11-g8x3ed38</td>\n",
              "      <td>instrument_data</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>nmdc:dobj-11-00y67656</td>\n",
              "      <td>nmdc:DataObject</td>\n",
              "      <td>52437.1.333590.GAACGCTT-AAGCGTTC.fastq.gz</td>\n",
              "      <td>1.097406e+10</td>\n",
              "      <td>d56df876ceb5006d0d0546c8e67500ee</td>\n",
              "      <td>Metagenome Raw Reads</td>\n",
              "      <td>nmdc:omprc-11-3pn7ex35</td>\n",
              "      <td>https://data.microbiomedata.org/data/nmdc:ompr...</td>\n",
              "      <td>Metagenome Raw Reads for nmdc:omprc-11-3pn7ex35</td>\n",
              "      <td>instrument_data</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>nmdc:dobj-11-02dj2e39</td>\n",
              "      <td>nmdc:DataObject</td>\n",
              "      <td>52561.2.384837.AGTACCGT-CTAGACTG.fastq.gz</td>\n",
              "      <td>8.848502e+09</td>\n",
              "      <td>9dd6311f11abe3a960796d2c227d9d19</td>\n",
              "      <td>Metagenome Raw Reads</td>\n",
              "      <td>nmdc:omprc-11-dsv4yv97</td>\n",
              "      <td>https://data.microbiomedata.org/data/nmdc:ompr...</td>\n",
              "      <td>Metagenome Raw Reads for nmdc:omprc-11-dsv4yv97</td>\n",
              "      <td>instrument_data</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>nmdc:dobj-11-02n14844</td>\n",
              "      <td>nmdc:DataObject</td>\n",
              "      <td>52437.3.333700.ACCTCTGT-ACAGAGGT.fastq.gz</td>\n",
              "      <td>1.100929e+10</td>\n",
              "      <td>1978fc04e0ce845651f0bdd4e0be1eb3</td>\n",
              "      <td>Metagenome Raw Reads</td>\n",
              "      <td>nmdc:omprc-11-0w5g0a55</td>\n",
              "      <td>https://data.microbiomedata.org/data/nmdc:ompr...</td>\n",
              "      <td>Metagenome Raw Reads for nmdc:omprc-11-0w5g0a55</td>\n",
              "      <td>instrument_data</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                      id             type  \\\n",
              "0  nmdc:dobj-11-00pns528  nmdc:DataObject   \n",
              "1  nmdc:dobj-11-00xqnn30  nmdc:DataObject   \n",
              "2  nmdc:dobj-11-00y67656  nmdc:DataObject   \n",
              "3  nmdc:dobj-11-02dj2e39  nmdc:DataObject   \n",
              "4  nmdc:dobj-11-02n14844  nmdc:DataObject   \n",
              "\n",
              "                                        name  file_size_bytes  \\\n",
              "0  52441.2.335479.CACGTTGT-ACAACGTG.fastq.gz     1.002897e+10   \n",
              "1  12844.2.289969.GTTCAACC-GGTTGAAC.fastq.gz     8.449439e+09   \n",
              "2  52437.1.333590.GAACGCTT-AAGCGTTC.fastq.gz     1.097406e+10   \n",
              "3  52561.2.384837.AGTACCGT-CTAGACTG.fastq.gz     8.848502e+09   \n",
              "4  52437.3.333700.ACCTCTGT-ACAGAGGT.fastq.gz     1.100929e+10   \n",
              "\n",
              "                       md5_checksum      data_object_type  \\\n",
              "0  0c70f5574024426432ea03eb3f130e01  Metagenome Raw Reads   \n",
              "1  c8b53dad43beabb80d768f81ec1f0f9b  Metagenome Raw Reads   \n",
              "2  d56df876ceb5006d0d0546c8e67500ee  Metagenome Raw Reads   \n",
              "3  9dd6311f11abe3a960796d2c227d9d19  Metagenome Raw Reads   \n",
              "4  1978fc04e0ce845651f0bdd4e0be1eb3  Metagenome Raw Reads   \n",
              "\n",
              "         was_generated_by                                                url  \\\n",
              "0  nmdc:omprc-11-rdvzce03  https://data.microbiomedata.org/data/nmdc:ompr...   \n",
              "1  nmdc:omprc-11-g8x3ed38  https://data.microbiomedata.org/data/nmdc:ompr...   \n",
              "2  nmdc:omprc-11-3pn7ex35  https://data.microbiomedata.org/data/nmdc:ompr...   \n",
              "3  nmdc:omprc-11-dsv4yv97  https://data.microbiomedata.org/data/nmdc:ompr...   \n",
              "4  nmdc:omprc-11-0w5g0a55  https://data.microbiomedata.org/data/nmdc:ompr...   \n",
              "\n",
              "                                       description    data_category  \\\n",
              "0  Metagenome Raw Reads for nmdc:omprc-11-rdvzce03  instrument_data   \n",
              "1  Metagenome Raw Reads for nmdc:omprc-11-g8x3ed38  instrument_data   \n",
              "2  Metagenome Raw Reads for nmdc:omprc-11-3pn7ex35  instrument_data   \n",
              "3  Metagenome Raw Reads for nmdc:omprc-11-dsv4yv97  instrument_data   \n",
              "4  Metagenome Raw Reads for nmdc:omprc-11-0w5g0a55  instrument_data   \n",
              "\n",
              "  alternative_identifiers in_manifest  \n",
              "0                     NaN         NaN  \n",
              "1                     NaN         NaN  \n",
              "2                     NaN         NaN  \n",
              "3                     NaN         NaN  \n",
              "4                     NaN         NaN  "
            ]
          },
          "execution_count": 4,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import json\n",
        "\n",
        "import pandas as pd\n",
        "\n",
        "from nmdc_client import DataObjectSearch\n",
        "\n",
        "data_object_client = DataObjectSearch()\n",
        "\n",
        "data_object_type = \"Metagenome Raw Reads\"\n",
        "filter_str = json.dumps({\"data_object_type\": data_object_type})\n",
        "\n",
        "data_object_records = data_object_client.get_record_by_filter(\n",
        "    filter=filter_str,\n",
        "    all_pages=False,  # Set to True to retrieve all matching records across all pages of results\n",
        ")\n",
        "data_objects = pd.DataFrame(data_object_records)\n",
        "data_object_ids = data_objects.dropna(subset=[\"id\"])[\"id\"].tolist()\n",
        "\n",
        "print(f\"Data objects found: {len(data_objects)}\")\n",
        "print(f\"Data object IDs collected: {len(data_object_ids)}\")\n",
        "data_objects.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "7cdc8c89c7104fffa095e18ddfef8986",
      "metadata": {
        "id": "ex2-step2-text",
        "language": "markdown"
      },
      "source": [
        "### Step 2: Resolve linked biosample IDs\n",
        "\n",
        "Build a mapping from each data object ID to biosample IDs using linked instances."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "id": "b118ea5561624da68c537baed56e602f",
      "metadata": {
        "id": "ex2-step2-code",
        "language": "python"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Objects with biosample links: 25\n",
            "Unique biosample IDs: 25\n"
          ]
        }
      ],
      "source": [
        "associations = {}\n",
        "biosample_ids = []\n",
        "\n",
        "if data_object_ids:\n",
        "    associations = data_object_client.get_linked_instances_and_associate_ids(\n",
        "        ids=data_object_ids,\n",
        "        types=[\"nmdc:Biosample\"],\n",
        "        hydrate=False,\n",
        "    )\n",
        "    for data_object_id in data_object_ids:\n",
        "        associations.setdefault(data_object_id, [])\n",
        "\n",
        "    biosample_ids = sorted(\n",
        "        {\n",
        "            biosample_id\n",
        "            for linked_ids in associations.values()\n",
        "            for biosample_id in linked_ids\n",
        "        }\n",
        "    )\n",
        "\n",
        "print(f\"Objects with biosample links: {len(associations)}\")\n",
        "print(f\"Unique biosample IDs: {len(biosample_ids)}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "938c804e27f84196a10c8828c723f798",
      "metadata": {
        "id": "ex2-step3-text",
        "language": "markdown"
      },
      "source": [
        "### Step 3: Fetch biosample metadata and attach by data object\n",
        "\n",
        "Retrieve biosample records and build a per-data-object metadata mapping.  Similar to the `get_linked_instances` method, the `get_records_by_id` method is also available across multiple clients in `nmdc_client` and can be used to retrieve fully hydrated metadata records for any list of IDs, even if those IDs are not linked to each other or do not belong to a common collection."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "id": "504fb2a444614c0babb325280ed9130a",
      "metadata": {
        "id": "ex2-step3-code",
        "language": "python"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Biosamples fetched: 25\n",
            "Objects with mapped biosamples: 25\n"
          ]
        }
      ],
      "source": [
        "biosample_records = []\n",
        "if biosample_ids:\n",
        "    biosample_records = data_object_client.get_records_by_id(ids=biosample_ids)\n",
        "\n",
        "biosamples_by_id = {\n",
        "    record[\"id\"]: record\n",
        "    for record in biosample_records\n",
        "    if \"id\" in record\n",
        "}\n",
        "\n",
        "biosamples_by_data_object = {}\n",
        "for data_object_id in data_object_ids:\n",
        "    linked_ids = associations.get(data_object_id, [])\n",
        "    biosamples_by_data_object[data_object_id] = [\n",
        "        biosamples_by_id[biosample_id]\n",
        "        for biosample_id in linked_ids\n",
        "        if biosample_id in biosamples_by_id\n",
        "    ]\n",
        "\n",
        "print(f\"Biosamples fetched: {len(biosample_records)}\")\n",
        "print(f\"Objects with mapped biosamples: {len(biosamples_by_data_object)}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "6867dca8",
      "metadata": {},
      "source": [
        "## Example 3: Study -> Biosamples -> ++ Biosamples\n",
        "\n",
        "**Goal:** Starting from a study of interest, increase the size of your database by searching for additional biosamples within a certain radius of those you've already found. \n",
        "\n",
        "### Step 1: Find biosamples from your study of interest"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "id": "3d7a9780",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Biosamples from original study: 104\n"
          ]
        }
      ],
      "source": [
        "from nmdc_client import StudySearch\n",
        "\n",
        "study_client = StudySearch()\n",
        "\n",
        "study_id = \"nmdc:sty-11-8xdqsn54\"\n",
        "\n",
        "studies = study_client.get_record_by_attribute(\n",
        "    attribute_name=\"id\",\n",
        "    attribute_value=study_id,\n",
        "    exact_match=True,\n",
        ")\n",
        "\n",
        "biosamples = study_client.get_linked_instances(\n",
        "    ids=[study_id],\n",
        "    types=[\"nmdc:Biosample\"],\n",
        "    hydrate=True,\n",
        ")\n",
        "biosample_ids = [record[\"id\"] for record in biosamples if \"id\" in record]\n",
        "\n",
        "print(f\"Biosamples from original study: {len(biosamples)}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "577df78a",
      "metadata": {},
      "source": [
        "### Step 2: Find additional biosamples within a radius of the original biosamples"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 9,
      "id": "7f81f627",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Additional biosamples found: 79\n"
          ]
        }
      ],
      "source": [
        "from nmdc_client import BiosampleSearch\n",
        "\n",
        "biosample_client = BiosampleSearch()\n",
        "\n",
        "add_biosamples = []\n",
        "for biosample_id in biosample_ids:\n",
        "    new_biosamples = biosample_client.get_record_by_proximity(\n",
        "        radius_km=2000,\n",
        "        record_id=biosample_id,\n",
        "        all_pages=False\n",
        "    )\n",
        "    for biosample in new_biosamples:\n",
        "        if biosample[\"id\"] not in biosample_ids and biosample[\"id\"] not in [b[\"id\"] for b in add_biosamples]:\n",
        "            add_biosamples.append(biosample)\n",
        "print(f\"Additional biosamples found: {len(add_biosamples)}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "59bbdb311c014d738909a11f9e486628",
      "metadata": {
        "id": "field-links",
        "language": "markdown"
      },
      "source": [
        "---\n",
        "## Search client selection and schema field names\n",
        "\n",
        "The above examples use the `DataObjectSearch` and `StudySearch` clients because the initial filtering was targeted to a specific collection metadata records (data_object_set and study_set, respectively).  To help orient yourself to which client to use for a given query, you can refer to the [typecode-to-class map in the NMDC Schema documentation](https://microbiomedata.github.io/nmdc-schema/typecode-to-class-map/), which shows which schema classes are associated with each typecode and therefore which clients will be able to filter by those schema fields.\n",
        "\n",
        "For additional query recipes and MongoDB-style filters, see the Filters page in this documentation set."
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "nmdc-api-utilities (3.13.5)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.13.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}