{ "cells": [ { "cell_type": "markdown", "id": "7fb27b941602401d91542211134fc71a", "metadata": { "id": "intro-quick-start", "language": "markdown" }, "source": [ "# Quick Start\n", "\n", "Notebook-style examples, using real `nmdc_client` modules.\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "acae54e37e7d407bbb7b55eff062a284", "metadata": { "id": "ex1-overview", "language": "markdown" }, "source": [ "## Example 1: Study -> Biosamples -> Data Objects\n", "\n", "**Goal:** Start from an NMDC study name and retrieve the matching study metadata records. Then collect the metadata records of all biosamples and data objects linked to that study.\n", "\n", "### Step 1: Find matching study records" ] }, { "cell_type": "code", "execution_count": 1, "id": "9a63283cbaf04dbcab1f6479b197f3a8", "metadata": { "id": "ex1-step1", "language": "python" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Studies found: 1\n", "Study ID: nmdc:sty-11-8ws97026\n" ] } ], "source": [ "from nmdc_client import StudySearch\n", "\n", "study_client = StudySearch()\n", "\n", "study_name = (\n", "    \"Molecular mechanisms underlying changes in the temperature sensitive \"\n", "    \"respiration response of forest soils to long-term experimental warming\"\n", ")\n", "\n", "studies = study_client.get_record_by_attribute(\n", "    attribute_name=\"name\",\n", "    attribute_value=study_name,\n", "    exact_match=True,\n", ")\n", "\n", "print(f\"Studies found: {len(studies)}\")\n", "if studies:\n", "    print(f\"Study ID: {studies[0]['id']}\")" ] }, { "cell_type": "markdown", "id": "8dd0d8092fe74a7c96281538738b07e2", "metadata": { "id": "ex1-step2-text", "language": "markdown" }, "source": [ "### Step 2: Get linked biosamples\n", "\n", "If at least one study is found, use the first study ID to request linked biosample records. Called this way, the `get_linked_instances` method returns a list of fully hydrated biosample metadata records. 
Note that the `get_linked_instances` method can be used to retrieve linked records of any type, not just biosamples, that are associated with any list of IDs (not just study IDs). It is available on both the `StudySearch` and `NMDCSearch` clients, as well as all the additional search clients that are available in `nmdc_client`." ] }, { "cell_type": "code", "execution_count": 2, "id": "72eea5119410473aa328ad9291626812", "metadata": { "id": "ex1-step2-code", "language": "python" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Biosamples found: 42\n", "Biosample IDs collected: 42\n", "Example biosample record: \n", "{'id': 'nmdc:bsm-11-7twwzs96', 'type': 'nmdc:Biosample', '_downstream_of': ['nmdc:sty-11-8ws97026'], 'name': 'Inc-BW-C-30-O', 'associated_studies': ['nmdc:sty-11-8ws97026'], 'env_broad_scale': {'type': 'nmdc:ControlledIdentifiedTermValue', 'has_raw_value': 'forest biome [ENVO:01000174]', 'term': {'id': 'ENVO:01000174', 'type': 'nmdc:OntologyClass', 'name': 'forest biome'}}, 'env_local_scale': {'type': 'nmdc:ControlledIdentifiedTermValue', 'has_raw_value': 'organic horizon [ENVO:03600018]', 'term': {'id': 'ENVO:03600018', 'type': 'nmdc:OntologyClass', 'name': 'organic horizon'}}, 'env_medium': {'type': 'nmdc:ControlledIdentifiedTermValue', 'has_raw_value': 'forest soil [ENVO:00002261]', 'term': {'id': 'ENVO:00002261', 'type': 'nmdc:OntologyClass', 'name': 'forest soil'}}, 'samp_name': 'Inc-BW-C-30-O', 'collection_date': {'type': 'nmdc:TimestampValue', 'has_raw_value': '2017-05-24'}, 'depth': {'type': 'nmdc:QuantityValue', 'has_raw_value': '0 - .02', 'has_maximum_numeric_value': 0.02, 'has_minimum_numeric_value': 0.0, 'has_unit': 'm'}, 'ecosystem': 'Environmental', 'ecosystem_category': 'Terrestrial', 'ecosystem_subtype': 'Temperate forest', 'ecosystem_type': 'Soil', 'elev': 302.0, 'env_package': {'type': 'nmdc:TextValue', 'has_raw_value': 'soil'}, 'experimental_factor': {'type': 'nmdc:ControlledTermValue', 'has_raw_value': 
'dideuterium oxide [CHEBI:41981]', 'term': {'id': 'CHEBI:41981', 'type': 'nmdc:OntologyClass', 'name': 'dideuterium oxide'}}, 'geo_loc_name': {'type': 'nmdc:TextValue', 'has_raw_value': 'USA: Massachusetts, Petersham'}, 'growth_facil': {'type': 'nmdc:ControlledTermValue', 'has_raw_value': 'lab_incubation'}, 'lat_lon': {'type': 'nmdc:GeolocationValue', 'has_raw_value': '42.481016 -72.178343', 'latitude': 42.481016, 'longitude': -72.178343}, 'samp_store_temp': {'type': 'nmdc:QuantityValue', 'has_raw_value': '-80 Celsius', 'has_numeric_value': -80.0, 'has_unit': 'Cel'}, 'specific_ecosystem': 'O horizon/Organic', 'store_cond': {'type': 'nmdc:TextValue', 'has_raw_value': 'frozen'}, 'analysis_type': ['metagenomics', 'metatranscriptomics'], 'sample_link': ['nmdc:bsm-11-3n3k2m62'], 'gold_biosample_identifiers': ['gold:Gb0157183']}\n" ] } ], "source": [ "biosamples = []\n", "biosample_ids = []\n", "\n", "if studies:\n", "    study_id = studies[0][\"id\"]\n", "    biosamples = study_client.get_linked_instances(\n", "        ids=[study_id],\n", "        types=[\"nmdc:Biosample\"],\n", "        hydrate=True,\n", "    )\n", "    biosample_ids = [record[\"id\"] for record in biosamples if \"id\" in record]\n", "\n", "print(f\"Biosamples found: {len(biosamples)}\")\n", "print(f\"Biosample IDs collected: {len(biosample_ids)}\")\n", "print(f\"Example biosample record: \\n{biosamples[0] if biosamples else 'No biosamples found'}\")" ] }, { "cell_type": "markdown", "id": "8edb47106e1a46a883d545849b8ab81b", "metadata": { "id": "ex1-step3-text", "language": "markdown" }, "source": [ "### Step 3: Get linked data objects\n", "\n", "Use the biosample IDs as seeds for another linked-instances query, this time targeting `nmdc:DataObject`. To keep the result set small, this example queries only the first five biosample IDs; in practice you could pass all of them to retrieve every linked data object." 
] }, { "cell_type": "code", "execution_count": 3, "id": "10185d26023b46108eb7d9f57d49d2b3", "metadata": { "id": "ex1-step3-code", "language": "python" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data objects found: 507\n", "\n", "Example data object \n", "ID: nmdc:dobj-14-1e9qeq49\n", "Name: Hawkes_neg.ref\n", "Description: Natural organic matter negative electrospray mode reference calibration file for Calibration object nmdc:calib-14-hhn3qb47\n", "Link for downloading: https://nmdcdemo.emsl.pnnl.gov/nom/reference_calibration_files/Hawkes_neg.ref\n" ] } ], "source": [ "data_objects = []\n", "\n", "if biosample_ids:\n", "    data_objects = study_client.get_linked_instances(\n", "        ids=biosample_ids[:5],  # Use only the first 5 biosample IDs for this example\n", "        types=[\"nmdc:DataObject\"],\n", "        hydrate=True,\n", "    )\n", "\n", "print(f\"Data objects found: {len(data_objects)}\")\n", "if data_objects:\n", "    print(f\"\\nExample data object \\nID: {data_objects[0].get('id')}\")\n", "    print(f\"Name: {data_objects[0].get('name')}\")\n", "    print(f\"Description: {data_objects[0].get('description')}\")\n", "    print(f\"Link for downloading: {data_objects[0].get('url')}\")" ] }, { "cell_type": "markdown", "id": "8763a12b2bbd4a93a75aff182afb95dc", "metadata": { "id": "ex2-overview", "language": "markdown" }, "source": [ "---\n", "\n", "## Example 2: Data Object Type -> Data Objects -> Biosample Metadata\n", "\n", "**Goal:** Start from a `data_object_type` value, retrieve matching data objects, then resolve associated biosample metadata. 
See schema documentation for more details on the `data_object_type` property and its allowed values: https://microbiomedata.github.io/nmdc-schema/FileTypeEnum/.\n", "\n", "### Step 1: Retrieve data objects by type" ] }, { "cell_type": "code", "execution_count": 4, "id": "7623eae2785240b9bd12b16a66d81610", "metadata": { "id": "ex2-step1", "language": "python" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data objects found: 25\n", "Data object IDs collected: 25\n" ] }, { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "int64", "type": "integer" }, { "name": "id", "rawType": "str", "type": "string" }, { "name": "type", "rawType": "str", "type": "string" }, { "name": "name", "rawType": "str", "type": "string" }, { "name": "file_size_bytes", "rawType": "float64", "type": "float" }, { "name": "md5_checksum", "rawType": "str", "type": "string" }, { "name": "data_object_type", "rawType": "str", "type": "string" }, { "name": "was_generated_by", "rawType": "str", "type": "string" }, { "name": "url", "rawType": "str", "type": "string" }, { "name": "description", "rawType": "str", "type": "string" }, { "name": "data_category", "rawType": "str", "type": "string" }, { "name": "alternative_identifiers", "rawType": "object", "type": "unknown" }, { "name": "in_manifest", "rawType": "object", "type": "unknown" } ], "ref": "54b290b5-35b9-4723-981a-daaf0ef96e96", "rows": [ [ "0", "nmdc:dobj-11-00pns528", "nmdc:DataObject", "52441.2.335479.CACGTTGT-ACAACGTG.fastq.gz", "10028966274.0", "0c70f5574024426432ea03eb3f130e01", "Metagenome Raw Reads", "nmdc:omprc-11-rdvzce03", "https://data.microbiomedata.org/data/nmdc:omprc-11-rdvzce03/nmdc:omprc-11-rdvzce03/52441.2.335479.CACGTTGT-ACAACGTG.fastq.gz", "Metagenome Raw Reads for nmdc:omprc-11-rdvzce03", "instrument_data", null, null ], [ "1", "nmdc:dobj-11-00xqnn30", "nmdc:DataObject", "12844.2.289969.GTTCAACC-GGTTGAAC.fastq.gz", "8449439373.0", 
"c8b53dad43beabb80d768f81ec1f0f9b", "Metagenome Raw Reads", "nmdc:omprc-11-g8x3ed38", "https://data.microbiomedata.org/data/nmdc:omprc-11-g8x3ed38/nmdc:omprc-11-g8x3ed38/12844.2.289969.GTTCAACC-GGTTGAAC.fastq.gz", "Metagenome Raw Reads for nmdc:omprc-11-g8x3ed38", "instrument_data", null, null ], [ "2", "nmdc:dobj-11-00y67656", "nmdc:DataObject", "52437.1.333590.GAACGCTT-AAGCGTTC.fastq.gz", "10974063112.0", "d56df876ceb5006d0d0546c8e67500ee", "Metagenome Raw Reads", "nmdc:omprc-11-3pn7ex35", "https://data.microbiomedata.org/data/nmdc:omprc-11-3pn7ex35/nmdc:omprc-11-3pn7ex35/52437.1.333590.GAACGCTT-AAGCGTTC.fastq.gz", "Metagenome Raw Reads for nmdc:omprc-11-3pn7ex35", "instrument_data", null, null ], [ "3", "nmdc:dobj-11-02dj2e39", "nmdc:DataObject", "52561.2.384837.AGTACCGT-CTAGACTG.fastq.gz", "8848502361.0", "9dd6311f11abe3a960796d2c227d9d19", "Metagenome Raw Reads", "nmdc:omprc-11-dsv4yv97", "https://data.microbiomedata.org/data/nmdc:omprc-11-dsv4yv97/nmdc:omprc-11-dsv4yv97/52561.2.384837.AGTACCGT-CTAGACTG.fastq.gz", "Metagenome Raw Reads for nmdc:omprc-11-dsv4yv97", "instrument_data", null, null ], [ "4", "nmdc:dobj-11-02n14844", "nmdc:DataObject", "52437.3.333700.ACCTCTGT-ACAGAGGT.fastq.gz", "11009293694.0", "1978fc04e0ce845651f0bdd4e0be1eb3", "Metagenome Raw Reads", "nmdc:omprc-11-0w5g0a55", "https://data.microbiomedata.org/data/nmdc:omprc-11-0w5g0a55/nmdc:omprc-11-0w5g0a55/52437.3.333700.ACCTCTGT-ACAGAGGT.fastq.gz", "Metagenome Raw Reads for nmdc:omprc-11-0w5g0a55", "instrument_data", null, null ] ], "shape": { "columns": 12, "rows": 5 } }, "text/html": [ "
|  | id | type | name | file_size_bytes | md5_checksum | data_object_type | was_generated_by | url | description | data_category | alternative_identifiers | in_manifest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:dobj-11-00pns528 | nmdc:DataObject | 52441.2.335479.CACGTTGT-ACAACGTG.fastq.gz | 1.002897e+10 | 0c70f5574024426432ea03eb3f130e01 | Metagenome Raw Reads | nmdc:omprc-11-rdvzce03 | https://data.microbiomedata.org/data/nmdc:ompr... | Metagenome Raw Reads for nmdc:omprc-11-rdvzce03 | instrument_data | NaN | NaN |
| 1 | nmdc:dobj-11-00xqnn30 | nmdc:DataObject | 12844.2.289969.GTTCAACC-GGTTGAAC.fastq.gz | 8.449439e+09 | c8b53dad43beabb80d768f81ec1f0f9b | Metagenome Raw Reads | nmdc:omprc-11-g8x3ed38 | https://data.microbiomedata.org/data/nmdc:ompr... | Metagenome Raw Reads for nmdc:omprc-11-g8x3ed38 | instrument_data | NaN | NaN |
| 2 | nmdc:dobj-11-00y67656 | nmdc:DataObject | 52437.1.333590.GAACGCTT-AAGCGTTC.fastq.gz | 1.097406e+10 | d56df876ceb5006d0d0546c8e67500ee | Metagenome Raw Reads | nmdc:omprc-11-3pn7ex35 | https://data.microbiomedata.org/data/nmdc:ompr... | Metagenome Raw Reads for nmdc:omprc-11-3pn7ex35 | instrument_data | NaN | NaN |
| 3 | nmdc:dobj-11-02dj2e39 | nmdc:DataObject | 52561.2.384837.AGTACCGT-CTAGACTG.fastq.gz | 8.848502e+09 | 9dd6311f11abe3a960796d2c227d9d19 | Metagenome Raw Reads | nmdc:omprc-11-dsv4yv97 | https://data.microbiomedata.org/data/nmdc:ompr... | Metagenome Raw Reads for nmdc:omprc-11-dsv4yv97 | instrument_data | NaN | NaN |
| 4 | nmdc:dobj-11-02n14844 | nmdc:DataObject | 52437.3.333700.ACCTCTGT-ACAGAGGT.fastq.gz | 1.100929e+10 | 1978fc04e0ce845651f0bdd4e0be1eb3 | Metagenome Raw Reads | nmdc:omprc-11-0w5g0a55 | https://data.microbiomedata.org/data/nmdc:ompr... | Metagenome Raw Reads for nmdc:omprc-11-0w5g0a55 | instrument_data | NaN | NaN |