Quick Start
Notebook-style examples, using real nmdc_api_utilities modules.
Example 1: Study -> Biosamples -> Data Objects
Goal: Start from an NMDC study name and retrieve the matching study metadata records, then collect the metadata records of all biosamples and data objects linked to that study.
Step 1: Find matching study records
[1]:
from nmdc_api_utilities import StudySearch
study_client = StudySearch()
study_name = (
    "Molecular mechanisms underlying changes in the temperature sensitive "
    "respiration response of forest soils to long-term experimental warming"
)
studies = study_client.get_record_by_attribute(
    attribute_name="name",
    attribute_value=study_name,
    exact_match=True,
)
print(f"Studies found: {len(studies)}")
if studies:
    print(f"Study ID: {studies[0]['id']}")
Studies found: 1
Study ID: nmdc:sty-11-8ws97026
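A name search can return zero records or several, and indexing `studies[0]` silently picks the first. The sketch below shows one way to make that assumption explicit. It assumes records are plain dictionaries with an `id` key, as in the output above; `require_single_record` is a hypothetical helper, not part of nmdc_api_utilities, and the sample list is typed in by hand rather than fetched from the API.

```python
def require_single_record(records, label="record"):
    """Return the only record in `records`; raise if there are zero or several."""
    if len(records) != 1:
        ids = [record.get("id") for record in records]
        raise ValueError(f"Expected exactly one {label}, found {len(records)}: {ids}")
    return records[0]


# Hand-written sample shaped like the Step 1 result (not fetched from the API).
sample_studies = [{"id": "nmdc:sty-11-8ws97026"}]
study = require_single_record(sample_studies, label="study")
print(f"Study ID: {study['id']}")
```

Failing loudly here is a design choice: it avoids carrying the wrong study ID into the later linked-instances queries.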
Step 2: Get linked biosamples
If at least one study is found, use the first study ID to request linked biosample records. Called with hydrate=True, as below, the get_linked_instances method returns a list of fully hydrated biosample metadata records. The method is not limited to biosamples or to study IDs: it can retrieve linked records of any type for any list of IDs. It is available on the StudySearch and NMDCSearch clients, as well as on the other search clients in nmdc_api_utilities.
[2]:
biosamples = []
biosample_ids = []
if studies:
    study_id = studies[0]["id"]
    biosamples = study_client.get_linked_instances(
        ids=[study_id],
        types=["nmdc:Biosample"],
        hydrate=True,
    )
    biosample_ids = [record["id"] for record in biosamples if "id" in record]
print(f"Biosamples found: {len(biosamples)}")
print(f"Biosample IDs collected: {len(biosample_ids)}")
print(f"Example biosample record: \n{biosamples[0] if biosamples else 'No biosamples found'}")
Biosamples found: 42
Biosample IDs collected: 42
Example biosample record:
{'id': 'nmdc:bsm-11-127y7152', 'type': 'nmdc:Biosample', '_downstream_of': ['nmdc:sty-11-8ws97026'], 'name': 'BW-H-2-O', 'associated_studies': ['nmdc:sty-11-8ws97026'], 'env_broad_scale': {'type': 'nmdc:ControlledIdentifiedTermValue', 'has_raw_value': 'forest biome [ENVO:01000174]', 'term': {'id': 'ENVO:01000174', 'type': 'nmdc:OntologyClass', 'name': 'forest biome'}}, 'env_local_scale': {'type': 'nmdc:ControlledIdentifiedTermValue', 'has_raw_value': 'organic horizon [ENVO:03600018]', 'term': {'id': 'ENVO:03600018', 'type': 'nmdc:OntologyClass', 'name': 'organic horizon'}}, 'env_medium': {'type': 'nmdc:ControlledIdentifiedTermValue', 'has_raw_value': 'heat stressed soil [ENVO:00005781]', 'term': {'id': 'ENVO:00005781', 'type': 'nmdc:OntologyClass', 'name': 'heat stressed soil'}}, 'samp_name': 'BW-H-2-O', 'collection_date': {'type': 'nmdc:TimestampValue', 'has_raw_value': '2017-05-24'}, 'depth': {'type': 'nmdc:QuantityValue', 'has_raw_value': '0 - .02', 'has_maximum_numeric_value': 0.02, 'has_minimum_numeric_value': 0.0, 'has_unit': 'm'}, 'ecosystem': 'Environmental', 'ecosystem_category': 'Terrestrial', 'ecosystem_subtype': 'Temperate forest', 'ecosystem_type': 'Soil', 'elev': 302.0, 'env_package': {'type': 'nmdc:TextValue', 'has_raw_value': 'soil'}, 'experimental_factor': {'type': 'nmdc:ControlledTermValue', 'has_raw_value': 'heat stress treatment [MCO:0000172]', 'term': {'id': 'MCO:0000172', 'type': 'nmdc:OntologyClass', 'name': 'heat stress treatment'}}, 'geo_loc_name': {'type': 'nmdc:TextValue', 'has_raw_value': 'USA: Massachusetts, Petersham'}, 'growth_facil': {'type': 'nmdc:ControlledTermValue', 'has_raw_value': 'field_incubation'}, 'lat_lon': {'type': 'nmdc:GeolocationValue', 'has_raw_value': '42.481016 -72.178343', 'latitude': 42.481016, 'longitude': -72.178343}, 'samp_store_temp': {'type': 'nmdc:QuantityValue', 'has_raw_value': '-80 Celsius', 'has_numeric_value': -80.0, 'has_unit': 'Cel'}, 'specific_ecosystem': 'O horizon/Organic', 'store_cond': {'type': 
'nmdc:TextValue', 'has_raw_value': 'frozen'}, 'analysis_type': ['metatranscriptomics', 'natural organic matter', 'metaproteomics', 'metabolomics', 'lipidomics'], 'gold_biosample_identifiers': ['gold:Gb0158493']}
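Hydrated biosample records are deeply nested, so it can help to pull the fields you need into a flat dictionary. The following is a minimal sketch based only on the keys visible in the example record above; `summarize_biosample` is a hypothetical helper, and other biosamples may omit some of these keys, which is why every lookup uses `.get`.

```python
def summarize_biosample(record):
    """Flatten a few nested fields of a hydrated biosample record."""
    lat_lon = record.get("lat_lon", {})
    depth = record.get("depth", {})
    return {
        "id": record.get("id"),
        "name": record.get("name"),
        "latitude": lat_lon.get("latitude"),
        "longitude": lat_lon.get("longitude"),
        "collection_date": record.get("collection_date", {}).get("has_raw_value"),
        "depth_min": depth.get("has_minimum_numeric_value"),
        "depth_max": depth.get("has_maximum_numeric_value"),
        "depth_unit": depth.get("has_unit"),
    }


# Abbreviated copy of the example record printed above.
sample_biosample = {
    "id": "nmdc:bsm-11-127y7152",
    "name": "BW-H-2-O",
    "collection_date": {"type": "nmdc:TimestampValue", "has_raw_value": "2017-05-24"},
    "depth": {
        "type": "nmdc:QuantityValue",
        "has_maximum_numeric_value": 0.02,
        "has_minimum_numeric_value": 0.0,
        "has_unit": "m",
    },
    "lat_lon": {"latitude": 42.481016, "longitude": -72.178343},
}

summary = summarize_biosample(sample_biosample)
print(summary)
```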
Step 3: Get linked data objects
Use biosample IDs as seeds for another linked-instances query targeting nmdc:DataObject. For this example, we’ll just pull from a subset (5) of the biosample IDs to avoid pulling too many records, but in practice you could pull all linked data objects if desired.
[3]:
data_objects = []
if biosample_ids:
    data_objects = study_client.get_linked_instances(
        ids=biosample_ids[:5],  # Just using the first 5 biosample IDs for this example
        types=["nmdc:DataObject"],
        hydrate=True,
    )
print(f"Data objects found: {len(data_objects)}")
if data_objects:
    print(f"\nExample data object \nID: {data_objects[0].get('id')}")
    print(f"Name: {data_objects[0].get('name')}")
    print(f"Description: {data_objects[0].get('description')}")
    print(f"Link for downloading: {data_objects[0].get('url')}")
Data objects found: 444
Example data object
ID: nmdc:dobj-11-w29y0f79
Name: nmdc_wfmtan-11-ssf5tv34.1_prodigal.gff
Description: Prodigal Annotations nmdc:wfmtan-11-ssf5tv34.1
Link for downloading: https://data.microbiomedata.org/data/nmdc:dgns-11-cgnpxt22/nmdc:wfmtan-11-ssf5tv34.1/nmdc_wfmtan-11-ssf5tv34.1_prodigal.gff
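To go beyond the first five biosamples without sending every ID in one request, one option is to split the ID list into fixed-size batches and query each batch in turn. The sketch below covers only the batching and de-duplication logic; `chunked` and `dedupe_by_id` are hypothetical helpers, the IDs are placeholders, and the batch size of 5 is an arbitrary choice rather than a limit documented by the API.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def dedupe_by_id(records):
    """Keep the first record seen for each 'id', preserving order."""
    seen = set()
    unique = []
    for record in records:
        record_id = record.get("id")
        if record_id not in seen:
            seen.add(record_id)
            unique.append(record)
    return unique


# Placeholder IDs for illustration only (not real NMDC identifiers).
sample_ids = [f"nmdc:bsm-00-example{i}" for i in range(12)]
batches = list(chunked(sample_ids, 5))
print([len(batch) for batch in batches])  # [5, 5, 2]
```

Each batch could then be passed as `ids=` to get_linked_instances as in the cell above, with the combined results run through `dedupe_by_id` in case the same data object is linked to biosamples in different batches.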
Example 2: Data Object Type -> Data Objects -> Biosample Metadata
Goal: Start from a data_object_type value, retrieve matching data objects, then resolve associated biosample metadata. See schema documentation for more details on the data_object_type property and its allowed values: https://microbiomedata.github.io/nmdc-schema/FileTypeEnum/.
Step 1: Retrieve data objects by type
[4]:
import json
from nmdc_api_utilities import DataObjectSearch
data_object_client = DataObjectSearch()
data_object_type = "Metagenome Raw Reads"
filter_str = json.dumps({"data_object_type": data_object_type})
data_objects = data_object_client.get_record_by_filter(
    filter=filter_str,
    all_pages=False,  # Set to True to retrieve all matching records across all pages of results
    shape="dataframe",  # Set to "records" to return a list of dictionaries instead of a DataFrame
)
data_object_ids = data_objects.dropna(subset=["id"])["id"].tolist()
print(f"Data objects found: {len(data_objects)}")
print(f"Data object IDs collected: {len(data_object_ids)}")
data_objects.head()
Data objects found: 25
Data object IDs collected: 25
[4]:
| | id | type | name | file_size_bytes | md5_checksum | data_object_type | was_generated_by | url | description | data_category | alternative_identifiers | in_manifest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:dobj-11-00pns528 | nmdc:DataObject | 52441.2.335479.CACGTTGT-ACAACGTG.fastq.gz | 1.002897e+10 | 0c70f5574024426432ea03eb3f130e01 | Metagenome Raw Reads | nmdc:omprc-11-rdvzce03 | https://data.microbiomedata.org/data/nmdc:ompr... | Metagenome Raw Reads for nmdc:omprc-11-rdvzce03 | instrument_data | NaN | NaN |
| 1 | nmdc:dobj-11-00xqnn30 | nmdc:DataObject | 12844.2.289969.GTTCAACC-GGTTGAAC.fastq.gz | 8.449439e+09 | c8b53dad43beabb80d768f81ec1f0f9b | Metagenome Raw Reads | nmdc:omprc-11-g8x3ed38 | https://data.microbiomedata.org/data/nmdc:ompr... | Metagenome Raw Reads for nmdc:omprc-11-g8x3ed38 | instrument_data | NaN | NaN |
| 2 | nmdc:dobj-11-00y67656 | nmdc:DataObject | 52437.1.333590.GAACGCTT-AAGCGTTC.fastq.gz | 1.097406e+10 | d56df876ceb5006d0d0546c8e67500ee | Metagenome Raw Reads | nmdc:omprc-11-3pn7ex35 | https://data.microbiomedata.org/data/nmdc:ompr... | Metagenome Raw Reads for nmdc:omprc-11-3pn7ex35 | instrument_data | NaN | NaN |
| 3 | nmdc:dobj-11-02dj2e39 | nmdc:DataObject | 52561.2.384837.AGTACCGT-CTAGACTG.fastq.gz | 8.848502e+09 | 9dd6311f11abe3a960796d2c227d9d19 | Metagenome Raw Reads | nmdc:omprc-11-dsv4yv97 | https://data.microbiomedata.org/data/nmdc:ompr... | Metagenome Raw Reads for nmdc:omprc-11-dsv4yv97 | instrument_data | NaN | NaN |
| 4 | nmdc:dobj-11-02n14844 | nmdc:DataObject | 52437.3.333700.ACCTCTGT-ACAGAGGT.fastq.gz | 1.100929e+10 | 1978fc04e0ce845651f0bdd4e0be1eb3 | Metagenome Raw Reads | nmdc:omprc-11-0w5g0a55 | https://data.microbiomedata.org/data/nmdc:ompr... | Metagenome Raw Reads for nmdc:omprc-11-0w5g0a55 | instrument_data | NaN | NaN |
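The filter passed to get_record_by_filter in Step 1 is a JSON string built with json.dumps. The sketch below shows how such a string might be assembled from several field/value pairs. `build_filter` is a hypothetical helper; the field names and values are taken from the table above, and treating several keys as a combined (AND) condition is an assumption based on MongoDB-style filter conventions, so check it against the Filters page before relying on it.

```python
import json


def build_filter(**fields):
    """Serialize field/value pairs to a JSON filter string, skipping None values."""
    return json.dumps({key: value for key, value in fields.items() if value is not None})


filter_str = build_filter(
    data_object_type="Metagenome Raw Reads",
    data_category="instrument_data",
    name=None,  # Dropped from the filter because the value is None
)
print(filter_str)
```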
Step 2: Resolve linked biosample IDs
Build a mapping from each data object ID to biosample IDs using linked instances.
[5]:
associations = {}
biosample_ids = []
if data_object_ids:
    associations = data_object_client.get_linked_instances_and_associate_ids(
        ids=data_object_ids,
        types=["nmdc:Biosample"],
        hydrate=False,
    )
    # Ensure every data object ID has an entry, even if no biosample was linked
    for data_object_id in data_object_ids:
        associations.setdefault(data_object_id, [])
    # Collect the unique biosample IDs across all data objects
    biosample_ids = sorted(
        {
            biosample_id
            for linked_ids in associations.values()
            for biosample_id in linked_ids
        }
    )
print(f"Objects with biosample links: {len(associations)}")
print(f"Unique biosample IDs: {len(biosample_ids)}")
Objects with biosample links: 25
Unique biosample IDs: 25
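The mapping built above runs from data object ID to biosample IDs. For some analyses the reverse direction is more convenient, for example to list every data object derived from one biosample. A minimal sketch follows, assuming the mapping is a dictionary of ID to list of linked IDs as used in the cell above; `invert_associations` is a hypothetical helper and the IDs are placeholders.

```python
from collections import defaultdict


def invert_associations(associations):
    """Turn {source_id: [linked_id, ...]} into {linked_id: [source_id, ...]}."""
    inverted = defaultdict(list)
    for source_id, linked_ids in associations.items():
        for linked_id in linked_ids:
            inverted[linked_id].append(source_id)
    return dict(inverted)


# Placeholder IDs for illustration only (not real NMDC identifiers).
sample_associations = {
    "nmdc:dobj-00-a": ["nmdc:bsm-00-x"],
    "nmdc:dobj-00-b": ["nmdc:bsm-00-x", "nmdc:bsm-00-y"],
    "nmdc:dobj-00-c": [],  # No linked biosample
}
data_objects_by_biosample = invert_associations(sample_associations)
print(data_objects_by_biosample)
```

Data objects with no linked biosample simply do not appear in the inverted mapping.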
Step 3: Fetch biosample metadata and attach by data object
Retrieve biosample records and build a per-data-object metadata mapping. Similar to the get_linked_instances method, the get_records_by_id method is also available across multiple clients in nmdc_api_utilities and can be used to retrieve fully hydrated metadata records for any list of IDs, even if those IDs are not linked to each other or do not belong to a common collection.
[6]:
biosample_records = []
if biosample_ids:
    biosample_records = data_object_client.get_records_by_id(ids=biosample_ids)
biosamples_by_id = {
    record["id"]: record
    for record in biosample_records
    if "id" in record
}
biosamples_by_data_object = {}
for data_object_id in data_object_ids:
    linked_ids = associations.get(data_object_id, [])
    biosamples_by_data_object[data_object_id] = [
        biosamples_by_id[biosample_id]
        for biosample_id in linked_ids
        if biosample_id in biosamples_by_id
    ]
print(f"Biosamples fetched: {len(biosample_records)}")
print(f"Objects with mapped biosamples: {len(biosamples_by_data_object)}")
Biosamples fetched: 25
Objects with mapped biosamples: 25
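A per-data-object mapping like the one above is often easier to analyze as a flat table with one row per (data object, biosample) pair. The sketch below builds such rows from a small hand-written mapping; `to_rows` is a hypothetical helper, the IDs are placeholders, and the biosample fields chosen (`ecosystem_type`, `geo_loc_name`) are examples taken from the record shown in Example 1.

```python
def to_rows(biosamples_by_data_object):
    """Build one flat row per (data object, linked biosample) pair."""
    rows = []
    for data_object_id, biosamples in biosamples_by_data_object.items():
        for biosample in biosamples:
            rows.append(
                {
                    "data_object_id": data_object_id,
                    "biosample_id": biosample.get("id"),
                    "ecosystem_type": biosample.get("ecosystem_type"),
                    "geo_loc_name": biosample.get("geo_loc_name", {}).get("has_raw_value"),
                }
            )
    return rows


# Placeholder mapping shaped like biosamples_by_data_object above.
sample_mapping = {
    "nmdc:dobj-00-a": [
        {
            "id": "nmdc:bsm-00-x",
            "ecosystem_type": "Soil",
            "geo_loc_name": {"has_raw_value": "USA: Massachusetts, Petersham"},
        }
    ],
    "nmdc:dobj-00-b": [],  # No linked biosample, so no row is produced
}
rows = to_rows(sample_mapping)
print(rows)
```

A list of flat dictionaries like this can be passed directly to `pandas.DataFrame(rows)` if a DataFrame is preferred.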
Example 3: Study -> Biosamples -> Additional Biosamples
Goal: Starting from a study of interest, expand your dataset by searching for additional biosamples within a given radius of those you’ve already found.
Step 1: Find biosamples from your study of interest
[7]:
from nmdc_api_utilities import StudySearch
study_client = StudySearch()
study_id = "nmdc:sty-11-8xdqsn54"
studies = study_client.get_record_by_attribute(
    attribute_name="id",
    attribute_value=study_id,
    exact_match=True,
)
biosamples = study_client.get_linked_instances(
    ids=[study_id],
    types=["nmdc:Biosample"],
    hydrate=True,
)
biosample_ids = [record["id"] for record in biosamples if "id" in record]
print(f"Biosamples from original study: {len(biosamples)}")
Biosamples from original study: 104
Step 2: Find additional biosamples within a radius of the original biosamples
[8]:
from nmdc_api_utilities import BiosampleSearch
biosample_client = BiosampleSearch()
add_biosamples = []
seen_ids = set(biosample_ids)  # IDs already collected, for fast duplicate checks
for biosample_id in biosample_ids:
    new_biosamples = biosample_client.get_record_by_proximity(
        radius_km=2000,
        record_id=biosample_id,
    )
    for biosample in new_biosamples:
        # Skip biosamples from the original study and any already added
        if biosample["id"] not in seen_ids:
            seen_ids.add(biosample["id"])
            add_biosamples.append(biosample)
print(f"Additional biosamples found: {len(add_biosamples)}")
Additional biosamples found: 79
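To sanity-check a proximity result locally, you can compute the great-circle distance between two biosamples from their `lat_lon` fields. The sketch below uses the haversine formula with a mean Earth radius of 6371 km. How the API itself measures distance is not stated here, so treat this as an independent approximation; `haversine_km` and `distance_between` are hypothetical helpers, and the second sample point is invented for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # Mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_between(biosample_a, biosample_b):
    """Distance in kilometers between two records that carry a 'lat_lon' field."""
    a, b = biosample_a["lat_lon"], biosample_b["lat_lon"]
    return haversine_km(a["latitude"], a["longitude"], b["latitude"], b["longitude"])


# First point is the biosample from Example 1; the second is a made-up point
# one degree of latitude further north.
original = {"lat_lon": {"latitude": 42.481016, "longitude": -72.178343}}
candidate = {"lat_lon": {"latitude": 43.481016, "longitude": -72.178343}}
print(f"Distance: {distance_between(original, candidate):.1f} km")
```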
Search client selection and schema field names
The examples above use the DataObjectSearch and StudySearch clients because the initial filtering targeted a specific collection of metadata records (data_object_set and study_set, respectively). To decide which client to use for a given query, refer to the typecode-to-class map in the NMDC Schema documentation, which shows which schema classes are associated with each typecode and therefore which clients can filter by those schema fields.
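As a small illustration of the typecode idea, the sketch below extracts the typecode from an NMDC ID and looks up a client name. The three entries in the map are only those that can be read off the IDs and clients used in the examples on this page (sty, bsm, dobj); it is not the full typecode-to-class map, and `typecode_of` and `suggest_client` are hypothetical helpers.

```python
import re

# Partial map inferred from the IDs shown in the examples above; the full
# typecode-to-class map lives in the NMDC Schema documentation.
TYPECODE_TO_CLIENT = {
    "sty": "StudySearch",
    "bsm": "BiosampleSearch",
    "dobj": "DataObjectSearch",
}

_ID_PATTERN = re.compile(r"^nmdc:([a-z]+)-")


def typecode_of(nmdc_id):
    """Return the typecode of an ID such as 'nmdc:sty-11-8ws97026', or None."""
    match = _ID_PATTERN.match(nmdc_id)
    return match.group(1) if match else None


def suggest_client(nmdc_id):
    """Return a client class name for the ID's typecode, or None if unmapped."""
    return TYPECODE_TO_CLIENT.get(typecode_of(nmdc_id))


for example_id in ["nmdc:sty-11-8ws97026", "nmdc:bsm-11-127y7152", "nmdc:dobj-11-w29y0f79"]:
    print(example_id, "->", suggest_client(example_id))
```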
For additional query recipes and MongoDB-style filters, see the Filters page in this documentation set.