Core Search API Reference
This page documents public, reusable core classes. Collection-specific subclasses are documented in CollectionSearch Subclasses.
NMDC Search Base
The foundational class for cross-collection queries and linked-instance retrieval. Use this class for custom workflows that span multiple schema classes.
- class nmdc_api_utilities.nmdc_search.NMDCSearch(api_base_url='https://api.microbiomedata.org', env='')[source]
Bases: NMDCAPIClient
Class for interacting with the NMDC Runtime API for searching and retrieving records in the NMDC metadata database.
- Parameters:
api_base_url (str, default: 'https://api.microbiomedata.org') – The base URL of an instance of the NMDC Runtime API. By default, this is the base URL of the production instance. NMDC team members will occasionally set this to the base URL of a different instance; for example, a self-hosted instance used for testing.
- get_collection_name_from_id(doc_id)[source]
Determine which collection a document is stored in. Use this when you have an id but not the collection name.
- Parameters:
doc_id (str) – The id of the document.
- Returns:
The collection name of the document.
- Return type:
str
- Raises:
RuntimeError – If the API request fails.
- get_linked_instances(ids, hydrate=False, types=None, max_page_size=500)[source]
Retrieve linked instances for the given IDs from the NMDC API.
This method returns a list of linked instance records for the given IDs. For instance, if you provide a study ID, this returns records from the biosample_set, data_generation_set, etc. that are associated with that study, even if the association is not represented by a single direct link.
See get_linked_instances_and_associate_ids for a method that returns an alternate format of the data.
- Parameters:
ids (list[str] | str) – The ids to search for.
hydrate (bool, default: False) – Whether to include full documents in the response.
types (list[str] | str | None, default: None) – The types of records to return. If omitted or None, linked instances of all types are returned. Example: ["nmdc:Study", "nmdc:Biosample", "nmdc:MassSpectrometry"].
max_page_size (int, default: 500) – The maximum number of records to return per page.
- Returns:
A list of linked instance records.
- Return type:
list[dict]
- get_linked_instances_and_associate_ids(ids, types=None, hydrate=False, max_page_size=500)[source]
Retrieve linked instances for the given IDs from the NMDC API and associate them with the input IDs.
This method returns the records that are linked to the records with the given IDs. For instance, if you provide an ID for a study record, this can return the ids of records within the biosample_set, data_generation_set, etc. that are associated with that study, even if the association is not a single direct link between records.
See also get_linked_instances for a method that returns the linked instances in their original list format. This method reformats the results into a dictionary whose keys are the query ids and whose values are either a list of the resulting linked ids or a list of hydrated records.
- Parameters:
ids (list[str] | str) – The ids to search for.
types (list[str] | str | None, default: None) – The types of instances to return. If types is None, all types are returned.
hydrate (bool, default: False) – Whether to include full documents in the response.
max_page_size (int, default: 500) – The maximum number of records to return per page.
- Returns:
A dictionary mapping each input id to a list of its linked instance records.
- Return type:
dict[str,list[dict|str]]
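The reshaping described above, from a flat list of linked records to a dictionary keyed by query id, can be sketched as follows. This is only an illustration and not the library's implementation: the `linked_to` field that ties each record back to its query ids is a hypothetical name chosen for the example, as is the helper `associate_ids`.

```python
def associate_ids(query_ids, linked_records, hydrate=False):
    """Group linked records under the query ids they are linked to.

    Assumption: each record carries a hypothetical "linked_to" list naming
    the query ids it is associated with.
    """
    grouped = {query_id: [] for query_id in query_ids}
    for record in linked_records:
        for query_id in record.get("linked_to", []):
            if query_id in grouped:
                # Full record when hydrating, otherwise just its id.
                grouped[query_id].append(record if hydrate else record["id"])
    return grouped


# Made-up records for illustration only.
example_records = [
    {"id": "nmdc:bsm-example-1", "linked_to": ["nmdc:sty-example-1"]},
    {"id": "nmdc:dobj-example-1", "linked_to": ["nmdc:sty-example-1"]},
]
ids_only = associate_ids(["nmdc:sty-example-1"], example_records)
hydrated = associate_ids(["nmdc:sty-example-1"], example_records, hydrate=True)
```

With `hydrate=False` the values are lists of id strings; with `hydrate=True` they are lists of full record dictionaries, which matches the `dict[str, list[dict | str]]` return type.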
- get_record_from_id(id, filter='', fields='')[source]
Retrieve a record via the NMDC API from a provided record ID.
- Parameters:
id (str) – The ID of the record to retrieve.
filter (str, default: '') – Additional filter to apply to the records. If empty, no additional filter is applied.
fields (str, default: '') – Comma-separated list of fields to include in the response. If empty, all fields are returned.
- Returns:
The full record data.
- Return type:
dict
- get_records_by_id(ids, fields='')[source]
Retrieve records via the NMDC API from a provided list of record IDs.
The input ids can be from multiple collections. Input like ["nmdc:sty-11-8fb6t785", "nmdc:bsm-11-002vgm56", "nmdc:dobj-11-00095294"] is valid and will return each of these records in a list of dictionaries.
- Parameters:
ids (list[str] | str) – List of IDs of records to retrieve.
fields (str, default: '') – Comma-separated list of fields to include in the response. An empty string returns all fields.
- Returns:
The record(s) data.
- Return type:
list[dict]
Collection Search Base
Extends NMDCSearch with collection-focused query helpers.
Use this class for generic collection operations, or use CollectionSearch Subclasses for preconfigured collection targets.
- class nmdc_api_utilities.collection_search.CollectionSearch(collection_name, api_base_url='https://api.microbiomedata.org', env='')[source]
Bases: NMDCSearch
Class to interact with the NMDC API to search for records within a specified collection.
- Parameters:
collection_name (str) – The name of the collection to search within.
api_base_url (str, default: 'https://api.microbiomedata.org') – The base URL of an instance of the NMDC Runtime API. By default, this is the base URL of the production instance.
- check_ids_exist(ids, chunk_size=100, return_missing_ids=False)[source]
Check if specified IDs exist in the collection.
This method constructs a query to the API to filter the collection based on the given IDs, and checks if all IDs exist in the collection.
- Parameters:
ids (list[str]) – A list of IDs to check for in the collection.
chunk_size (int, default: 100) – The number of IDs to check in each query.
return_missing_ids (bool, default: False) – If True, and if any ids are missing from the collection, also return the list of IDs that do not exist in the collection.
- Returns:
True if all IDs exist in the collection, False otherwise. However, if return_missing_ids is True, returns a tuple whose first item is the aforementioned boolean value, and whose second item is a list of the IDs, if any, that don’t exist in the collection.
- Return type:
bool|tuple[bool,list[str]]
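The chunked existence check described above can be sketched without any network access. The example below is a minimal illustration under stated assumptions, not the library's code: `fetch_existing_ids` is a hypothetical callable standing in for the API request, and the `$in` filter on `id` is one plausible way to express the query in MongoDB syntax.

```python
import json


def chunked(items, chunk_size=100):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def build_id_filter(ids):
    """Build a MongoDB-style filter string matching any of the given ids."""
    return json.dumps({"id": {"$in": list(ids)}})


def check_ids_exist(ids, fetch_existing_ids, chunk_size=100, return_missing_ids=False):
    """Return True if every id is found; optionally also return the missing ids."""
    missing = []
    for chunk in chunked(ids, chunk_size):
        found = set(fetch_existing_ids(build_id_filter(chunk)))
        missing.extend(i for i in chunk if i not in found)
    all_exist = not missing
    return (all_exist, missing) if return_missing_ids else all_exist


# Stand-in for the remote collection (hypothetical ids).
FAKE_COLLECTION = {"nmdc:bsm-example-1", "nmdc:bsm-example-2"}


def fake_fetch(filter_str):
    wanted = json.loads(filter_str)["id"]["$in"]
    return [i for i in wanted if i in FAKE_COLLECTION]


result = check_ids_exist(
    ["nmdc:bsm-example-1", "nmdc:bsm-example-9"],
    fake_fetch,
    chunk_size=1,
    return_missing_ids=True,
)
```

Querying in chunks keeps each request small, which is presumably why the method exposes `chunk_size`.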
- get_batch_records(id_list, search_field, chunk_size=100, fields='')[source]
Get a batch of records from the collection that relate to input IDs.
This method retrieves records that include any of the IDs from the input list in a specified field (including fields other than id). For example, if records in a collection contain study IDs in a field called associated_studies, this method can retrieve all records that include any of the input study IDs in the associated_studies field.
- Parameters:
id_list (list) – A list of IDs to get records for.
search_field (str) – The field in which to search for the IDs.
chunk_size (int, default: 100) – The number of IDs to get in each query.
fields (str, default: '') – The fields to return. If empty or not provided, all fields are returned.
- Returns:
A list of dictionaries containing the records.
- Return type:
list[dict]
- get_record_by_attribute(attribute_name, attribute_value, max_page_size=25, fields='', all_pages=False, exact_match=False)[source]
Retrieve a record via the NMDC API by a specific attribute’s value.
- Parameters:
attribute_name (str) – The name of the attribute to filter by.
attribute_value (str) – The value of the attribute to filter by.
max_page_size (int, default: 25) – The number of records to return per page.
fields (str, default: '') – The fields to return. If empty, all fields are returned.
all_pages (bool, default: False) – True to return all pages. False to return the first page.
exact_match (bool, default: False) – Whether the attribute value must match exactly (True) or may match partially (False). With the default, False, a partial value is enough.
- Returns:
A list of dictionaries containing the records.
- Return type:
list[dict]
- get_record_by_filter(filter, max_page_size=25, fields='', all_pages=False)[source]
Retrieve a record via the NMDC API using a specified filter.
- Parameters:
filter (str) – The filter to use to query the collection. Must be in MongoDB query format. Example: {"name": "my record name"}. See the MongoDB query documentation for more on constructing filters.
max_page_size (int, default: 25) – The number of records to return per page.
fields (str, default: '') – The fields to return. If empty, all fields are returned. Example: "id,name,description,url,type"
all_pages (bool, default: False) – True to return all pages. False to return the first page.
- Returns:
A list of dictionaries containing the records.
- Return type:
list[dict]
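Because the filter argument is a string in MongoDB query format, one convenient way to produce it is to build a dictionary and serialize it with `json.dumps`, which avoids quoting mistakes. The sketch below only prepares the two string arguments; the field list is joined into the comma-separated form that `fields` expects. The variable names are illustrative.

```python
import json

# Build the filter as a dictionary, then serialize it to a JSON string.
name_filter = json.dumps({"name": "my record name"})

# Build the comma-separated field list from a Python list.
fields = ",".join(["id", "name", "description", "url", "type"])

# Round-trip check: the string parses back to the original dictionary.
parsed = json.loads(name_filter)
```

The resulting strings could then be passed as the filter and fields arguments of get_record_by_filter.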
- get_record_by_id(record_id=None, max_page_size=100, fields='', collection_id=None)[source]
Retrieve a record from the collection via the NMDC API using a specified ID.
- Parameters:
record_id (Optional[str], default: None) – The id of the record to retrieve from the collection. Optional only for backwards compatibility with the deprecated collection_id parameter.
max_page_size (int, default: 100) – The maximum number of records to return per page.
fields (str, default: '') – The fields to return. If empty, all fields are returned.
collection_id (Optional[str], default: None) – The id of the record to retrieve from the collection. This parameter is deprecated and will be removed in a future version. Use record_id instead.
- Returns:
A list of dictionaries containing the records.
- Return type:
list[dict]
- Raises:
RuntimeError – If the API request fails.
- get_records(filter='', max_page_size=100, fields='', all_pages=False)[source]
Retrieve records from the collection via the NMDC API.
- Parameters:
filter (str, default: '') – The filter to apply to the query. An empty string will return all records.
max_page_size (int, default: 100) – The maximum number of records to return per page.
fields (str, default: '') – The fields to return. An empty string will return all fields.
all_pages (bool, default: False) – True to return all pages. False to return the first page.
- Returns:
A list of dictionaries containing the records.
- Return type:
list[dict]
- Raises:
RuntimeError – If the API request fails.
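The difference between `all_pages=False` (first page only) and `all_pages=True` (follow every page) can be illustrated with a small loop. This is a sketch under assumptions, not the library's implementation: `fetch_page` is a hypothetical callable standing in for one API request, and the response keys `resources` and `next_page_token` are assumed for the example.

```python
def collect_records(fetch_page, all_pages=False):
    """Collect records from the first page, or from every page if all_pages is True."""
    records = []
    token = None
    while True:
        page = fetch_page(token)
        records.extend(page.get("resources", []))
        token = page.get("next_page_token")
        # Stop after one page unless asked for all, or when no pages remain.
        if not all_pages or not token:
            return records


# Two fake pages keyed by the token that requests them (illustrative data).
FAKE_PAGES = {
    None: {"resources": [{"id": "a"}, {"id": "b"}], "next_page_token": "t1"},
    "t1": {"resources": [{"id": "c"}]},
}

first_page_only = collect_records(lambda token: FAKE_PAGES[token])
everything = collect_records(lambda token: FAKE_PAGES[token], all_pages=True)
```

In this sketch, `max_page_size` would control how many records each call to `fetch_page` returns, while `all_pages` controls how many calls are made.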
Functional Search
Provides search utilities focused on functional annotation and related retrieval patterns.
- class nmdc_api_utilities.functional_search.FunctionalSearch(api_base_url='https://api.microbiomedata.org', env='')[source]
Bases: CollectionSearch
Class to interact with the NMDC API to search for records within the functional_annotation_agg collection.
- get_functional_annotations(annotation, annotation_type, page_size=25, fields='', all_pages=False)[source]
Retrieve records with specific annotation value and type.
- Parameters:
annotation (str) – The functional annotation value to query.
annotation_type (str) – The type of id to query. See Notes for more details.
page_size (int, default: 25) – The number of results to return per page.
fields (str, default: '') – The fields to return. If empty, all fields are returned. Example: "id,name"
all_pages (bool, default: False) – True to return all pages. False to return the first page.
- Returns:
A list of functional annotations.
- Return type:
list[dict]
- Raises:
ValueError – If the annotation_type is not one of the allowed types. See Notes for more details.
Notes
The annotation_type must be one of the following: "KEGG", "COG", "PFAM".
- supports_get_by_id = False
Latitude and Longitude Utilities
Provides geospatial helper methods for lat/lon-based filtering and coordinate handling.
- class nmdc_api_utilities.lat_long_filters.LatLongFilters[source]
Bases: ABC
Mixin class with methods to interact with collections that can be searched by latitude and longitude via the NMDC API.
- get_record_by_lat_long(lat_comparison, long_comparison, latitude, longitude, page_size=25, fields='', all_pages=False)[source]
Retrieve records by latitude and longitude filters via the NMDC API.
- Parameters:
lat_comparison (str) – The comparison to use to query the record for latitude. See Notes for more details.
long_comparison (str) – The comparison to use to query the record for longitude. See Notes for more details.
latitude (float) – The latitude of the record to query.
longitude (float) – The longitude of the record to query.
page_size (int, default: 25) – The number of results to return per page.
fields (str, default: '') – The fields to return. If empty, all fields are returned. Example: "id,name,description,type"
all_pages (bool, default: False) – True to return all pages. False to return the first page.
- Returns:
A list of records.
- Return type:
list[dict]
- Raises:
ValueError – If the comparison is not one of the allowed comparisons.
Notes
lat_comparison and long_comparison must each be one of the following:
eq : Matches values that are equal to the given value.
gt : Matches values that are greater than the given value.
lt : Matches values that are less than the given value.
gte : Matches values that are greater than or equal to the given value.
lte : Matches values that are less than or equal to the given value.
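The comparison keywords listed in the Notes correspond naturally to MongoDB comparison operators. The sketch below shows one way such a filter could be assembled and validated. It is an illustration only: the field names `lat_lon.latitude` and `lat_lon.longitude` and the helper `build_lat_long_filter` are assumptions made for the example, not confirmed details of the library.

```python
import json

# Allowed comparison keywords mapped to MongoDB operators.
COMPARISON_OPERATORS = {
    "eq": "$eq",
    "gt": "$gt",
    "lt": "$lt",
    "gte": "$gte",
    "lte": "$lte",
}


def build_lat_long_filter(lat_comparison, long_comparison, latitude, longitude):
    """Build a MongoDB-style filter string for a latitude/longitude query."""
    for comparison in (lat_comparison, long_comparison):
        if comparison not in COMPARISON_OPERATORS:
            allowed = ", ".join(COMPARISON_OPERATORS)
            raise ValueError(f"comparison must be one of: {allowed}")
    return json.dumps(
        {
            # Field names are assumed for illustration.
            "lat_lon.latitude": {COMPARISON_OPERATORS[lat_comparison]: latitude},
            "lat_lon.longitude": {COMPARISON_OPERATORS[long_comparison]: longitude},
        }
    )


# Records north of 45 degrees latitude and west of -100 degrees longitude.
example_filter = build_lat_long_filter("gt", "lt", 45.0, -100.0)
```

Validating the keyword before building the filter mirrors the documented behavior of raising ValueError for a disallowed comparison.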
- get_record_by_latitude(comparison, latitude, page_size=25, fields='', all_pages=False)[source]
Retrieve records by latitude filter via the NMDC API.
- Parameters:
comparison (str) – The comparison to use to query the record. See Notes for more details.
latitude (float) – The latitude of the record to query.
page_size (int, default: 25) – The number of results to return per page.
fields (str, default: '') – The fields to return. If empty, all fields are returned. Example: "id,name,description,type"
all_pages (bool, default: False) – True to return all pages. False to return the first page.
- Returns:
A list of records.
- Return type:
list[dict]
- Raises:
ValueError – If the comparison is not one of the allowed comparisons.
Notes
The comparison must be one of the following: "eq", "gt", "lt", "gte", "lte".
eq : Matches values that are equal to the given value.
gt : Matches values that are greater than the given value.
lt : Matches values that are less than the given value.
gte : Matches values that are greater than or equal to the given value.
lte : Matches values that are less than or equal to the given value.
- get_record_by_longitude(comparison, longitude, page_size=25, fields='', all_pages=False)[source]
Retrieve records by longitude filter via the NMDC API.
- Parameters:
comparison (str) – The comparison to use to query the record. See Notes for more details.
longitude (float) – The longitude of the record to query.
page_size (int, default: 25) – The number of results to return per page.
fields (str, default: '') – The fields to return. If empty, all fields are returned. Example: "id,name,description,type"
all_pages (bool, default: False) – True to return all pages. False to return the first page.
- Returns:
A list of records.
- Return type:
list[dict]
- Raises:
ValueError – If the comparison is not one of the allowed comparisons.
Notes
The comparison must be one of the following: "eq", "gt", "lt", "gte", "lte".
eq : Matches values that are equal to the given value.
gt : Matches values that are greater than the given value.
lt : Matches values that are less than the given value.
gte : Matches values that are greater than or equal to the given value.
lte : Matches values that are less than or equal to the given value.
Data Processing Utilities
Provides helpers for transforming and reshaping query outputs.
- class nmdc_api_utilities.data_processing.DataProcessing[source]
Bases: object
- build_filter(attributes, exact_match=False)[source]
Create a MongoDB filter using $regex for each attribute in the input dictionary. For nested attributes, use dot notation.
- Parameters:
attributes (dict[str, str]) – Dictionary of attribute names and their corresponding values to match using regex. Example: {"name": "example", "description": "example", "geo_loc_name": "example"}
exact_match (bool, default: False) – Whether each attribute value must match exactly (True) or may match partially (False, the default). Internally, this determines whether the value is wrapped in a regex expression.
- Returns:
A string representing the MongoDB filter.
- Return type:
str
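To make the shape of such a filter concrete, here is a minimal sketch of a regex-based filter builder. It is not the library's implementation: the function name `build_regex_filter`, the use of `re.escape`, and the choice to emit a plain equality condition when `exact_match` is True are all assumptions made for illustration.

```python
import json
import re


def build_regex_filter(attributes, exact_match=False):
    """Build a MongoDB-style filter string from attribute/value pairs.

    Partial matching wraps each value in a $regex condition; exact matching
    (an assumption in this sketch) uses the plain value instead.
    """
    query = {}
    for name, value in attributes.items():
        if exact_match:
            query[name] = value
        else:
            # Escape regex metacharacters so the value is matched literally.
            query[name] = {"$regex": re.escape(value)}
    return json.dumps(query)


partial = build_regex_filter({"name": "soil", "geo_loc_name": "USA"})
exact = build_regex_filter({"name": "soil"}, exact_match=True)
# Nested attributes use dot notation in the key (hypothetical field name).
nested = build_regex_filter({"env_medium.term.name": "soil"})
```

Escaping the value matters when it contains characters such as `.` or `(`, which would otherwise be interpreted as regex syntax.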
- convert_to_df(data)[source]
Convert a list of dictionaries to a pandas dataframe.
- Parameters:
data (list[dict[str, Any]]) – A list of dictionaries.
- Returns:
A pandas dataframe representation of the input dictionaries.
- Return type:
DataFrame
- extract_field(api_results, field_name)[source]
Extract a specific field’s values from records retrieved via the NMDC API.
- Parameters:
api_results (list[dict[str, Any]]) – A list of dictionaries.
field_name (str) – The name of the field to extract.
- Returns:
A list of values for the specified field.
- Return type:
list[Any]
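Field extraction of this kind amounts to a list comprehension over the records. The sketch below is an illustration rather than the library's code; in particular, skipping records that lack the field is an assumption, since the documentation does not say how missing fields are handled.

```python
from typing import Any


def extract_field_values(api_results: list[dict[str, Any]], field_name: str) -> list[Any]:
    """Collect the value of field_name from each record that has it."""
    # Assumption: records without the field are skipped rather than raising.
    return [record[field_name] for record in api_results if field_name in record]


# Made-up records for illustration only.
records = [
    {"id": "nmdc:bsm-example-1", "name": "Sample one"},
    {"id": "nmdc:bsm-example-2"},
]
ids = extract_field_values(records, "id")
names = extract_field_values(records, "name")
```

A list of ids extracted this way is the kind of input that get_batch_records and check_ids_exist accept.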
- merge_dataframes(column, df1, df2)[source]
Merge two dataframes.
Wrapper around pandas.merge to merge two dataframes on a specified column using an inner join.
- Parameters:
column (str) – The column to merge on.
df1 (DataFrame) – The first dataframe to merge.
df2 (DataFrame) – The second dataframe to merge.
- Returns:
A pandas dataframe with the merged data.
- Return type:
DataFrame
- merge_df(df1, df2, key1, key2)[source]
Merges two dataframes using an inner join based on specified keys, automatically exploding list-like columns and removing duplicates.
Helpful for merging two sets of dataframe results obtained from the convert_to_df method.
- Parameters:
df1 (DataFrame) – The first dataframe to merge.
df2 (DataFrame) – The second dataframe to merge.
key1 (str) – The key in df1 to match with key2 in df2.
key2 (str) – The key in df2 to match with key1 in df1.
- Returns:
A pandas dataframe with the merged data.
- Return type:
DataFrame
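The combination of exploding list-like columns, inner joining, and removing duplicates can be hard to picture. The sketch below reproduces the idea in plain Python on lists of dictionaries, so it runs without pandas. It is an analogy under stated assumptions, not the library's implementation: the helper names and sample records are hypothetical, and every row is assumed to contain its join key.

```python
def explode(rows, key):
    """Turn a row whose key holds a list into one row per list element."""
    exploded = []
    for row in rows:
        value = row[key]
        values = value if isinstance(value, list) else [value]
        for item in values:
            exploded.append({**row, key: item})
    return exploded


def inner_join(rows1, rows2, key1, key2):
    """Inner join rows1 and rows2 where rows1[key1] equals rows2[key2]."""
    index = {}
    for row in explode(rows2, key2):
        index.setdefault(row[key2], []).append(row)
    merged = []
    for left in explode(rows1, key1):
        for right in index.get(left[key1], []):
            combined = {**right, **left}
            if combined not in merged:  # drop exact duplicates
                merged.append(combined)
    return merged


# Made-up rows: one biosample linked to two studies, only one of which is known.
biosamples = [{"biosample_id": "b1", "associated_studies": ["s1", "s2"]}]
studies = [{"id": "s1", "name": "Study one"}]
joined = inner_join(biosamples, studies, "associated_studies", "id")
```

Exploding first is what lets a record holding a list of study ids match each study individually; the inner join then drops ids (here `s2`) that have no counterpart.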
- rename_columns(df, new_col_names)[source]
Rename columns in a pandas dataframe.
- Parameters:
df (DataFrame) – The pandas dataframe whose columns will be renamed.
new_col_names (list[str]) – A list of new column names. Names MUST be in the same order as the columns in the dataframe. Example: if the current column names are ['old_col1', 'old_col2', 'old_col3'], pass the new names as ['new_col1', 'new_col2', 'new_col3'].
- Returns:
A pandas dataframe with renamed columns.
- Return type:
DataFrame
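Because the new names are matched to the existing columns purely by position, it can help to build and inspect the old-to-new mapping before renaming. The helper below is a hypothetical sketch, not part of the library; it pairs names positionally and rejects lists of the wrong length.

```python
def build_rename_mapping(current_names, new_names):
    """Pair existing column names with new names by position."""
    if len(current_names) != len(new_names):
        raise ValueError("new_col_names must have one entry per existing column")
    return dict(zip(current_names, new_names))


mapping = build_rename_mapping(
    ["old_col1", "old_col2", "old_col3"],
    ["new_col1", "new_col2", "new_col3"],
)
```

With pandas, a mapping like this could be passed to `DataFrame.rename(columns=mapping)`, and the current names could be read from `list(df.columns)`.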
General Utilities
General-purpose helper utilities used across workflows.