Core Search API Reference

This page documents public, reusable core classes. Collection-specific subclasses are documented in CollectionSearch Subclasses.

NMDC Search Base

The foundational class for cross-collection queries and linked-instance retrieval. Use this class for custom workflows that span multiple schema classes.

class nmdc_api_utilities.nmdc_search.NMDCSearch(api_base_url='https://api.microbiomedata.org', env='')[source]

Bases: NMDCAPIClient

Class for interacting with the NMDC Runtime API for searching and retrieving records in the NMDC metadata database.

Parameters:

api_base_url (str, default: 'https://api.microbiomedata.org') – The base URL of an instance of the NMDC Runtime API. By default, this is the base URL of the production instance. NMDC team members will occasionally set this to the base URL of a different instance; for example, a self-hosted instance used for testing.

get_collection_name_from_id(doc_id)[source]

Determine which collection a document is stored in. Use this when you have an id but not the collection name.

Parameters:

doc_id (str) – The id of the document.

Returns:

The collection name of the document.

Return type:

str

Raises:

RuntimeError – If the API request fails.

get_linked_instances(ids, hydrate=False, types=None, max_page_size=500)[source]

Retrieve linked instances for the given IDs from the NMDC API.

This method returns a list of linked instance records for the given IDs. For instance, if you provide a study ID, this returns records from the biosample_set, data_generation_set, etc. that are associated with that study, even if the association is not represented by a single direct link.

See get_linked_instances_and_associate_ids for a method that returns an alternate format of the data.

Parameters:
  • ids (list[str] | str) – The ids to search for.

  • hydrate (bool, default: False) – Whether to include full documents in the response.

  • types (list[str] | str | None, default: None) – The types of records to return. If omitted or None, linked instances of all types are returned. Example: ["nmdc:Study", "nmdc:Biosample", "nmdc:MassSpectrometry"].

  • max_page_size (int, default: 500) – The maximum number of records to return per page.

Returns:

A list of linked instance records.

Return type:

list[dict]

get_linked_instances_and_associate_ids(ids, types=None, hydrate=False, max_page_size=500)[source]

Retrieve linked instances for the given IDs from the NMDC API and associate them with the input IDs.

This method returns records that are linked to the records with the given IDs. For instance, if you provide the ID of a study record, this can return the ids of records in the biosample_set, data_generation_set, etc. that are associated with that study, even if the association is not represented by a single direct link.

See also get_linked_instances, which returns the linked instances in their original list format. This method reformats that list into a dictionary whose keys are the query ids and whose values are either lists of linked ids or lists of hydrated records.

Parameters:
  • ids (list[str] | str) – The ids to search for.

  • types (list[str] | str | None, default: None) – The types of instances you want to return. If types is None, all types are returned.

  • hydrate (bool, default: False) – Whether to include full documents in the response.

  • max_page_size (int, default: 500) – The maximum number of records to return per page.

Returns:

A dictionary mapping each input id to a list of its linked instance ids, or to a list of hydrated records.

Return type:

dict[str, list[dict | str]]
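The reshaping described above can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the `_upstream_of` and `_downstream_of` keys are assumptions about how each linked record names the query ids it relates to, `associate_ids` is a hypothetical helper, and the sample records are made up.

```python
from typing import Any


def associate_ids(
    query_ids: list[str],
    linked: list[dict[str, Any]],
    hydrate: bool = False,
) -> dict[str, list[Any]]:
    """Group a flat list of linked records under the query ids they relate to."""
    grouped: dict[str, list[Any]] = {qid: [] for qid in query_ids}
    for record in linked:
        # ASSUMPTION: each record lists its related query ids under these two keys.
        related = record.get("_upstream_of", []) + record.get("_downstream_of", [])
        for qid in related:
            if qid in grouped:
                # Keep the full record when hydrated, otherwise only its id.
                grouped[qid].append(record if hydrate else record["id"])
    return grouped


# Made-up records, for illustration only.
sample = [
    {"id": "nmdc:bsm-00-aaaa", "type": "nmdc:Biosample", "_downstream_of": ["nmdc:sty-00-xxxx"]},
    {"id": "nmdc:dobj-00-bbbb", "type": "nmdc:DataObject", "_downstream_of": ["nmdc:sty-00-xxxx"]},
]
result = associate_ids(["nmdc:sty-00-xxxx"], sample)
```

Passing `hydrate=True` to the sketch keeps whole records as the dictionary values instead of ids, mirroring the two value shapes in the return type.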

get_record_from_id(id, filter='', fields='')[source]

Retrieve a record via the NMDC API from a provided record ID.

Parameters:
  • id (str) – The ID of the record to retrieve.

  • filter (str, default: '') – Additional filter to apply to the records. If empty, no additional filter is applied.

  • fields (str, default: '') – Comma-separated list of fields to include in the response. If empty, all fields are returned.

Returns:

The full record data.

Return type:

dict

get_records_by_id(ids, fields='')[source]

Retrieve records via the NMDC API from a provided list of record IDs.

The input ids can be from multiple collections. Input like ["nmdc:sty-11-8fb6t785", "nmdc:bsm-11-002vgm56", "nmdc:dobj-11-00095294"] is valid and will return each of these records in a list of dictionaries.

Parameters:
  • ids (list[str] | str) – List of IDs of records to retrieve.

  • fields (str, default: '') – Comma-separated list of fields to include in the response. An empty string returns all fields.

Returns:

The record(s) data.

Return type:

list[dict]

get_schema_version()[source]

Get the current NMDC schema version used by the NMDC API.

Returns:

The NMDC schema version.

Return type:

str

Collection Search Base

Extends NMDCSearch with collection-focused query helpers. Use this class for generic collection operations, or use CollectionSearch Subclasses for preconfigured collection targets.

class nmdc_api_utilities.collection_search.CollectionSearch(collection_name, api_base_url='https://api.microbiomedata.org', env='')[source]

Bases: NMDCSearch

Class to interact with the NMDC API to search for records within a specified collection.

Parameters:
  • collection_name (str) – The name of the collection to search within.

  • api_base_url (str, default: 'https://api.microbiomedata.org') – The base URL of an instance of the NMDC Runtime API. By default, this is the base URL of the production instance.

check_ids_exist(ids, chunk_size=100, return_missing_ids=False)[source]

Check if specified IDs exist in the collection.

This method constructs a query to the API to filter the collection based on the given IDs, and checks if all IDs exist in the collection.

Parameters:
  • ids (list[str]) – A list of IDs to check if they exist in the collection.

  • chunk_size (int, default: 100) – The number of IDs to check in each query.

  • return_missing_ids (bool, default: False) – If True, also return the list of IDs that do not exist in the collection.

Returns:

True if all IDs exist in the collection, False otherwise. However, if return_missing_ids is True, returns a tuple whose first item is the aforementioned boolean value, and whose second item is a list of the IDs, if any, that don’t exist in the collection.

Return type:

bool | tuple[bool, list[str]]
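The chunked check and the two return shapes can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `ids_exist_sketch` is a hypothetical function, and `lookup` is a stand-in for the API query that reports which ids of a chunk exist.

```python
from typing import Callable, Union


def ids_exist_sketch(
    ids: list[str],
    fetch_existing: Callable[[list[str]], set[str]],
    chunk_size: int = 100,
    return_missing_ids: bool = False,
) -> Union[bool, tuple[bool, list[str]]]:
    """Check ids chunk by chunk; fetch_existing returns the ids of a chunk that exist."""
    missing: list[str] = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        found = fetch_existing(chunk)
        missing.extend(i for i in chunk if i not in found)
    all_exist = not missing
    return (all_exist, missing) if return_missing_ids else all_exist


# Stand-in for the API: pretend only these two ids are in the collection.
known = {"nmdc:bsm-00-aaaa", "nmdc:bsm-00-bbbb"}


def lookup(chunk: list[str]) -> set[str]:
    return known.intersection(chunk)


ok = ids_exist_sketch(["nmdc:bsm-00-aaaa"], lookup)
status, missing_ids = ids_exist_sketch(
    ["nmdc:bsm-00-aaaa", "nmdc:bsm-00-zzzz"], lookup, chunk_size=1, return_missing_ids=True
)
```

Callers that set return_missing_ids=True should unpack a tuple, as in the last call; callers that leave it False receive a bare boolean.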

get_batch_records(id_list, search_field, chunk_size=100, fields='')[source]

Get a batch of records from the collection that relate to input IDs.

This method is used to retrieve records that include any of the IDs from the input list in specified fields (including fields other than id). For example, if records in a collection contain study IDs in a field called associated_studies, this method can be used to retrieve all records that include any of the input study IDs in the associated_studies field.

Parameters:
  • id_list (list) – A list of IDs to get records for.

  • search_field (str) – The field in which to search for the IDs.

  • chunk_size (int, default: 100) – The number of IDs to get in each query.

  • fields (str, default: '') – The fields to return. If empty or not provided, all fields are returned.

Returns:

A list of dictionaries containing the records.

Return type:

list[dict]
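A call following the associated_studies example above might look like the sketch below. The method name and parameters come from the signature documented here; `records_for_studies`, the `SupportsBatchRecords` protocol, and the `FakeClient` stand-in are hypothetical and exist only so the sketch runs without network access.

```python
from typing import Any, Protocol


class SupportsBatchRecords(Protocol):
    """Anything exposing get_batch_records with the documented parameters."""

    def get_batch_records(
        self, id_list: list, search_field: str, chunk_size: int = 100, fields: str = ""
    ) -> list[dict[str, Any]]: ...


def records_for_studies(client: SupportsBatchRecords, study_ids: list[str]) -> list[dict[str, Any]]:
    """Fetch records whose associated_studies field contains any of the study ids."""
    return client.get_batch_records(
        id_list=study_ids,
        search_field="associated_studies",
        chunk_size=50,
        fields="id,name,associated_studies",
    )


class FakeClient:
    """Offline stand-in that records the call instead of querying the API."""

    def get_batch_records(self, id_list, search_field, chunk_size=100, fields=""):
        self.last_call = (id_list, search_field, chunk_size, fields)
        return [{"id": "nmdc:bsm-00-aaaa", "associated_studies": list(id_list)}]


fake = FakeClient()
found = records_for_studies(fake, ["nmdc:sty-00-xxxx"])
```

With the real library, `client` would presumably be a CollectionSearch instance constructed with a collection name such as "biosample_set".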

get_record_by_attribute(attribute_name, attribute_value, max_page_size=25, fields='', all_pages=False, exact_match=False)[source]

Retrieve records via the NMDC API by a specific attribute’s value.

Parameters:
  • attribute_name (str) – The name of the attribute to filter by.

  • attribute_value (str) – The value of the attribute to filter by.

  • max_page_size (int, default: 25) – The number of records to return per page.

  • fields (str, default: '') – The fields to return. If empty, all fields are returned.

  • all_pages (bool, default: False) – True to return all pages. False to return the first page.

  • exact_match (bool, default: False) – Whether the attribute value must match exactly. If False, partial matches are also returned.

Returns:

A list of dictionaries containing the records.

Return type:

list[dict]

get_record_by_filter(filter, max_page_size=25, fields='', all_pages=False)[source]

Retrieve records via the NMDC API using a specified filter.

Parameters:
  • filter (str) – The filter to use to query the collection. Must be in MongoDB query format. Example: {"name": "my record name"}. See the MongoDB query documentation for more on constructing filters.

  • max_page_size (int, default: 25) – The number of records to return per page.

  • fields (str, default: '') – The fields to return. If empty, all fields are returned. Example: "id,name,description,url,type"

  • all_pages (bool, default: False) – True to return all pages. False to return the first page.

Returns:

A list of dictionaries containing the records.

Return type:

list[dict]
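Because filter and fields are both plain strings, it can help to build them from Python objects rather than by hand. The sketch below uses only the standard library; the variable names are illustrative.

```python
import json

# Build the filter as a dict, then serialise it so quoting and escaping are correct.
filter_dict = {"name": "my record name"}
filter_str = json.dumps(filter_dict)

# The fields argument is one comma-separated string, not a list.
fields_str = ",".join(["id", "name", "description", "url", "type"])
```

The resulting `filter_str` and `fields_str` are the kind of values the filter and fields parameters above expect.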

get_record_by_id(record_id=None, max_page_size=100, fields='', collection_id=None)[source]

Retrieve a record from the collection via the NMDC API using a specified ID.

Parameters:
  • record_id (Optional[str], default: None) – The id of the record to retrieve from the collection. It is optional only to preserve backwards compatibility with the deprecated collection_id parameter.

  • max_page_size (int, default: 100) – The maximum number of records to return per page.

  • fields (str, default: '') – The fields to return. Default is all fields.

  • collection_id (Optional[str], default: None) – The id of the record to retrieve from the collection. This parameter is deprecated and will be removed in a future version. Please use record_id instead.

Returns:

A list of dictionaries containing the records.

Return type:

list[dict]

Raises:

RuntimeError – If the API request fails.

get_records(filter='', max_page_size=100, fields='', all_pages=False)[source]

Retrieve records from the collection via the NMDC API.

Parameters:
  • filter (str, default: '') – The filter to apply to the query. An empty string will return all records.

  • max_page_size (int, default: 100) – The maximum number of records to return per page.

  • fields (str, default: '') – The fields to return. An empty string will return all fields.

  • all_pages (bool, default: False) – True to return all pages. False to return the first page.

Returns:

A list of dictionaries containing the records.

Return type:

list[dict]

Raises:

RuntimeError – If the API request fails.
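The difference between all_pages=False and all_pages=True can be illustrated with a small pagination loop. This is a sketch, not the library's implementation: the `resources` and `next_page_token` keys are assumptions about the shape of a page, `collect_records` is a hypothetical helper, and the `pages` dictionary stands in for the API.

```python
from typing import Any, Callable, Optional

Page = dict[str, Any]


def collect_records(
    fetch_page: Callable[[Optional[str]], Page],
    all_pages: bool = False,
) -> list[dict[str, Any]]:
    """Return the first page of records, or follow page tokens to the end."""
    page = fetch_page(None)
    records = list(page.get("resources", []))
    # ASSUMPTION: a page names its successor under "next_page_token".
    while all_pages and page.get("next_page_token"):
        page = fetch_page(page["next_page_token"])
        records.extend(page.get("resources", []))
    return records


# Stand-in for the API: two pages of made-up records keyed by page token.
pages = {
    None: {"resources": [{"id": "a"}, {"id": "b"}], "next_page_token": "t1"},
    "t1": {"resources": [{"id": "c"}]},
}
first_only = collect_records(pages.__getitem__)
everything = collect_records(pages.__getitem__, all_pages=True)
```

With all_pages left at False, only max_page_size records at most come back, so a short result does not by itself mean the collection is small.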

Latitude and Longitude Utilities

Provides geospatial helper methods for lat/lon-based filtering and coordinate handling.

class nmdc_api_utilities.lat_long_filters.LatLongFilters[source]

Bases: ABC

Mixin class with methods to interact with collections that can be searched by latitude and longitude via the NMDC API.

get_record_by_lat_long(lat_comparison, long_comparison, latitude, longitude, page_size=25, fields='', all_pages=False)[source]

Retrieve records by latitude and longitude filters via the NMDC API.

Parameters:
  • lat_comparison (str) – The comparison to use to query the record for latitude. See Notes for more details.

  • long_comparison (str) – The comparison to use to query the record for longitude. See Notes for more details.

  • latitude (float) – The latitude of the record to query.

  • longitude (float) – The longitude of the record to query.

  • page_size (int, default: 25) – The number of results to return per page.

  • fields (str, default: '') – The fields to return. If empty, all fields are returned. Example: "id,name,description,type"

  • all_pages (bool, default: False) – True to return all pages. False to return the first page.

Returns:

A list of records.

Return type:

list[dict]

Raises:

ValueError – If the comparison is not one of the allowed comparisons.

Notes

lat_comparison and long_comparison must be one of the following:

  • eq : Matches values that are equal to the given value.

  • gt : Matches if values are greater than the given value.

  • lt : Matches if values are less than the given value.

  • gte : Matches if values are greater than or equal to the given value.

  • lte : Matches if values are less than or equal to the given value.
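The five comparisons listed above correspond naturally to MongoDB query operators. The sketch below shows one way such a filter could be assembled and validated; it is not the library's implementation, `lat_long_filter` is a hypothetical helper, and the dotted field paths `lat_lon.latitude` and `lat_lon.longitude` are assumptions about where coordinates are stored.

```python
import json

# The five allowed comparisons, mapped to MongoDB query operators.
ALLOWED = {"eq": "$eq", "gt": "$gt", "lt": "$lt", "gte": "$gte", "lte": "$lte"}


def lat_long_filter(
    lat_comparison: str, long_comparison: str, latitude: float, longitude: float
) -> str:
    """Build a MongoDB-style filter string for a latitude/longitude query."""
    for comparison in (lat_comparison, long_comparison):
        if comparison not in ALLOWED:
            raise ValueError(f"comparison must be one of {sorted(ALLOWED)}, got {comparison!r}")
    # ASSUMPTION: coordinates live under these dotted field paths.
    return json.dumps({
        "lat_lon.latitude": {ALLOWED[lat_comparison]: latitude},
        "lat_lon.longitude": {ALLOWED[long_comparison]: longitude},
    })


# Records at or north of 45 degrees latitude and west of -100 degrees longitude.
north_west = lat_long_filter("gte", "lt", 45.0, -100.0)
```

Validating the comparison before building the filter matches the documented behaviour of raising ValueError for anything outside the allowed set.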

get_record_by_latitude(comparison, latitude, page_size=25, fields='', all_pages=False)[source]

Retrieve records by latitude filter via the NMDC API.

Parameters:
  • comparison (str) – The comparison to use to query the record. See Notes for more details.

  • latitude (float) – The latitude of the record to query.

  • page_size (int, default: 25) – The number of results to return per page.

  • fields (str, default: '') – The fields to return. Default is all fields. Example: "id,name,description,type"

  • all_pages (bool, default: False) – True to return all pages. False to return the first page.

Returns:

A list of records.

Return type:

list[dict]

Raises:

ValueError – If the comparison is not one of the allowed comparisons.

Notes

The comparison must be one of the following: "eq", "gt", "lt", "gte", "lte".

  • eq : Matches values that are equal to the given value.

  • gt : Matches if values are greater than the given value.

  • lt : Matches if values are less than the given value.

  • gte : Matches if values are greater than or equal to the given value.

  • lte : Matches if values are less than or equal to the given value.

get_record_by_longitude(comparison, longitude, page_size=25, fields='', all_pages=False)[source]

Retrieve records by longitude filter via the NMDC API.

Parameters:
  • comparison (str) – The comparison to use to query the record. See Notes for more details.

  • longitude (float) – The longitude of the record to query.

  • page_size (int, default: 25) – The number of results to return per page.

  • fields (str, default: '') – The fields to return. If empty, all fields are returned. Example: "id,name,description,type"

  • all_pages (bool, default: False) – True to return all pages. False to return the first page.

Returns:

A list of records.

Return type:

list[dict]

Raises:

ValueError – If the comparison is not one of the allowed comparisons.

Notes

The comparison must be one of the following: "eq", "gt", "lt", "gte", "lte".

  • eq : Matches values that are equal to the given value.

  • gt : Matches if values are greater than the given value.

  • lt : Matches if values are less than the given value.

  • gte : Matches if values are greater than or equal to the given value.

  • lte : Matches if values are less than or equal to the given value.

abstractmethod get_records(filter='', max_page_size=100, fields='', all_pages=False)[source]

Retrieve records from a collection via the NMDC API.

Return type:

list[dict]

Data Processing Utilities

Provides helpers for transforming and reshaping query outputs.

class nmdc_api_utilities.data_processing.DataProcessing[source]

Bases: object

build_filter(attributes, exact_match=False)[source]

Create a MongoDB filter using $regex for each attribute in the input dictionary. For nested attributes, use dot notation.

Parameters:
  • attributes (dict[str, str]) – Dictionary of attribute names and their corresponding values to match using regex. Example: {"name": "example", "description": "example", "geo_loc_name": "example"}

  • exact_match (bool, default: False) – Whether each attribute value must match exactly. If False, partial matches are accepted. Internally, this determines whether the attribute value is wrapped in a regex expression.

Returns:

A string representing the MongoDB filter.

Return type:

str
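To make the exact_match switch concrete, here is a minimal sketch of how such a filter string could be produced. It is not the library's implementation: `build_filter_sketch` is a hypothetical function, escaping with `re.escape` is an assumption, and the dotted path `geo_loc_name.has_raw_value` is used only to illustrate dot notation for a nested attribute.

```python
import json
import re


def build_filter_sketch(attributes: dict[str, str], exact_match: bool = False) -> str:
    """Return a MongoDB filter string that matches every given attribute."""
    conditions: dict[str, object] = {}
    for name, value in attributes.items():
        if exact_match:
            # Exact match: compare against the literal value.
            conditions[name] = value
        else:
            # Partial match: wrap in $regex, escaping metacharacters so the
            # value is matched literally anywhere in the field.
            conditions[name] = {"$regex": re.escape(value)}
    return json.dumps(conditions)


partial = build_filter_sketch({"name": "soil", "geo_loc_name.has_raw_value": "USA"})
exact = build_filter_sketch({"name": "soil"}, exact_match=True)
```

The returned string has the same form as the filter argument accepted by the collection query methods, so the two can be combined.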

convert_to_df(data)[source]

Convert a list of dictionaries to a pandas dataframe.

Parameters:

data (list[dict[str, Any]]) – A list of dictionaries.

Returns:

A pandas dataframe representation of the input dictionaries.

Return type:

DataFrame

extract_field(api_results, field_name)[source]

Extract a specific field’s values from records retrieved via the NMDC API.

Parameters:
  • api_results (list[dict[str, Any]]) – A list of dictionaries.

  • field_name (str) – The name of the field to extract.

Returns:

A list of values for the specified field.

Return type:

list[Any]
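The extraction amounts to pulling one key out of each record. The sketch below illustrates the idea with made-up records; `extract_field_sketch` is a hypothetical function, and skipping records that lack the field is an assumption rather than documented behaviour.

```python
from typing import Any


def extract_field_sketch(api_results: list[dict[str, Any]], field_name: str) -> list[Any]:
    """Collect one field's value from each record that has it."""
    # ASSUMPTION: records lacking the field are skipped rather than raising.
    return [record[field_name] for record in api_results if field_name in record]


# Made-up records, for illustration only.
records = [
    {"id": "nmdc:bsm-00-aaaa", "name": "sample A"},
    {"id": "nmdc:bsm-00-bbbb"},
]
ids = extract_field_sketch(records, "id")
names = extract_field_sketch(records, "name")
```

A list of ids obtained this way is a convenient input for a follow-up query that takes a list of ids.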

merge_dataframes(column, df1, df2)[source]

Merge two dataframes.

Wrapper around pandas.merge to merge two dataframes on a specified column using an inner join.

Parameters:
  • column (str) – The column to merge on.

  • df1 (DataFrame) – The first dataframe to merge.

  • df2 (DataFrame) – The second dataframe to merge.

Returns:

A pandas dataframe with the merged data.

Return type:

DataFrame

merge_df(df1, df2, key1, key2)[source]

Merges two dataframes using an inner join based on specified keys, automatically exploding list-like columns and removing duplicates.

Helpful for merging two sets of dataframe results obtained from the convert_to_df method.

Parameters:
  • df1 (DataFrame) – The first dataframe to merge.

  • df2 (DataFrame) – The second dataframe to merge.

  • key1 (str) – The key in df1 to match with key2 in df2.

  • key2 (str) – The key in df2 to match with key1 in df1.

Returns:

A pandas dataframe with the merged data.

Return type:

DataFrame

rename_columns(df, new_col_names)[source]

Rename columns in a pandas dataframe.

Parameters:
  • df (DataFrame) – The pandas dataframe whose columns will be renamed.

  • new_col_names (list[str]) –

    A list of new column names. Names MUST be in the same order as the columns in the dataframe. Example: if the current column names are ['old_col1', 'old_col2', 'old_col3'], pass the new names as ['new_col1', 'new_col2', 'new_col3'].

Returns:

A pandas dataframe with renamed columns.

Return type:

DataFrame

split_list(input_list, chunk_size=100)[source]

Split a list into chunks of a specified size.

Parameters:
  • input_list (list[Any]) – The list to split.

  • chunk_size (int, default: 100) – The size of each chunk.

Returns:

A list of chunks, where each chunk is a list taken from the input list.

Return type:

list[list[Any]]
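Chunking of this kind is commonly written as a slice-based comprehension. The sketch below shows that approach with a hypothetical `split_list_sketch`; it assumes the final chunk may be shorter than chunk_size when the list does not divide evenly.

```python
from typing import Any


def split_list_sketch(input_list: list[Any], chunk_size: int = 100) -> list[list[Any]]:
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [input_list[i:i + chunk_size] for i in range(0, len(input_list), chunk_size)]


chunks = split_list_sketch(["a", "b", "c", "d", "e"], chunk_size=2)
```

Here `chunks` holds two full chunks followed by a final chunk containing the single leftover item.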

General Utilities

General-purpose helper utilities used across workflows.

class nmdc_api_utilities.utils.Utils[source]

Bases: object