NMDC API Utilities Documentation

Welcome to NMDC API Utilities documentation. This package provides tools for interacting with the NMDC API.

The Collection Module is a foundational component that defines the behaviors and properties shared by all collections.

Each subclass is tailored to a specific collection and is more user-friendly, which makes the subclasses the recommended entry points for the package. Every method of CollectionSearch can be accessed via each subclass.
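The relationship between the generic collection class and its subclasses can be pictured with stand-in classes. The sketch below does not use the package itself: the class names ending in Sketch, the describe method, and the "biosample_set" collection name are assumptions made for illustration only.

```python
class CollectionSearchSketch:
    """Stand-in for a generic collection client (not the real class)."""

    def __init__(self, collection_name: str, env: str = "prod"):
        self.collection_name = collection_name
        self.env = env

    def describe(self) -> str:
        # Shared behaviour lives on the base class.
        return f"{self.collection_name} ({self.env})"


class BiosampleSearchSketch(CollectionSearchSketch):
    """Stand-in subclass that fixes the collection name for the caller."""

    def __init__(self, env: str = "prod"):
        # "biosample_set" is an assumed collection name for illustration.
        super().__init__(collection_name="biosample_set", env=env)


client = BiosampleSearchSketch()
print(client.describe())
```

The caller of the subclass only chooses an environment, while the inherited method stays available unchanged.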

NMDC Module

class nmdc_api_utilities.nmdc_search.NMDCSearch(env='prod')[source]

Bases: object

Base class for interacting with the NMDC API. Sets the base URL for the API based on the environment. The environment defaults to the production instance of the API. This functionality is in place for monthly testing of the runtime updates to the API.

Parameters:

env (str) –

The environment to use. Default is prod. Must be one of the following:

  • prod

  • dev

Collection Module

class nmdc_api_utilities.collection_search.CollectionSearch(collection_name, env='prod')[source]

Bases: NMDCSearch

Class to interact with the NMDC API to get collections of data. Must know the collection name to query.

check_ids_exist(ids: list, chunk_size: int = 100, return_missing_ids: bool = False) bool[source]

Check if the IDs exist in the collection.

This method constructs a query to the API to filter the collection based on the given IDs, and checks if all IDs exist in the collection.

Parameters:
  • ids (list) – A list of IDs to check if they exist in the collection.

  • chunk_size (int) – The number of IDs to check in each query. Default is 100.

  • return_missing_ids (bool) – If True, and if ids are missing in the collection, return the list of IDs that do not exist in the collection. Default is False.

Returns:

True if all IDs exist in the collection, False otherwise.

Return type:

bool
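One way to picture the existence check is as a comparison between the requested IDs and the IDs that came back from the query. The sketch below is not the library's implementation; the helper name find_missing_ids and the sample IDs are made up for illustration.

```python
def find_missing_ids(requested_ids, found_records):
    """Return the requested IDs that do not appear in found_records."""
    found = {record["id"] for record in found_records if "id" in record}
    # Preserve the caller's order so the result is easy to compare by eye.
    return [doc_id for doc_id in requested_ids if doc_id not in found]


# Made-up IDs and records standing in for an API response.
requested = ["nmdc:bsm-11-aaa", "nmdc:bsm-11-bbb", "nmdc:bsm-11-ccc"]
records = [{"id": "nmdc:bsm-11-aaa"}, {"id": "nmdc:bsm-11-ccc"}]

missing = find_missing_ids(requested, records)
all_exist = not missing
```

Here all_exist plays the role of the boolean result, and missing corresponds to the list that return_missing_ids asks for.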

get_batch_records(id_list: list, search_field: str, chunk_size=100, fields='') list[dict][source]

Get a batch of records from the collection by a list of input IDs. This method identifies records whose search_field includes any of the IDs from the input list. It uses the MongoDB $in operator to find the records that include the input IDs.

Parameters:
  • id_list (list) – A list of IDs to get records for.

  • search_field (str) – The field to search for. This must match a field from the NMDC Schema.

  • chunk_size (int) – The number of IDs to get in each query. Default is 100.

  • fields (str) – The fields to return. Default is all fields.

Returns:

A list of dictionaries containing the records.

Return type:

list[dict]
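The combination of chunk_size and $in can be sketched as one filter per chunk of IDs. This is a minimal illustration under assumptions, not the library's code: build_in_filters is a hypothetical helper, and "has_input" is only an example value for search_field.

```python
import json


def build_in_filters(id_list, search_field, chunk_size=100):
    """Yield one JSON filter string per chunk of IDs."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(id_list), chunk_size):
        chunk = id_list[start:start + chunk_size]
        # $in matches records whose field contains any ID in the chunk.
        yield json.dumps({search_field: {"$in": chunk}})


filters = list(build_in_filters(["a", "b", "c"], "has_input", chunk_size=2))
for f in filters:
    print(f)
```

Chunking keeps each query string short, which is presumably why the method exposes chunk_size.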

get_record_by_attribute(attribute_name: str, attribute_value: str, max_page_size: int = 25, fields: str = '', all_pages: bool = False, exact_match: bool = False)[source]

Get records from the NMDC API by attribute, for example by name. Records can be filtered on the attributes described at https://microbiomedata.github.io/nmdc-schema/.

Parameters:
  • attribute_name (str) – The name of the attribute to filter by.

  • attribute_value (str) – The value of the attribute to filter by.

  • max_page_size (int) – The number of results to return per page. Default is 25.

  • fields (str) – The fields to return. Default is all fields.

  • all_pages (bool) – True to return all pages. False to return the first page. Default is False.

  • exact_match (bool) – Whether attribute_value must match exactly. Default is False, which allows partial matches. Under the hood, this determines whether the value is wrapped in a regular expression.

Returns:

A list of dictionaries containing the records.

Return type:

list[dict]

get_record_by_filter(filter: str, max_page_size=25, fields: str = '', all_pages=False) list[dict][source]

Get records from the NMDC API that match a filter.

Parameters:
  • filter (str) –

    The filter to use to query the collection. Must be in MongoDB query format.

    Resources found here - https://www.mongodb.com/docs/manual/reference/method/db.collection.find/#std-label-method-find-query

    Example: {"name": "my record name"}

  • max_page_size (int) – The number of results to return per page. Default is 25.

  • fields (str) – The fields to return. Default is all fields. Example: “id,name,description,alternative_identifiers,file_size_bytes,md5_checksum,data_object_type,url,type”

  • all_pages (bool) – True to return all pages. False to return the first page. Default is False.

Returns:

A list of dictionaries containing the records.

Return type:

list[dict]
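Because the filter parameter is a string in MongoDB query format, one low-risk way to produce it is to write the query as a Python dict and serialize it. This is a sketch of that idea only; the record name is a placeholder.

```python
import json

# Write the query as a dict, then serialize it to the string form
# that a filter parameter of type str expects.
filter_dict = {"name": "my record name"}
filter_str = json.dumps(filter_dict)
print(filter_str)
```

Serializing with json.dumps guarantees straight double quotes and balanced braces, which hand-written filter strings often get wrong.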

get_record_by_id(collection_id: str, max_page_size: int = 100, fields: str = '') list[dict][source]

Get a collection of data from the NMDC API by id.

Parameters:
  • collection_id (str) – The id of the collection.

  • max_page_size (int) – The maximum number of items to return per page. Default is 100.

  • fields (str) – The fields to return. Default is all fields.

Returns:

A list of dictionaries containing the records.

Return type:

list[dict]

Raises:

RuntimeError – If the API request fails.

get_records(filter: str = '', max_page_size: int = 100, fields: str = '', all_pages: bool = False) list[dict][source]

Get a collection of data from the NMDC API. This is a generic function; a specific filter can be provided if desired.

Parameters:
  • filter (str) – The filter to apply to the query. Default is an empty string.

  • max_page_size (int) – The maximum number of items to return per page. Default is 100.

  • fields (str) – The fields to return. Default is all fields.

  • all_pages (bool) – True to return all pages. False to return the first page. Default is False.

Returns:

A list of dictionaries containing the records.

Return type:

list[dict]

Raises:

RuntimeError – If the API request fails.
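The effect of all_pages can be sketched as a loop that either stops after the first page or keeps following a page token. The code below makes several assumptions that are not confirmed by this documentation: fetch_page is a hypothetical callable, and the "resources" and "next_page_token" keys are an assumed response shape. A dictionary of fake pages stands in for real HTTP requests.

```python
def collect_records(fetch_page, all_pages=False):
    """Gather records from a paged source.

    fetch_page(token) is a hypothetical callable returning a dict with a
    "resources" list and, when more pages remain, a "next_page_token".
    """
    records = []
    token = None
    while True:
        page = fetch_page(token)
        records.extend(page.get("resources", []))
        token = page.get("next_page_token")
        # Stop after the first page unless all pages were requested.
        if not all_pages or not token:
            return records


# Fake two-page source standing in for real HTTP requests.
PAGES = {
    None: {"resources": [{"id": 1}, {"id": 2}], "next_page_token": "p2"},
    "p2": {"resources": [{"id": 3}]},
}

first_page = collect_records(PAGES.__getitem__)
everything = collect_records(PAGES.__getitem__, all_pages=True)
```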

Latitude Longitude Module

class nmdc_api_utilities.lat_long_filters.LatLongFilters(collection_name, env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to filter sets by latitude and longitude.

get_record_by_lat_long(lat_comparison: str, long_comparison: str, latitude: float, longitude: float, page_size: int = 25, fields: str = '', all_pages: bool = False) list[dict][source]

Get a record from the NMDC API by latitude and longitude comparison.

Parameters:
  • lat_comparison (str) –

    The comparison to use to query the record for latitude. MUST BE ONE OF THE FOLLOWING:

    eq - Matches values that are equal to the given value.
    gt - Matches if values are greater than the given value.
    lt - Matches if values are less than the given value.
    gte - Matches if values are greater than or equal to the given value.
    lte - Matches if values are less than or equal to the given value.

  • long_comparison (str) –

    The comparison to use to query the record for longitude. MUST BE ONE OF THE FOLLOWING:

    eq - Matches values that are equal to the given value.
    gt - Matches if values are greater than the given value.
    lt - Matches if values are less than the given value.
    gte - Matches if values are greater than or equal to the given value.
    lte - Matches if values are less than or equal to the given value.

  • latitude (float) – The latitude of the record to query.

  • longitude (float) – The longitude of the record to query.

  • page_size (int) – The number of results to return per page. Default is 25.

  • fields (str) – The fields to return. Default is all fields. Example: “id,name,description,alternative_identifiers,file_size_bytes,md5_checksum,data_object_type,url,type”

  • all_pages (bool) – True to return all pages. False to return the first page. Default is False.

Returns:

A list of records.

Return type:

list[dict]

Raises:

ValueError – If the comparison is not one of the allowed comparisons.
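The comparison keywords map naturally onto MongoDB comparison operators ($eq, $gt, $lt, $gte, $lte). The sketch below shows one way such a filter could be assembled and validated. It is not the library's implementation: the helper names and the "lat_lon.latitude" and "lat_lon.longitude" field paths are assumptions.

```python
import json

ALLOWED_COMPARISONS = ("eq", "gt", "lt", "gte", "lte")


def comparison_clause(comparison, value):
    """Translate a comparison keyword into a MongoDB operator clause."""
    if comparison not in ALLOWED_COMPARISONS:
        raise ValueError(
            f"comparison must be one of {ALLOWED_COMPARISONS}, got {comparison!r}"
        )
    return {f"${comparison}": value}


def lat_long_filter(lat_comparison, long_comparison, latitude, longitude):
    """Build a JSON filter string combining both coordinate comparisons."""
    # The field paths below are assumptions made for this sketch.
    return json.dumps({
        "lat_lon.latitude": comparison_clause(lat_comparison, latitude),
        "lat_lon.longitude": comparison_clause(long_comparison, longitude),
    })


print(lat_long_filter("gt", "lt", 45.0, -110.0))
```

Validating the keyword before building the filter mirrors the documented ValueError for comparisons outside the allowed set.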

get_record_by_latitude(comparison: str, latitude: float, page_size=25, fields='', all_pages=False)[source]

Get a record from the NMDC API by latitude comparison.

Parameters:
  • comparison (str) –

    The comparison to use to query the record. MUST BE ONE OF THE FOLLOWING:

    eq - Matches values that are equal to the given value.
    gt - Matches if values are greater than the given value.
    lt - Matches if values are less than the given value.
    gte - Matches if values are greater than or equal to the given value.
    lte - Matches if values are less than or equal to the given value.

  • latitude (float) – The latitude of the record to query.

  • page_size (int) – The number of results to return per page. Default is 25.

  • fields (str) – The fields to return. Default is all fields. Example: “id,name,description,alternative_identifiers,file_size_bytes,md5_checksum,data_object_type,url,type”

  • all_pages (bool) – True to return all pages. False to return the first page. Default is False.

Returns:

A list of records.

Return type:

list[dict]

Raises:

ValueError – If the comparison is not one of the allowed comparisons.

get_record_by_longitude(comparison: str, longitude: float, page_size: int = 25, fields: str = '', all_pages: bool = False) list[dict][source]

Get a record from the NMDC API by longitude comparison.

Parameters:
  • comparison (str) –

    The comparison to use to query the record. MUST BE ONE OF THE FOLLOWING:

    eq - Matches values that are equal to the given value.
    gt - Matches if values are greater than the given value.
    lt - Matches if values are less than the given value.
    gte - Matches if values are greater than or equal to the given value.
    lte - Matches if values are less than or equal to the given value.

  • longitude (float) – The longitude of the record to query.

  • page_size (int) – The number of results to return per page. Default is 25.

  • fields (str) – The fields to return. Default is all fields. Example: “id,name,description,alternative_identifiers,file_size_bytes,md5_checksum,data_object_type,url,type”

  • all_pages (bool) – True to return all pages. False to return the first page. Default is False.

Returns:

A list of records.

Return type:

list[dict]

Raises:

ValueError – If the comparison is not one of the allowed comparisons.

Functional Search Module

class nmdc_api_utilities.functional_search.FunctionalSearch(env='prod')[source]

Bases: object

Class to interact with the NMDC API to filter functional annotations by KEGG, COG, or PFAM ids.

get_functional_annotations(annotation: str, annotation_type: str, page_size: int = 25, fields: str = '', all_pages: bool = False) list[dict][source]

Get a record from the NMDC API by id. ID types can be KEGG, COG, or PFAM.

Parameters:
  • annotation (str) – The database id used to query the functional annotations.

  • annotation_type (str) –

    The type of id to query. MUST be one of the following:

    KEGG
    COG
    PFAM

  • page_size (int) – The number of results to return per page. Default is 25.

  • fields (str) – The fields to return. Default is all fields. Example: “id,name”

  • all_pages (bool) – True to return all pages. False to return the first page. Default is False.

Returns:

A list of functional annotations.

Return type:

list[dict]

get_records(filter: str = '', max_page_size: int = 100, fields: str = '', all_pages: bool = False) list[dict][source]

Get a collection of data from the NMDC API. This is a generic function; a specific filter can be provided if desired.

Parameters:
  • filter (str) – The filter to apply to the query. Default is an empty string.

  • max_page_size (int) – The maximum number of items to return per page. Default is 100.

  • fields (str) – The fields to return. Default is all fields.

  • all_pages (bool) – True to return all pages. False to return the first page. Default is False.

Returns:

A list of records.

Return type:

list[dict]

Collection Helpers

class nmdc_api_utilities.collection_helpers.CollectionHelpers(env='prod')[source]

Bases: NMDCSearch

Class to interact with the NMDC API to get additional information about collections. These functions may not be specific to a particular collection.

get_record_name_from_id(doc_id: str) str[source]

Used when you have an id but not the collection name. Determines the schema class to which the id belongs.

Parameters:

doc_id (str) – The id of the document.

Returns:

The collection name of the document.

Return type:

str

Raises:

RuntimeError – If the API request fails.

Metadata Module

class nmdc_api_utilities.metadata.Metadata(env='prod')[source]

Bases: NMDCSearch

Class to interact with the NMDC API metadata.

validate_json(json_path: str) None[source]

Validates a json file using the NMDC json validate endpoint.

If the validation passes, the method returns without any side effects.

Parameters:

json_path (str) – The path to the json file to be validated.

Raises:

Exception – If the validation fails.

Mint Module

class nmdc_api_utilities.minter.Minter(env='prod')[source]

Bases: NMDCSearch

Class to interact with the NMDC API to mint new identifiers.

mint(nmdc_type: str, client_id: str, client_secret: str) str[source]

Mint a new identifier for a collection.

Parameters:
  • nmdc_type (str) – The type of NMDC ID to mint (e.g., ‘nmdc:MassSpectrometry’, ‘nmdc:DataObject’).

  • client_id (str) – The client ID for the NMDC API.

  • client_secret (str) – The client secret for the NMDC API.

Returns:

The minted identifier.

Return type:

str

Raises:

RuntimeError – If the API request fails.

Notes

Security Warning: Your client_id and client_secret should be stored in a secure location.

We recommend using environment variables. Do not hard code these values in your code.
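The environment-variable recommendation can be sketched as follows. The variable names NMDC_CLIENT_ID and NMDC_CLIENT_SECRET and the helper load_credentials are hypothetical; any names can be used, as long as the values stay out of the source code.

```python
import os


def load_credentials(environ=os.environ):
    """Read the client ID and secret from environment variables.

    NMDC_CLIENT_ID and NMDC_CLIENT_SECRET are hypothetical variable names.
    """
    try:
        return environ["NMDC_CLIENT_ID"], environ["NMDC_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Environment variable {missing} is not set") from None


# Demonstration with a plain dict standing in for the real environment.
client_id, client_secret = load_credentials(
    {"NMDC_CLIENT_ID": "example-id", "NMDC_CLIENT_SECRET": "example-secret"}
)
```

The returned pair can then be passed as the client_id and client_secret arguments of mint.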

BioSample Subclass

class nmdc_api_utilities.biosample_search.BiosampleSearch(env='prod')[source]

Bases: LatLongFilters, CollectionSearch

Class to interact with the NMDC API to get biosamples.

Calibration Subclass

class nmdc_api_utilities.calibration_search.CalibrationSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get calibration records.

Chemical Entity Subclass

class nmdc_api_utilities.chemical_entity_search.ChemicalEntitySearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get chemical entities.

Collecting Biosamples From Site Subclass

class nmdc_api_utilities.collecting_biosamples_from_site_search.CollectingBiosamplesFromSiteSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get collecting biosamples from site sets.

Configuration Subclass

class nmdc_api_utilities.configuration_search.ConfigurationSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get configuration sets.

Data Generation Subclass

class nmdc_api_utilities.data_generation_search.DataGenerationSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get data generation sets.

Data Object Subclass

class nmdc_api_utilities.data_object_search.DataObjectSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get data object sets.

get_data_objects_for_studies(study_id: str, max_page_size: int = 100) list[dict][source]

Get data objects by study id.

Parameters:
  • study_id (str) – The study id to search for.

  • max_page_size (int) – The maximum number of items to return per page. Default is 100.

Returns:

A list of data objects.

Return type:

list[dict]

Raises:

RuntimeError – If the API request fails.

Field Research From Site Subclass

class nmdc_api_utilities.field_research_site_search.FieldResearchSiteSearch(env='prod')[source]

Bases: LatLongFilters, CollectionSearch

Class to interact with the NMDC API to get field research site sets.

Instrument Subclass

class nmdc_api_utilities.instrument_search.InstrumentSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get instrument sets.

Manifest Subclass

class nmdc_api_utilities.manifest_search.ManifestSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get genome manifest sets.

Material Subclass

class nmdc_api_utilities.material_processing_search.MaterialProcessingSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get material processing sets.

Processed Sample Subclass

class nmdc_api_utilities.processed_sample_search.ProcessedSampleSearch[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get processed sample sets.

Protocol Execution Subclass

class nmdc_api_utilities.protocol_execution_search.ProtocolExecutionSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get protocol execution sets.

Storage Process Subclass

class nmdc_api_utilities.storage_process_search.StorageProcessSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get storage process sets.

Study Subclass

class nmdc_api_utilities.study_search.StudySearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get studies.

Functional Annotation Agg Subclass

class nmdc_api_utilities.functional_annotation_agg_search.FunctionalAnnotationAggSearch(env='prod')[source]

Bases: FunctionalSearch

Class to interact with the NMDC API to get functional annotation agg sets. These are most helpful when trying to identify workflows associated with KEGG, COG, or PFAM ids.

Workflow Execution Subclass

class nmdc_api_utilities.workflow_execution_search.WorkflowExecutionSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get workflow execution sets.

Data Processing

class nmdc_api_utilities.data_processing.DataProcessing[source]

Bases: object

build_filter(attributes: dict, exact_match: bool = False) dict[source]

Create a MongoDB filter using $regex for each attribute in the input dictionary. For nested attributes, use dot notation.

Parameters:
  • attributes (dict) – Dictionary of attribute names and their corresponding values to match using regex. Example: {“name”: “example”, “description”: “example”, “geo_loc_name”: “example”}

  • exact_match (bool) – Whether the attribute values must match exactly. Default is False, which allows partial matches. Under the hood, this determines whether each value is wrapped in a regular expression.

Returns:

A dictionary representing the MongoDB filter. Example: {“name”: {“$regex”: “example”, “$options”: “i”}, “description”: {“$regex”: “example”, “$options”: “i”}}

Return type:

dict
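The documented input and output examples suggest a transformation along the following lines. This is a sketch, not the library's source: build_filter_sketch is a hypothetical name, and the behaviour for exact_match=True (comparing the raw value directly) is an assumption.

```python
def build_filter_sketch(attributes, exact_match=False):
    """Build a MongoDB-style filter dict from attribute/value pairs."""
    mongo_filter = {}
    for name, value in attributes.items():
        if exact_match:
            # Assumption: an exact match compares the raw value directly.
            mongo_filter[name] = value
        else:
            # Case-insensitive partial match, as in the documented example.
            mongo_filter[name] = {"$regex": value, "$options": "i"}
    return mongo_filter


partial = build_filter_sketch({"name": "example", "description": "example"})
exact = build_filter_sketch({"name": "example"}, exact_match=True)
print(partial)
print(exact)
```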

convert_to_df(data: list) DataFrame[source]

Convert a list of dictionaries to a pandas dataframe.

Parameters:

data (list) – A list of dictionaries.

Returns:

A pandas dataframe.

Return type:

pd.DataFrame

extract_field(api_results: list, field_name: str) list[source]

Extract a field from the API results.

Parameters:
  • api_results (list) – A list of dictionaries.

  • field_name (str) – The name of the field to extract.

Returns:

A list of values for the specified field.

Return type:

list
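Extracting one field from a list of result dictionaries can be sketched with a list comprehension. The helper name extract_field_sketch and the choice to skip records that lack the field are assumptions; the documentation does not say how missing fields are handled.

```python
def extract_field_sketch(api_results, field_name):
    """Collect field_name from each record that has it."""
    # Assumption: records missing the field are skipped rather than raising.
    return [record[field_name] for record in api_results if field_name in record]


# Made-up records standing in for API results.
results = [{"id": "a", "name": "first"}, {"id": "b"}, {"id": "c", "name": "third"}]
names = extract_field_sketch(results, "name")
ids = extract_field_sketch(results, "id")
```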

merge_dataframes(column: str, df1: DataFrame, df2: DataFrame) DataFrame[source]

Merge two dataframes.

Parameters:
  • column (str) – The column to merge on.

  • df1 (pd.DataFrame) – The first dataframe to merge.

  • df2 (pd.DataFrame) – The second dataframe to merge.

Returns:

A pandas dataframe with the merged data.

Return type:

pd.DataFrame

merge_df(df1: DataFrame, df2: DataFrame, key1: str, key2: str) DataFrame[source]

Merge new results with the previous results that were used for the new API request. The merge matches on one key from each dataframe (key1 in df1 and key2 in df2).

Parameters:
  • df1 (pd.DataFrame) – The first dataframe to merge.

  • df2 (pd.DataFrame) – The second dataframe to merge.

  • key1 (str) – The key in df1 to match with key2 in df2.

  • key2 (str) – The key in df2 to match with key1 in df1.

Returns:

A pandas dataframe with the merged data.

Return type:

pd.DataFrame

rename_columns(df: DataFrame, new_col_names: list) DataFrame[source]

Rename columns in a pandas dataframe.

Parameters:
  • df (pd.DataFrame) – The pandas dataframe to rename columns.

  • new_col_names (list) –

    A list of new column names. Names MUST be in order of the columns in the dataframe.

    Example:

    If the current column names are - [‘old_col1’, ‘old_col2’, ‘old_col3’] You will need to pass in the new names like - [‘new_col1’, ‘new_col2’, ‘new_col3’]

Returns:

A pandas dataframe with renamed columns.

Return type:

pd.DataFrame

split_list(input_list: list, chunk_size: int = 100) list[source]

Split a list into chunks of a specified size.

Parameters:
  • input_list (list) – The list to split.

  • chunk_size (int) – The size of each chunk. Default is 100.

Returns:

A list of lists.

Return type:

list
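Chunking a list can be sketched with slicing. This is an illustration of the technique rather than the library's code; split_list_sketch is a hypothetical name, and rejecting a chunk_size below 1 is an added assumption.

```python
def split_list_sketch(input_list, chunk_size=100):
    """Split input_list into consecutive chunks of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    return [
        input_list[start:start + chunk_size]
        for start in range(0, len(input_list), chunk_size)
    ]


chunks = split_list_sketch([1, 2, 3, 4, 5], chunk_size=2)
print(chunks)  # [[1, 2], [3, 4], [5]]
```

The last chunk is shorter whenever the list length is not a multiple of chunk_size.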

Utils

class nmdc_api_utilities.utils.Utils[source]

Bases: object