NMDC API Utilities Documentation

Welcome to NMDC API Utilities documentation. This package provides tools for interacting with the NMDC API.

The Collection Module is a foundational component that defines the behaviors and properties shared by all collections.

Each subclass is tailored to a specific collection and is more user-friendly than the base class, which makes the subclasses the recommended entry points for the package. Every CollectionSearch method can be called through each subclass.

NMDC Module

class nmdc_api_utilities.nmdc_search.NMDCSearch(env='prod')[source]

Bases: object

Base class for interacting with the NMDC API. Sets the base URL for the API based on the environment.

Collection Module

class nmdc_api_utilities.collection_search.CollectionSearch(collection_name, env='prod')[source]

Bases: NMDCSearch

Class to interact with the NMDC API to get collections of data. Must know the collection name to query.

check_ids_exist(ids: list, chunk_size=100) -> bool[source]

Check if the IDs exist in the collection.

This method constructs a query to the API to filter the collection based on the given IDs, and checks if all IDs exist in the collection.

Parameters:
  • ids (list) – A list of IDs to check if they exist in the collection.

  • chunk_size (int) – The number of IDs to check in each query. Default is 100.

Returns:

True if all IDs exist in the collection, False otherwise.

Return type:

bool

Raises:

requests.RequestException – If there’s an error in making the API request.
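The chunked check described above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: the names `chunked`, `all_ids_exist`, and `fetch_ids`, and the `$in` filter on an `id` field, are assumptions made for the example, and the API call is replaced by an in-memory stand-in.

```python
import json
from typing import Callable, Iterator, List

def chunked(ids: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices holding at most chunk_size IDs."""
    for start in range(0, len(ids), chunk_size):
        yield ids[start:start + chunk_size]

def all_ids_exist(ids: List[str],
                  fetch_ids: Callable[[str], List[str]],
                  chunk_size: int = 100) -> bool:
    """Return True only if every ID is reported as present.

    fetch_ids is a hypothetical callable that takes a filter string and
    returns the IDs of the matching records.
    """
    for chunk in chunked(ids, chunk_size):
        # Assumed filter shape: match any record whose id is in the chunk.
        filter_str = json.dumps({"id": {"$in": chunk}})
        found = set(fetch_ids(filter_str))
        if not set(chunk).issubset(found):
            return False
    return True

# In-memory stand-in for the API, used only to exercise the sketch.
KNOWN_IDS = {"nmdc:bsm-1", "nmdc:bsm-2", "nmdc:bsm-3"}

def fake_fetch_ids(filter_str: str) -> List[str]:
    wanted = json.loads(filter_str)["id"]["$in"]
    return [record_id for record_id in wanted if record_id in KNOWN_IDS]
```

Returning as soon as one chunk has a missing ID avoids issuing further queries once the answer is known to be False.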

get_record_by_attribute(attribute_name, attribute_value, max_page_size=25, fields='', all_pages=False, exact_match=False)[source]

Get a record from the NMDC API by one of its attributes. Records can be filtered on any attribute defined in the NMDC schema: https://microbiomedata.github.io/nmdc-schema/. params:

attribute_name: str

The name of the attribute to filter by.

attribute_value: str

The value of the attribute to filter by.

max_page_size: int

The number of results to return per page. Default is 25.

fields: str

The fields to return. Default is all fields.

all_pages: bool

True to return all pages. False to return the first page. Default is False.

exact_match: bool

Whether attribute_value must match exactly or may match partially. Default is False, meaning a partial match is accepted. Under the hood, this determines whether the attribute value is wrapped in a regular expression.
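A call using the signature documented above might look like the following sketch. The helper name `find_biosamples_by_name` is hypothetical, the choice of the `name` attribute and the `id,name` field list are only illustrative, and the shape of the returned value is not described here, so the sketch passes it through unchanged.

```python
def find_biosamples_by_name(name_fragment: str, exact: bool = False):
    """Query biosamples whose name matches name_fragment.

    The import is deferred so that this helper can be defined even when
    the package is not installed; calling it requires the package and
    network access.
    """
    from nmdc_api_utilities.biosample_search import BiosampleSearch

    client = BiosampleSearch(env="prod")
    return client.get_record_by_attribute(
        attribute_name="name",
        attribute_value=name_fragment,
        max_page_size=25,
        fields="id,name",
        all_pages=False,      # first page only
        exact_match=exact,    # False allows partial matches
    )
```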

get_record_by_filter(filter: str, max_page_size=25, fields='', all_pages=False)[source]

Get a record from the NMDC API using a filter. params:

filter: str
The filter to use to query the collection. Must be in MongoDB query format.

Resources found here - https://www.mongodb.com/docs/manual/reference/method/db.collection.find/#std-label-method-find-query

Example: {"name": "my record name"}

max_page_size: int

The number of results to return per page. Default is 25.

fields: str

The fields to return. Default is all fields. Example: “id,name,description,alternative_identifiers,file_size_bytes,md5_checksum,data_object_type,url,type”

all_pages: bool

True to return all pages. False to return the first page. Default is False.
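Because the filter is passed as a string in MongoDB query format, one convenient way to produce it is to build a Python dictionary and serialize it with `json.dumps`. The snippet below is a small sketch; the attribute names and values are placeholders.

```python
import json

# Exact match on the "name" attribute.
exact_filter = json.dumps({"name": "my record name"})

# Partial match using MongoDB's $regex operator.
partial_filter = json.dumps({"name": {"$regex": "record"}})

# The fields argument is a single comma-separated string.
fields = ",".join(["id", "name", "description"])
```

Serializing a dictionary avoids hand-written quoting mistakes, which are easy to make when the filter is typed as a raw string.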

get_record_by_id(collection_id: str, max_page_size: int = 100, fields: str = '')[source]

Get a record from the collection by its id. params:

collection_id: str

The id of the collection.

max_page_size: int

The maximum number of items to return per page. Default is 100.

fields: str

The fields to return. Default is all fields.

get_records(filter: str = '', max_page_size: int = 100, fields: str = '', all_pages: bool = False)[source]

Get a collection of data from the NMDC API. Generic function to get a collection of data from the NMDC API. Can provide a specific filter if desired. params:

filter: str

The filter to apply to the query. Default is an empty string.

max_page_size: int

The maximum number of items to return per page. Default is 100.

fields: str

The fields to return. Default is all fields.

all_pages: bool

True to return all pages. False to return the first page. Default is False.
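The effect of all_pages can be pictured with a generic pagination loop. How the API signals that more pages exist is not described here, so this sketch assumes a token-based scheme; `collect_pages`, `fetch_page`, and the token values are hypothetical names for illustration only.

```python
from typing import Callable, List, Optional, Tuple

Page = Tuple[List[str], Optional[str]]  # (items, token for the next page)

def collect_pages(fetch_page: Callable[[Optional[str]], Page],
                  all_pages: bool = False) -> List[str]:
    """Return the first page, or every page when all_pages is True."""
    results: List[str] = []
    token: Optional[str] = None
    while True:
        items, token = fetch_page(token)
        results.extend(items)
        # Stop after one page unless asked for all, or when no token remains.
        if not all_pages or token is None:
            return results

# Two-page stand-in for the API, used only to exercise the sketch.
FAKE_PAGES = {None: (["a", "b"], "t1"), "t1": (["c"], None)}

def fake_fetch_page(token: Optional[str]) -> Page:
    return FAKE_PAGES[token]
```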

Latitude Longitude Module

class nmdc_api_utilities.lat_long_filters.LatLongFilters(collection_name, env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to filter sets by latitude and longitude.

get_record_by_lat_long(lat_comparison: str, long_comparison: str, latitude: float, longitude: float, page_size=25, fields='', all_pages=False)[source]

Get a record from the NMDC API by latitude and longitude comparison. params:

lat_comparison: str
The comparison to use to query the record for latitude. MUST BE ONE OF THE FOLLOWING:

  • eq - Matches values that are equal to the given value.
  • gt - Matches values that are greater than the given value.
  • lt - Matches values that are less than the given value.
  • gte - Matches values that are greater than or equal to the given value.
  • lte - Matches values that are less than or equal to the given value.

long_comparison: str
The comparison to use to query the record for longitude. MUST BE ONE OF THE FOLLOWING:

  • eq - Matches values that are equal to the given value.
  • gt - Matches values that are greater than the given value.
  • lt - Matches values that are less than the given value.
  • gte - Matches values that are greater than or equal to the given value.
  • lte - Matches values that are less than or equal to the given value.

latitude: float

The latitude of the record to query.

longitude: float

The longitude of the record to query.

page_size: int

The number of results to return per page. Default is 25.

fields: str

The fields to return. Default is all fields. Example: “id,name,description,alternative_identifiers,file_size_bytes,md5_checksum,data_object_type,url,type”

all_pages: bool

True to return all pages. False to return the first page. Default is False.
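The comparison keywords correspond to MongoDB's comparison operators (`$eq`, `$gt`, `$lt`, `$gte`, `$lte`). The sketch below shows how such keywords could be validated and turned into a filter dictionary. It is not the package's implementation: `build_lat_long_filter` is a hypothetical helper, and the field paths `lat_lon.latitude` and `lat_lon.longitude` are assumptions about how coordinates are stored.

```python
ALLOWED_COMPARISONS = ("eq", "gt", "lt", "gte", "lte")

def build_lat_long_filter(lat_comparison: str, long_comparison: str,
                          latitude: float, longitude: float) -> dict:
    """Build a MongoDB-style filter from comparison keywords."""
    for comparison in (lat_comparison, long_comparison):
        if comparison not in ALLOWED_COMPARISONS:
            raise ValueError(
                f"comparison must be one of {ALLOWED_COMPARISONS}, got {comparison!r}"
            )
    return {
        # Field paths are assumed for illustration.
        "lat_lon.latitude": {f"${lat_comparison}": latitude},
        "lat_lon.longitude": {f"${long_comparison}": longitude},
    }

# Records at or north of 40.0 degrees and west of -100.0 degrees.
example_filter = build_lat_long_filter("gte", "lt", 40.0, -100.0)
```

Rejecting unknown keywords before building the filter gives a clear error locally instead of a failed or empty query.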

get_record_by_latitude(comparison: str, latitude: float, page_size=25, fields='', all_pages=False)[source]

Get a record from the NMDC API by latitude comparison. params:

comparison: str
The comparison to use to query the record. MUST BE ONE OF THE FOLLOWING:

  • eq - Matches values that are equal to the given value.
  • gt - Matches values that are greater than the given value.
  • lt - Matches values that are less than the given value.
  • gte - Matches values that are greater than or equal to the given value.
  • lte - Matches values that are less than or equal to the given value.

latitude: float

The latitude of the record to query.

page_size: int

The number of results to return per page. Default is 25.

fields: str

The fields to return. Default is all fields. Example: “id,name,description,alternative_identifiers,file_size_bytes,md5_checksum,data_object_type,url,type”

all_pages: bool

True to return all pages. False to return the first page. Default is False.

get_record_by_longitude(comparison: str, longitude: float, page_size=25, fields='', all_pages=False)[source]

Get a record from the NMDC API by longitude comparison. params:

comparison: str
The comparison to use to query the record. MUST BE ONE OF THE FOLLOWING:

  • eq - Matches values that are equal to the given value.
  • gt - Matches values that are greater than the given value.
  • lt - Matches values that are less than the given value.
  • gte - Matches values that are greater than or equal to the given value.
  • lte - Matches values that are less than or equal to the given value.

longitude: float

The longitude of the record to query.

page_size: int

The number of results to return per page. Default is 25.

fields: str

The fields to return. Default is all fields. Example: “id,name,description,alternative_identifiers,file_size_bytes,md5_checksum,data_object_type,url,type”

all_pages: bool

True to return all pages. False to return the first page. Default is False.

Functional Search Module

class nmdc_api_utilities.functional_search.FunctionalSearch(env='prod')[source]

Bases: object

Class to interact with the NMDC API to filter functional annotations by KEGG, COG, or PFAM ids.

get_functional_annotations(annotation: str, annotation_type: str, page_size=25, fields='', all_pages=False)[source]

Get a record from the NMDC API by id. ID types can be KEGG, COG, or PFAM. params:

annotation: str

The database id used to query the functional annotations.

annotation_type: str
The type of id to query. MUST be one of the following:

  • KEGG
  • COG
  • PFAM

page_size: int

The number of results to return per page. Default is 25.

fields: str

The fields to return. Default is all fields. Example: “id,name”

all_pages: bool

True to return all pages. False to return the first page. Default is False.
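Since annotation_type accepts only three values, a thin wrapper can reject anything else before a request is made. This is a sketch under stated assumptions: `fetch_annotations` is a hypothetical helper, the strict upper-case check is a choice made for the example, and the shape of the returned value is not described here.

```python
VALID_ANNOTATION_TYPES = ("KEGG", "COG", "PFAM")

def fetch_annotations(annotation: str, annotation_type: str,
                      all_pages: bool = False):
    """Validate annotation_type locally, then query the API."""
    if annotation_type not in VALID_ANNOTATION_TYPES:
        raise ValueError(
            f"annotation_type must be one of {VALID_ANNOTATION_TYPES}, "
            f"got {annotation_type!r}"
        )
    # Deferred import: only needed once the arguments are known to be valid.
    from nmdc_api_utilities.functional_search import FunctionalSearch

    client = FunctionalSearch(env="prod")
    return client.get_functional_annotations(
        annotation,
        annotation_type,
        page_size=25,
        fields="id,name",
        all_pages=all_pages,
    )
```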

get_records(filter: str = '', max_page_size: int = 100, fields: str = '', all_pages: bool = False)[source]

Get a collection of data from the NMDC API. Generic function to get a collection of data from the NMDC API. Can provide a specific filter if desired. params:

filter: str

The filter to apply to the query. Default is an empty string.

max_page_size: int

The maximum number of items to return per page. Default is 100.

fields: str

The fields to return. Default is all fields.

all_pages: bool

True to return all pages. False to return the first page. Default is False.

Collection Helpers

class nmdc_api_utilities.collection_helpers.CollectionHelpers(env='prod')[source]

Bases: NMDCSearch

Class to interact with the NMDC API to get additional information about collections. These functions may not be specific to a particular collection.

get_record_name_from_id(doc_id: str)[source]

Used when you have an id but not the collection name. Determines the schema class to which the id belongs. params:

doc_id: str

The id of the document.

Metadata Module

class nmdc_api_utilities.metadata.Metadata(env='prod')[source]

Bases: NMDCSearch

Class to interact with the NMDC API metadata.

validate_json(json_path) -> None[source]

Validates a json file using the NMDC json validate endpoint.

If the validation passes, the method returns without any side effects.

Parameters:

json_path (str) – The path to the json file to be validated.

Raises:

Exception – If the validation fails.
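Because validate_json signals failure by raising, a caller may want to confirm that the file at least parses as JSON before contacting the endpoint. The sketch below does that; `check_then_validate` is a hypothetical helper and the local parse step is an addition for illustration, not part of the package.

```python
import json
from pathlib import Path

def check_then_validate(json_path: str) -> None:
    """Parse the file locally, then submit it for NMDC validation."""
    text = Path(json_path).read_text(encoding="utf-8")
    # Raises json.JSONDecodeError (a ValueError) if the file is not valid JSON.
    json.loads(text)

    # Deferred import: reached only when the file parses locally.
    from nmdc_api_utilities.metadata import Metadata

    # Returns None on success and raises if validation fails.
    Metadata(env="prod").validate_json(json_path)
```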

Mint Module

class nmdc_api_utilities.minter.Minter(env='prod')[source]

Bases: NMDCSearch

Class to interact with the NMDC API to mint new identifiers.

mint(nmdc_type: str, client_id: str, client_secret: str) -> str[source]

Mint a new identifier for a collection. params:

nmdc_type: str

The type of NMDC ID to mint (e.g., ‘nmdc:MassSpectrometry’, ‘nmdc:DataObject’).

client_id: str

The client ID for the NMDC API.

client_secret: str

The client secret for the NMDC API.

Returns:

str - the new identifier.

Security Note:

Your client_id and client_secret should be stored in a secure location. We recommend using environment variables. Do not hard code these values in your code.
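One way to follow this recommendation is to read the credentials from environment variables at call time. In this sketch the variable names `NMDC_CLIENT_ID` and `NMDC_CLIENT_SECRET` and the helpers `load_credentials` and `mint_data_object_id` are hypothetical; any names can be used as long as the values stay out of the source code.

```python
import os
from typing import Mapping, Tuple

# Hypothetical variable names chosen for this example.
ID_VAR = "NMDC_CLIENT_ID"
SECRET_VAR = "NMDC_CLIENT_SECRET"

def load_credentials(environ: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read the client id and secret, failing loudly if either is unset."""
    missing = [name for name in (ID_VAR, SECRET_VAR) if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return environ[ID_VAR], environ[SECRET_VAR]

def mint_data_object_id() -> str:
    """Mint one identifier of type nmdc:DataObject."""
    # Deferred import: requires the package and network access when called.
    from nmdc_api_utilities.minter import Minter

    client_id, client_secret = load_credentials()
    return Minter(env="prod").mint("nmdc:DataObject", client_id, client_secret)
```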

BioSample Subclass

class nmdc_api_utilities.biosample_search.BiosampleSearch(env='prod')[source]

Bases: LatLongFilters, CollectionSearch

Class to interact with the NMDC API to get biosamples.

Calibration Subclass

class nmdc_api_utilities.calibration_search.CalibrationSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get calibration records.

Chemical Entity Subclass

class nmdc_api_utilities.chemical_entity_search.ChemicalEntitySearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get chemical entities.

Collecting Biosamples From Site Subclass

class nmdc_api_utilities.collecting_biosamples_from_site_search.CollectingBiosamplesFromSiteSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get collecting biosamples from site sets.

Configuration Subclass

class nmdc_api_utilities.configuration_search.ConfigurationSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get configuration sets.

Data Generation Subclass

class nmdc_api_utilities.data_generation_search.DataGenerationSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get data generation sets.

Data Object Subclass

class nmdc_api_utilities.data_object_search.DataObjectSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get data object sets.

get_data_objects_for_studies(study_id: str, max_page_size: int = 100)[source]

Get data objects by study id. params:

study_id: str

The study id to search for.

max_page_size: int

The maximum number of items to return per page. Default is 100.

Returns:

The results of the query.

Return type:

results

Field Research Site Subclass

class nmdc_api_utilities.field_research_site_search.FieldResearchSiteSearch(env='prod')[source]

Bases: LatLongFilters, CollectionSearch

Class to interact with the NMDC API to get field research site sets.

Instrument Subclass

class nmdc_api_utilities.instrument_search.InstrumentSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get instrument sets.

Manifest Subclass

class nmdc_api_utilities.manifest_search.ManifestSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get genome manifest sets.

Material Processing Subclass

class nmdc_api_utilities.material_processing_search.MaterialProcessingSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get material processing sets.

Processed Sample Subclass

class nmdc_api_utilities.processed_sample_search.ProcessedSampleSearch[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get processed sample sets.

Protocol Execution Subclass

class nmdc_api_utilities.protocol_execution_search.ProtocolExecutionSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get protocol execution sets.

Storage Process Subclass

class nmdc_api_utilities.storage_process_search.StorageProcessSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get storage process sets.

Study Subclass

class nmdc_api_utilities.study_search.StudySearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get studies.

Functional Annotation Agg Subclass

class nmdc_api_utilities.functional_annotation_agg_search.FunctionalAnnotationAggSearch(env='prod')[source]

Bases: FunctionalSearch

Class to interact with the NMDC API to get functional annotation agg sets. These are most helpful when trying to identify workflows associated with KEGG, COG, or PFAM ids.

Workflow Execution Subclass

class nmdc_api_utilities.workflow_execution_search.WorkflowExecutionSearch(env='prod')[source]

Bases: CollectionSearch

Class to interact with the NMDC API to get workflow execution sets.

Data Processing

class nmdc_api_utilities.data_processing.DataProcessing[source]

Bases: object

build_filter(attributes, exact_match=False)[source]

Create a MongoDB filter using $regex for each attribute in the input dictionary. For nested attributes, use dot notation.

Parameters:
  • attributes (dict) – Dictionary of attribute names and their corresponding values to match using regex. Example: {“name”: “example”, “description”: “example”, “geo_loc_name”: “example”}

  • exact_match (bool) – Whether each attribute value must match exactly or may match partially. Default is False, meaning a partial match is accepted. Under the hood, this determines whether the attribute value is wrapped in a regular expression.

Returns: dict: A MongoDB filter dictionary.
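The behavior described above can be approximated with a few lines of Python. This is a sketch, not the package's code: `sketch_build_filter` is a hypothetical name, and escaping the value with `re.escape` is an assumption about how special characters might be handled.

```python
import re

def sketch_build_filter(attributes: dict, exact_match: bool = False) -> dict:
    """Return a MongoDB-style filter for the given attribute values."""
    filter_dict = {}
    for name, value in attributes.items():
        if exact_match:
            # Exact match: compare against the value as given.
            filter_dict[name] = value
        else:
            # Partial match: wrap the escaped value in a $regex condition.
            filter_dict[name] = {"$regex": re.escape(value)}
    return filter_dict

partial = sketch_build_filter({"name": "soil", "geo_loc_name": "USA"})
exact = sketch_build_filter({"name": "soil"}, exact_match=True)
```

Nested attributes fit the same pattern, because a dot-notation key such as "lat_lon.latitude" is just another dictionary key.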

convert_to_df(data: list) -> DataFrame[source]

Convert a list of dictionaries to a pandas dataframe. params:

data: list

A list of dictionaries.

merge_dataframes(column: str, df1: DataFrame, df2: DataFrame) -> DataFrame[source]

Merge two dataframes. params:

column: str

The column to merge on.

df1: pd.DataFrame

The first dataframe to merge.

df2: pd.DataFrame

The second dataframe to merge.

Returns:

pd.DataFrame

merge_df(df1, df2, key1: str, key2: str)[source]

Merging function for joining results. It merges new results with the previous results that were used to make the new API request, matching on one key from each result. params:

df1 and df2 are the two dataframes that need to be merged. key1 is the column name in df1 that will be used to match with key2 in df2.

This function automatically identifies columns that need to be exploded because they contain list-like elements, as drop_duplicates can’t handle list elements.
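The reason for exploding can be seen without pandas: lists are unhashable, so rows containing them cannot be compared for duplicates until each list element gets its own row. The plain-Python analogue below illustrates the idea on a list of dictionaries; `explode_rows` and `drop_duplicate_rows` are hypothetical helpers, not functions from the package.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

def explode_rows(rows: List[Row], column: str) -> List[Row]:
    """Emit one row per element of a list-valued column."""
    exploded: List[Row] = []
    for row in rows:
        value = row.get(column)
        if isinstance(value, list):
            # An empty list still yields one row, with None in the column.
            for item in (value or [None]):
                exploded.append({**row, column: item})
        else:
            exploded.append(dict(row))
    return exploded

def drop_duplicate_rows(rows: List[Row]) -> List[Row]:
    """Keep the first occurrence of each row; values must be hashable."""
    seen = set()
    unique: List[Row] = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [{"id": "a", "tags": ["x", "y"]}, {"id": "a", "tags": ["x"]}]
deduplicated = drop_duplicate_rows(explode_rows(rows, "tags"))
```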

rename_columns(df: DataFrame, new_col_names: list) -> DataFrame[source]

Rename columns in a pandas dataframe. params:

df: pd.DataFrame

The pandas dataframe to rename columns.

new_col_names: list

A list of new column names. Names MUST be in order of the columns in the dataframe.

Example:

If the current column names are ['old_col1', 'old_col2', 'old_col3'], pass in the new names as ['new_col1', 'new_col2', 'new_col3'].
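Because the new names are matched to columns purely by position, a length mismatch or a wrong order silently mislabels data. The sketch below builds the old-to-new mapping explicitly so it can be inspected first; `positional_rename_map` is a hypothetical helper, not part of the package.

```python
from typing import Dict, List

def positional_rename_map(current: List[str], new: List[str]) -> Dict[str, str]:
    """Pair each current column name with the new name at the same position."""
    if len(current) != len(new):
        raise ValueError(
            f"expected {len(current)} new column names, got {len(new)}"
        )
    return dict(zip(current, new))

mapping = positional_rename_map(
    ["old_col1", "old_col2", "old_col3"],
    ["new_col1", "new_col2", "new_col3"],
)
```

With pandas available, a mapping of this shape can be passed to `DataFrame.rename(columns=mapping)`.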

Utils

class nmdc_api_utilities.utils.Utils[source]

Bases: object