NMDC API Utilities Documentation
Welcome to NMDC API Utilities documentation. This package provides tools for interacting with the NMDC API.
The Collection Module is a foundational component that defines common behaviors and properties between collections.
Each subclass is designed to be more user-friendly and specific for certain collections, making them the recommended entry points for using the package. Each function of CollectionSearch can be accessed via each subclass.
NMDC Module
Collection Module
- class nmdc_api_utilities.collection_search.CollectionSearch(collection_name, env='prod')[source]
Bases:
NMDCSearch
Class to interact with the NMDC API to get collections of data. Must know the collection name to query.
- check_ids_exist(ids: list, chunk_size=100) bool [source]
Check if the IDs exist in the collection.
This method constructs a query to the API to filter the collection based on the given IDs, and checks if all IDs exist in the collection.
- Parameters:
ids (list) – A list of IDs to check if they exist in the collection.
chunk_size (int) – The number of IDs to check in each query. Default is 100.
- Returns:
True if all IDs exist in the collection, False otherwise.
- Return type:
bool
- Raises:
requests.RequestException – If there’s an error in making the API request.
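The chunking behaviour described for check_ids_exist can be pictured with the minimal sketch below. It is not the package's implementation: the `$in` filter shape and the `fetch_ids` callable (a stand-in for the API request) are assumptions made for illustration.

```python
import json


def chunked(ids, chunk_size=100):
    """Yield successive slices of at most chunk_size IDs."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(ids), chunk_size):
        yield ids[start:start + chunk_size]


def build_id_filter(id_chunk):
    """Return a MongoDB-style filter string matching any ID in the chunk.

    The {"id": {"$in": [...]}} shape is an assumption, not the library's
    confirmed query.
    """
    return json.dumps({"id": {"$in": list(id_chunk)}})


def all_ids_exist(ids, fetch_ids, chunk_size=100):
    """Return True only if every ID is found.

    fetch_ids is a hypothetical callable that takes a filter string and
    returns the IDs the collection actually contains for that filter.
    """
    for chunk in chunked(ids, chunk_size):
        found = set(fetch_ids(build_id_filter(chunk)))
        if not set(chunk) <= found:
            return False  # at least one ID in this chunk is missing
    return True
```

Querying in chunks keeps each request small; the sketch stops at the first chunk that has a missing ID.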
- get_record_by_attribute(attribute_name, attribute_value, max_page_size=25, fields='', all_pages=False, exact_match=False)[source]
Get a record from the NMDC API by one of its attributes. Records can be filtered based on the attributes listed at https://microbiomedata.github.io/nmdc-schema/. params:
- attribute_name: str
The name of the attribute to filter by.
- attribute_value: str
The value of the attribute to filter by.
- max_page_size: int
The number of results to return per page. Default is 25.
- fields: str
The fields to return. Default is all fields.
- all_pages: bool
True to return all pages. False to return the first page. Default is False.
- exact_match: bool
Whether attribute_value must match exactly or may match partially. Default is False, meaning a partial match is accepted. Internally, this determines whether the attribute value is wrapped in a regular expression.
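A possible way to call get_record_by_attribute is sketched below. The helper `find_by_name` is hypothetical; it accepts any object exposing the documented method, so it can be tried without network access. The keyword names come from the signature above, while the chosen attribute ("name") and field list are only examples.

```python
def find_by_name(client, text, exact=False):
    """Search a collection by its "name" attribute.

    client is any CollectionSearch-like object (for example one of the
    subclasses documented on this page).
    """
    return client.get_record_by_attribute(
        attribute_name="name",
        attribute_value=text,
        max_page_size=25,
        fields="id,name",
        all_pages=False,
        exact_match=exact,
    )


# With the package installed, a real client could be passed in like this
# (requires network access, so it is left commented out):
#   from nmdc_api_utilities.study_search import StudySearch
#   records = find_by_name(StudySearch(), "soil")
```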
- get_record_by_filter(filter: str, max_page_size=25, fields='', all_pages=False)[source]
Get a record from the NMDC API using a filter. params:
- filter: str
- The filter to use to query the collection. Must be in MongoDB query format.
Resources found here - https://www.mongodb.com/docs/manual/reference/method/db.collection.find/#std-label-method-find-query
Example: {"name": "my record name"}
- max_page_size: int
The number of results to return per page. Default is 25.
- fields: str
The fields to return. Default is all fields. Example: “id,name,description,alternative_identifiers,file_size_bytes,md5_checksum,data_object_type,url,type”
- all_pages: bool
True to return all pages. False to return the first page. Default is False.
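Because filter and fields are both plain strings, it can help to build them from Python objects rather than typing them by hand. The sketch below uses only the standard library; `make_filter` and `make_fields` are hypothetical helper names, not part of the package.

```python
import json


def make_filter(conditions):
    """Serialize a dict of MongoDB query conditions to a JSON string."""
    return json.dumps(conditions)


def make_fields(names):
    """Join field names into a comma-separated string."""
    return ",".join(names)


name_filter = make_filter({"name": "my record name"})
fields = make_fields(["id", "name", "description"])

# The resulting strings could then be passed along, for example:
#   client.get_record_by_filter(filter=name_filter, fields=fields)
```

Serializing with json.dumps avoids quoting mistakes that are easy to make when writing the filter string manually.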
- get_record_by_id(collection_id: str, max_page_size: int = 100, fields: str = '')[source]
Get a collection of data from the NMDC API by id. params:
- collection_id: str
The id of the collection.
- max_page_size: int
The maximum number of items to return per page. Default is 100.
- fields: str
The fields to return. Default is all fields.
- get_records(filter: str = '', max_page_size: int = 100, fields: str = '', all_pages: bool = False)[source]
Generic function to get a collection of data from the NMDC API. A specific filter can be provided if desired. params:
- filter: str
The filter to apply to the query. Default is an empty string.
- max_page_size: int
The maximum number of items to return per page. Default is 100.
- fields: str
The fields to return. Default is all fields.
- all_pages: bool
True to return all pages. False to return the first page. Default is False.
Latitude Longitude Module
- class nmdc_api_utilities.lat_long_filters.LatLongFilters(collection_name, env='prod')[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to filter sets by latitude and longitude.
- get_record_by_lat_long(lat_comparison: str, long_comparison: str, latitude: float, longitude: float, page_size=25, fields='', all_pages=False)[source]
Get a record from the NMDC API by latitude and longitude comparison. params:
- lat_comparison: str
- The comparison to use to query the record for latitude. MUST BE ONE OF THE FOLLOWING:
eq - Matches values that are equal to the given value.
gt - Matches values that are greater than the given value.
lt - Matches values that are less than the given value.
gte - Matches values that are greater than or equal to the given value.
lte - Matches values that are less than or equal to the given value.
- long_comparison: str
- The comparison to use to query the record for longitude. MUST BE ONE OF THE FOLLOWING:
eq - Matches values that are equal to the given value.
gt - Matches values that are greater than the given value.
lt - Matches values that are less than the given value.
gte - Matches values that are greater than or equal to the given value.
lte - Matches values that are less than or equal to the given value.
- latitude: float
The latitude of the record to query.
- longitude: float
The longitude of the record to query.
- page_size: int
The number of results to return per page. Default is 25.
- fields: str
The fields to return. Default is all fields. Example: “id,name,description,alternative_identifiers,file_size_bytes,md5_checksum,data_object_type,url,type”
- all_pages: bool
True to return all pages. False to return the first page. Default is False.
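The comparison keywords accepted by get_record_by_lat_long correspond naturally to MongoDB comparison operators. The sketch below shows one way such a filter could be assembled. The field paths `lat_lon.latitude` and `lat_lon.longitude` are assumptions about the record layout, and `lat_long_filter` is a hypothetical helper rather than the library's code.

```python
import json

# Comparison keywords listed above, mapped to MongoDB operators.
COMPARISONS = {"eq": "$eq", "gt": "$gt", "lt": "$lt", "gte": "$gte", "lte": "$lte"}


def lat_long_filter(lat_comparison, long_comparison, latitude, longitude,
                    lat_field="lat_lon.latitude",
                    long_field="lat_lon.longitude"):
    """Build a filter string combining a latitude and a longitude condition.

    The default field paths are assumed, not confirmed by this page.
    """
    for comparison in (lat_comparison, long_comparison):
        if comparison not in COMPARISONS:
            raise ValueError(
                f"comparison must be one of {sorted(COMPARISONS)}, got {comparison!r}"
            )
    return json.dumps({
        lat_field: {COMPARISONS[lat_comparison]: latitude},
        long_field: {COMPARISONS[long_comparison]: longitude},
    })
```

Validating the keyword before building the query gives a clear error for unsupported comparisons such as "ne".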
- get_record_by_latitude(comparison: str, latitude: float, page_size=25, fields='', all_pages=False)[source]
Get a record from the NMDC API by latitude comparison. params:
- comparison: str
- The comparison to use to query the record. MUST BE ONE OF THE FOLLOWING:
eq - Matches values that are equal to the given value.
gt - Matches values that are greater than the given value.
lt - Matches values that are less than the given value.
gte - Matches values that are greater than or equal to the given value.
lte - Matches values that are less than or equal to the given value.
- latitude: float
The latitude of the record to query.
- page_size: int
The number of results to return per page. Default is 25.
- fields: str
The fields to return. Default is all fields. Example: “id,name,description,alternative_identifiers,file_size_bytes,md5_checksum,data_object_type,url,type”
- all_pages: bool
True to return all pages. False to return the first page. Default is False.
- get_record_by_longitude(comparison: str, longitude: float, page_size=25, fields='', all_pages=False)[source]
Get a record from the NMDC API by longitude comparison. params:
- comparison: str
- The comparison to use to query the record. MUST BE ONE OF THE FOLLOWING:
eq - Matches values that are equal to the given value.
gt - Matches values that are greater than the given value.
lt - Matches values that are less than the given value.
gte - Matches values that are greater than or equal to the given value.
lte - Matches values that are less than or equal to the given value.
- longitude: float
The longitude of the record to query.
- page_size: int
The number of results to return per page. Default is 25.
- fields: str
The fields to return. Default is all fields. Example: “id,name,description,alternative_identifiers,file_size_bytes,md5_checksum,data_object_type,url,type”
- all_pages: bool
True to return all pages. False to return the first page. Default is False.
Functional Search Module
- class nmdc_api_utilities.functional_search.FunctionalSearch(env='prod')[source]
Bases:
object
Class to interact with the NMDC API to filter functional annotations by KEGG, COG, or PFAM ids.
- get_functional_annotations(annotation: str, annotation_type: str, page_size=25, fields='', all_pages=False)[source]
Get a record from the NMDC API by id. ID types can be KEGG, COG, or PFAM. params:
- annotation: str
The database ID used to query the functional annotations.
- annotation_type: str
- The type of ID to query. MUST be one of the following:
KEGG, COG, PFAM
- page_size: int
The number of results to return per page. Default is 25.
- fields: str
The fields to return. Default is all fields. Example: “id,name”
- all_pages: bool
True to return all pages. False to return the first page. Default is False.
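Since annotation_type accepts only three values, a caller may want to check it before making a request. The sketch below wraps get_functional_annotations in a hypothetical helper, `fetch_annotations`, that does this check. Upper-casing the type is a convenience chosen here, not documented library behaviour.

```python
ALLOWED_TYPES = ("KEGG", "COG", "PFAM")


def fetch_annotations(client, annotation, annotation_type):
    """Validate annotation_type, then query a FunctionalSearch-like client."""
    kind = annotation_type.upper()
    if kind not in ALLOWED_TYPES:
        raise ValueError(
            f"annotation_type must be one of {ALLOWED_TYPES}, got {annotation_type!r}"
        )
    return client.get_functional_annotations(
        annotation=annotation,
        annotation_type=kind,
        page_size=25,
        fields="id,name",
        all_pages=False,
    )


# With the package installed (network access required), a client such as
# FunctionalAnnotationAggSearch() could be passed as the first argument.
# The exact ID format expected for `annotation` is not specified on this page.
```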
- get_records(filter: str = '', max_page_size: int = 100, fields: str = '', all_pages: bool = False)[source]
Generic function to get a collection of data from the NMDC API. A specific filter can be provided if desired. params:
- filter: str
The filter to apply to the query. Default is an empty string.
- max_page_size: int
The maximum number of items to return per page. Default is 100.
- fields: str
The fields to return. Default is all fields.
- all_pages: bool
True to return all pages. False to return the first page. Default is False.
Collection Helpers
- class nmdc_api_utilities.collection_helpers.CollectionHelpers(env='prod')[source]
Bases:
NMDCSearch
Class to interact with the NMDC API to get additional information about collections. These functions may not be specific to a particular collection.
Metadata Module
- class nmdc_api_utilities.metadata.Metadata(env='prod')[source]
Bases:
NMDCSearch
Class to interact with the NMDC API metadata.
Mint Module
- class nmdc_api_utilities.minter.Minter(env='prod')[source]
Bases:
NMDCSearch
Class to interact with the NMDC API to mint new identifiers.
- mint(nmdc_type: str, client_id: str, client_secret: str) str [source]
Mint a new identifier for a collection. params:
- nmdc_type: str
The type of NMDC ID to mint (e.g., ‘nmdc:MassSpectrometry’, ‘nmdc:DataObject’).
- client_id: str
The client ID for the NMDC API.
- client_secret: str
The client secret for the NMDC API.
- Returns:
str - the new identifier.
- Security Note:
Your client_id and client_secret should be stored in a secure location. We recommend using environment variables. Do not hard code these values in your code.
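One way to follow the environment-variable recommendation is sketched below. The variable names `NMDC_CLIENT_ID` and `NMDC_CLIENT_SECRET` and the helpers `load_credentials` and `mint_id` are hypothetical; only the mint() keyword names come from the signature above.

```python
import os


def load_credentials(id_var="NMDC_CLIENT_ID", secret_var="NMDC_CLIENT_SECRET"):
    """Read the client ID and secret from environment variables.

    The variable names are illustrative; use whatever names your
    deployment defines.
    """
    try:
        return os.environ[id_var], os.environ[secret_var]
    except KeyError as missing:
        raise RuntimeError(f"Environment variable {missing} is not set") from None


def mint_id(minter, nmdc_type):
    """Mint an identifier without ever hard-coding the credentials."""
    client_id, client_secret = load_credentials()
    return minter.mint(
        nmdc_type=nmdc_type,
        client_id=client_id,
        client_secret=client_secret,
    )


# With the package installed (network access required):
#   from nmdc_api_utilities.minter import Minter
#   new_id = mint_id(Minter(), "nmdc:DataObject")
```

Failing early with a clear error when a variable is missing avoids sending an incomplete request to the API.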
BioSample Subclass
- class nmdc_api_utilities.biosample_search.BiosampleSearch(env='prod')[source]
Bases:
LatLongFilters, CollectionSearch
Class to interact with the NMDC API to get biosamples.
Calibration Subclass
- class nmdc_api_utilities.calibration_search.CalibrationSearch(env='prod')[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to get calibration records.
Chemical Entity Subclass
- class nmdc_api_utilities.chemical_entity_search.ChemicalEntitySearch(env='prod')[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to get chemical entities.
Collecting Biosamples From Site Subclass
- class nmdc_api_utilities.collecting_biosamples_from_site_search.CollectingBiosamplesFromSiteSearch(env='prod')[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to get collecting biosamples from site sets.
Configuration Subclass
- class nmdc_api_utilities.configuration_search.ConfigurationSearch(env='prod')[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to get configuration sets.
Data Generation Subclass
- class nmdc_api_utilities.data_generation_search.DataGenerationSearch(env='prod')[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to get data generation sets.
Data Object Subclass
- class nmdc_api_utilities.data_object_search.DataObjectSearch(env='prod')[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to get data object sets.
Field Research From Site Subclass
- class nmdc_api_utilities.field_research_site_search.FieldResearchSiteSearch(env='prod')[source]
Bases:
LatLongFilters, CollectionSearch
Class to interact with the NMDC API to get field research site sets.
Instrument Subclass
- class nmdc_api_utilities.instrument_search.InstrumentSearch(env='prod')[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to get instrument sets.
Manifest Subclass
- class nmdc_api_utilities.manifest_search.ManifestSearch(env='prod')[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to get genome manifest sets.
Material Subclass
- class nmdc_api_utilities.material_processing_search.MaterialProcessingSearch(env='prod')[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to get material processing sets.
Processed Sample Subclass
- class nmdc_api_utilities.processed_sample_search.ProcessedSampleSearch[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to get processed sample sets.
Protocol Execution Subclass
- class nmdc_api_utilities.protocol_execution_search.ProtocolExecutionSearch(env='prod')[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to get protocol execution sets.
Storage Process Subclass
- class nmdc_api_utilities.storage_process_search.StorageProcessSearch(env='prod')[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to get storage process sets.
Study Subclass
- class nmdc_api_utilities.study_search.StudySearch(env='prod')[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to get studies.
Functional Annotation Agg Subclass
- class nmdc_api_utilities.functional_annotation_agg_search.FunctionalAnnotationAggSearch(env='prod')[source]
Bases:
FunctionalSearch
Class to interact with the NMDC API to get functional annotation agg sets. These are most helpful when trying to identify workflows associated with KEGG, COG, or PFAM ids.
Workflow Execution Subclass
- class nmdc_api_utilities.workflow_execution_search.WorkflowExecutionSearch(env='prod')[source]
Bases:
CollectionSearch
Class to interact with the NMDC API to get workflow execution sets.
Data Processing
- class nmdc_api_utilities.data_processing.DataProcessing[source]
Bases:
object
- build_filter(attributes, exact_match=False)[source]
Create a MongoDB filter using $regex for each attribute in the input dictionary. For nested attributes, use dot notation.
- Parameters:
attributes (dict) – Dictionary of attribute names and their corresponding values to match using regex. Example: {“name”: “example”, “description”: “example”, “geo_loc_name”: “example”}
exact_match (bool) – Whether each attribute value must match exactly or may match partially. Default is False, meaning a partial match is accepted. Internally, this determines whether the attribute value is wrapped in a regular expression.
Returns: dict: A MongoDB filter dictionary.
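The behaviour described for build_filter can be approximated by the sketch below. It is an illustration under assumptions, not the package's source: `build_regex_filter` is a hypothetical name, and escaping the value with re.escape is a choice made here so that characters such as "." are matched literally.

```python
import re


def build_regex_filter(attributes, exact_match=False):
    """Build a MongoDB filter dict from attribute names and values.

    With exact_match=False each value is wrapped in a $regex condition,
    giving a partial match. With exact_match=True the value is used as-is.
    Nested attributes can be addressed with dot notation in the key.
    """
    filter_dict = {}
    for name, value in attributes.items():
        if exact_match:
            filter_dict[name] = value
        else:
            # Escaping is an assumption; the library may pass the raw value.
            filter_dict[name] = {"$regex": re.escape(str(value))}
    return filter_dict
```

The returned dict can be serialized with json.dumps when a filter string is required.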
- convert_to_df(data: list) DataFrame [source]
Convert a list of dictionaries to a pandas dataframe. params:
- data: list
A list of dictionaries.
- merge_dataframes(column: str, df1: DataFrame, df2: DataFrame) DataFrame [source]
Merge two dataframes. params:
- column: str
The column to merge on.
- df1: pd.DataFrame
The first dataframe to merge.
- df2: pd.DataFrame
The second dataframe to merge.
- Returns:
pd.DataFrame
- merge_df(df1, df2, key1: str, key2: str)[source]
Merge new results with the previous results that were used for the new API request. Rows are matched on two keys, one from each result. params:
df1 and df2 are the two dataframes that need to be merged. key1 is the column name in df1 that will be used to match with key2 in df2.
This function automatically identifies columns that need to be exploded because they contain list-like elements, as drop_duplicates can’t handle list elements.
- rename_columns(df: DataFrame, new_col_names: list) DataFrame [source]
Rename columns in a pandas dataframe. params:
- df: pd.DataFrame
The pandas dataframe to rename columns.
- new_col_names: list
A list of new column names. Names MUST be in order of the columns in the dataframe.
- Example:
If the current column names are - [‘old_col1’, ‘old_col2’, ‘old_col3’] You will need to pass in the new names like - [‘new_col1’, ‘new_col2’, ‘new_col3’]