This page provides documentation for the functions in the package.

Bio Ontology API Class

class src.bio_ontology_api.BioOntologyInfoRetriever(bio_api_key: str)[source]

Bases: object

Client for retrieving ENVO term information from BioPortal API.

A class to handle authentication and retrieval of Environmental Ontology (ENVO) terms using the BioPortal REST API service.

Parameters:

bio_api_key (str) – The BioPortal BioOntology API key for authentication.

Notes

The configuration file should contain an ‘api_key’ field with a valid BioPortal API key.

Examples

>>> retriever = BioOntologyInfoRetriever(bio_api_key='YOUR_BIOPORTAL_API_KEY')
>>> envo_terms = retriever.get_envo_terms('ENVO:00002042')
>>> print(envo_terms)
{'ENVO:00002042': 'surface water'}
get_envo_terms(envo_id: dict) dict[source]

Look up an ENVO term label using BioPortal API.

Parameters:

envo_id (dict) – The ENVO identifier to look up (e.g., ‘ENVO:00002042’)

Returns:

Dictionary with envo_id as key and term label as value Example: {‘ENVO:00002042’: ‘surface water’}

Return type:

dict

Notes

Makes an authenticated request to BioPortal API to retrieve the preferred label (prefLabel) for the given ENVO term.
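The request described above can be sketched without touching the network. This is a minimal illustration, assuming the class endpoint and the "apikey token=" authorization header of the public BioPortal REST API; the helper names (build_class_url, build_auth_header, extract_label) are hypothetical and are not part of this package.

```python
from urllib.parse import quote

BIOPORTAL_BASE = "https://data.bioontology.org"   # public BioPortal REST endpoint
OBO_PREFIX = "http://purl.obolibrary.org/obo/"    # IRI prefix used by OBO ontologies


def build_class_url(envo_id: str) -> str:
    """Build the BioPortal class URL for an ENVO CURIE such as 'ENVO:00002042'."""
    iri = OBO_PREFIX + envo_id.replace(":", "_")
    # The class IRI is used as a path segment, so every reserved character is encoded.
    return f"{BIOPORTAL_BASE}/ontologies/ENVO/classes/{quote(iri, safe='')}"


def build_auth_header(bio_api_key: str) -> dict:
    """Return the authorization header BioPortal expects for an API key."""
    return {"Authorization": f"apikey token={bio_api_key}"}


def extract_label(envo_id: str, payload: dict) -> dict:
    """Map the ENVO id to the preferred label found in a decoded JSON response."""
    return {envo_id: payload["prefLabel"]}
```

With a decoded response such as {"prefLabel": "surface water"}, extract_label returns the same shape as the documented return value.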

Metadata Parser Base Class

class src.metadata_parser.MetadataParser[source]

Bases: object

Parses metadata from the input metadata spreadsheet.

create_controlled_identified_term_value(row_value: str, slot_enum_dict: dict) dict[source]

Create a controlled identified term value.

Parameters:
  • row_value (str) – The raw value to be converted.

  • slot_enum_dict (dict) – A dictionary mapping the raw value to its corresponding term.

Returns:

A dictionary representing the controlled identified term.

Return type:

dict

create_geo_loc_value(raw_value: str) dict[source]

Create a geolocation value representation.

Parameters:

raw_value (str) – The raw value associated with geolocation.

Returns:

A dictionary representing the geolocation value.

Return type:

dict

create_quantity_value(value_dict: dict = None) dict[source]

Create a quantity value representation. Since a dictionary is passed in, we need to check if any of the values are None and remove them if so. Also adds the Quantity value type.

Parameters:

value_dict (dict) –

A dictionary containing the raw value and other attributes gathered from the metadata. This is a dict of the form:

{"has_numeric_value": float, "has_minimum_numeric_value": float, "has_maximum_numeric_value": float, "has_unit": str, "has_raw_value": str}

The keys in the dictionary are the attributes of the QuantityValue class. They may be passed in as None if they are not present in the metadata.

Returns:

A dictionary representing the quantity value.

Return type:

dict
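The None-filtering behavior described for create_quantity_value can be sketched as a standalone function. This is an illustration only: storing the type under a "type" key is an assumption, and the type string is taken from NmdcTypes.QuantityValue as documented on this page.

```python
def create_quantity_value(value_dict=None) -> dict:
    """Drop None-valued attributes and tag the result as a quantity value."""
    value_dict = value_dict or {}
    # Keep only the attributes that were actually present in the metadata.
    cleaned = {key: value for key, value in value_dict.items() if value is not None}
    # Assumption: the NMDC type is stored under the "type" key.
    cleaned["type"] = "nmdc:QuantityValue"
    return cleaned
```

Note that a numeric value of 0 is kept, because only None is treated as missing.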

create_text_value(row_value: str, is_list: bool) dict[source]

Create a text value representation.

Parameters:
  • row_value (str) – The raw value to convert.

  • is_list (bool) – Whether to treat the value as a list.

Returns:

A dictionary representing the text value.

Return type:

dict

create_timestamp_value(raw_value: str) dict[source]

Create a timestamp value representation.

Parameters:

raw_value (str) – The raw value to convert to a timestamp.

Returns:

A dictionary representing the timestamp value.

Return type:

dict

dynam_parse_biosample_metadata(row: Series, bio_api_key: str) dict[source]

Function to parse the metadata row if it includes biosample information. This pulls the most recent version of the ontology terms from the API and compares them to the values in the given row. Different parsing is done on different types of fields, such as lists, controlled identified terms, and text values to ensure the correct format is used.

Parameters:
  • row (pd.Series) – A row from the DataFrame containing metadata.

  • bio_api_key (str) – The API key to access the Bio Ontology API

Returns:

metadata – The metadata dictionary.

Return type:

dict

generate_example_biosample_csv(file_path: str = 'example_biosample_metadata.csv')[source]

Function to generate an example csv file from available NMDCSchema Biosample fields. Saves the file to the given path.

Parameters:

file_path (str) – The path to save the example CSV file. Default is “example_biosample_metadata.csv”.

Return type:

None

get_value(row: Series, key: str, default: str = None) str[source]

Retrieve a value from a row, handling missing or NaN values.

Parameters:
  • row (pd.Series) – A row from the DataFrame.

  • key (str) – The key to retrieve the value for.

  • default (str, optional) – Default value to return if the key does not exist or is NaN.

Returns:

The value associated with the key, or default if not found.

Return type:

str
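A possible shape for the missing-or-NaN handling is sketched below. It is written against any mapping with a .get method (a plain dict here; pd.Series offers the same method), and it is an assumption about the logic rather than the package's actual code.

```python
import math


def get_value(row, key, default=None):
    """Return row[key], or default when the key is absent or the value is NaN."""
    value = row.get(key, default)
    # NaN is how pandas marks an empty cell; treat it like a missing key.
    if isinstance(value, float) and math.isnan(value):
        return default
    return value
```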

is_type(type_hint, type_to_search_for) bool[source]

Recursively check if a type hint is or contains input type.
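The recursive check can be illustrated with typing.get_args, which returns the type arguments of a generic hint. This is a sketch of the idea, assuming equality comparison at each level; the real method may differ in detail.

```python
from typing import List, Optional, get_args


def is_type(type_hint, type_to_search_for) -> bool:
    """Return True if type_hint is, or contains, type_to_search_for."""
    if type_hint == type_to_search_for:
        return True
    # Descend into the hint's type arguments, e.g. Optional[List[str]] -> List[str] -> str.
    return any(is_type(arg, type_to_search_for) for arg in get_args(type_hint))
```

For example, is_type(Optional[List[str]], str) is True, while is_type(Optional[List[int]], str) is False.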

parse_biosample_metadata(row: Series) Dict[source]

Parse the metadata row to get non-biosample class information.

Parameters:

row (pd.Series) – A row from the DataFrame containing metadata.

Return type:

Dict

Metadata Generator Base Class

class src.metadata_generator.NMDCMetadataGenerator(metadata_file: str, database_dump_json_path: str, raw_data_url: str, process_data_url: str)[source]

Bases: ABC

Abstract class for generating NMDC metadata objects using provided metadata files and configuration.

Parameters:
  • metadata_file (str) – Path to the input CSV metadata file.

  • database_dump_json_path (str) – Path where the output database dump JSON file will be saved.

  • raw_data_url (str) – Base URL for the raw data files.

  • process_data_url (str) – Base URL for the processed data files.

check_doj_urls(metadata_df: DataFrame, url_columns: List) None[source]

Check if the URLs in the input list already exist in the database.

Parameters:
  • metadata_df (pd.DataFrame) – The DataFrame containing the metadata information.

  • url_columns (List) – The list of columns in the DataFrame that contain URLs to check.

Return type:

None

Raises:
  • ValueError – If any URL in the metadata DataFrame is invalid or inaccessible.

  • FileNotFoundError – If no files are found in the specified directory columns.

check_for_biosamples(metadata_df: DataFrame, nmdc_database_inst: Database, CLIENT_ID: str, CLIENT_SECRET: str) None[source]

Verify that each row of the metadata DataFrame has a ‘biosample_id’. The method loops over the rows one at a time, so some rows can carry an existing ID while others need one generated. If a row’s ‘biosample_id’ is missing, the method checks for the columns required to generate a new biosample_id using the NMDC API. If they are all present, it calls the dynam_parse_biosample_metadata method of the MetadataParser class to create the JSON for the biosample. If required columns are missing and there is no biosample_id, it raises a ValueError. After the biosample_id is generated, it updates the DataFrame row with the newly minted biosample_id and adds the new biosample JSON to the NMDC database instance.

Parameters:
  • metadata_df (pd.DataFrame) – The DataFrame containing the metadata information.

  • nmdc_database_inst (nmdc.Database) – The NMDC Database instance to add the biosample to if one needs to be generated.

  • CLIENT_ID (str) – The client ID for the NMDC API. Used to mint a biosample id if one does not exist.

  • CLIENT_SECRET (str) – The client secret for the NMDC API. Used to mint a biosample id if one does not exist.

Return type:

None

Raises:

ValueError – If the ‘biosample.name’ column is missing and ‘biosample_id’ is empty. If any required columns for biosample generation are missing.

clean_dict(dict: Dict) Dict[source]

Clean the dictionary by removing keys with empty or None values.

Parameters:

dict (Dict) – The dictionary to be cleaned.

Returns:

A new dictionary with keys removed where the values are None, an empty string, or a string with only whitespace.

Return type:

Dict
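The cleaning rule (drop None, empty strings, and whitespace-only strings) can be sketched as a single comprehension. The parameter is named data here to avoid shadowing the built-in dict; that name is a choice made for this illustration.

```python
def clean_dict(data: dict) -> dict:
    """Return a new dict without keys whose values are None or blank strings."""
    return {
        key: value
        for key, value in data.items()
        if not (value is None or (isinstance(value, str) and value.strip() == ""))
    }
```

Falsy values that are not None or blank strings, such as 0 or False, are kept under this rule.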

dump_nmdc_database(nmdc_database: Database) None[source]

Dump the NMDC database to a JSON file.

This method serializes the NMDC Database instance to a JSON file at the specified path.

Parameters:

nmdc_database (nmdc.Database) – The NMDC Database instance to dump.

Returns:

None

Side Effects

Writes the database content to the file specified by self.database_dump_json_path.

generate_biosample(biosamp_metadata: dict, CLIENT_ID: str, CLIENT_SECRET: str) Biosample[source]

Mint a biosample id from the given metadata and create a biosample instance.

Parameters:
  • biosamp_metadata (dict) – The metadata object containing biosample information.

  • CLIENT_ID (str) – The client ID for the NMDC API.

  • CLIENT_SECRET (str) – The client secret for the NMDC API.

Returns:

The generated biosample instance.

Return type:

nmdc.Biosample

generate_data_object(file_path: Path, data_category: str, data_object_type: str, description: str, base_url: str, CLIENT_ID: str, CLIENT_SECRET: str, was_generated_by: str = None, alternative_id: str = None) DataObject[source]

Create an NMDC DataObject with metadata from the specified file and details.

This method generates an NMDC DataObject and assigns it a unique NMDC ID. The DataObject is populated with metadata derived from the provided file and input parameters.

Parameters:
  • file_path (Path) – Path to the file representing the data object. The file’s name is used as the name attribute.

  • data_category (str) – Category of the data object (e.g., ‘instrument_data’).

  • data_object_type (str) – Type of the data object (e.g., ‘LC-DDA-MS/MS Raw Data’).

  • description (str) – Description of the data object.

  • base_url (str) – Base URL for accessing the data object, to which the file name is appended to form the complete URL.

  • CLIENT_ID (str) – The client ID for the NMDC API.

  • CLIENT_SECRET (str) – The client secret for the NMDC API.

  • was_generated_by (str, optional) – ID of the process or entity that generated the data object (e.g., the DataGeneration id or the MetabolomicsAnalysis id).

  • alternative_id (str, optional) – An optional alternative identifier for the data object.

Returns:

An NMDC DataObject instance with the specified metadata.

Return type:

nmdc.DataObject

Notes

This method calculates the MD5 checksum of the file, which may be time-consuming for large files.
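For large files, the checksum is usually computed in chunks so the whole file never has to sit in memory. The sketch below shows that approach with hashlib, together with the URL rule described for base_url; the helper names md5_checksum and data_object_url are hypothetical, and plain string concatenation is an assumption about how the URL is formed.

```python
import hashlib
from pathlib import Path


def md5_checksum(file_path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it chunk_size bytes at a time."""
    digest = hashlib.md5()
    with open(file_path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def data_object_url(base_url: str, file_path: Path) -> str:
    """Append the file name to the base URL (assumes base_url ends with '/')."""
    return base_url + file_path.name
```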

generate_mass_spectrometry(file_path: Path, instrument_name: str, sample_id: str, raw_data_id: str, study_id: str, processing_institution: str, mass_spec_config_name: str, start_date: str, end_date: str, CLIENT_ID: str, CLIENT_SECRET: str, lc_config_name: str = None, calibration_id: str = None) DataGeneration[source]

Create an NMDC DataGeneration object for mass spectrometry and mint an NMDC ID.

Parameters:
  • file_path (Path) – File path of the mass spectrometry data.

  • instrument_name (str) – Name of the instrument used for data generation.

  • sample_id (str) – ID of the input sample associated with the data generation.

  • raw_data_id (str) – ID of the raw data object associated with the data generation.

  • study_id (str) – ID of the study associated with the data generation.

  • processing_institution (str) – Name of the processing institution.

  • mass_spec_config_name (str) – Name of the mass spectrometry configuration.

  • start_date (str) – Start date of the data generation.

  • end_date (str) – End date of the data generation.

  • CLIENT_ID (str) – The client ID for the NMDC API.

  • CLIENT_SECRET (str) – The client secret for the NMDC API.

  • lc_config_name (str, optional) – Name of the liquid chromatography configuration.

  • calibration_id (str, optional) – ID of the calibration information generated with the data. Default is None, indicating no calibration information.

Returns:

An NMDC DataGeneration object with the provided metadata.

Return type:

nmdc.DataGeneration

Notes

This method uses the nmdc_api_utilities package to fetch IDs for the instrument and configurations. It also mints a new NMDC ID for the DataGeneration object.

generate_metabolomics_analysis(cluster_name: str, raw_data_name: str, raw_data_id: str, data_gen_id: str, processed_data_id: str, parameter_data_id: str, processing_institution: str, CLIENT_ID: str, CLIENT_SECRET: str, calibration_id: str = None, incremeneted_id: str = None, metabolite_identifications: List[MetaboliteIdentification] = None, type: str = 'nmdc:MetabolomicsAnalysis') MetabolomicsAnalysis[source]

Create an NMDC MetabolomicsAnalysis object with metadata for a workflow analysis.

This method generates an NMDC MetabolomicsAnalysis object, including details about the analysis, the processing institution, and relevant workflow information.

Parameters:
  • cluster_name (str) – Name of the cluster or computing resource used for the analysis.

  • raw_data_name (str) – Name of the raw data file that was analyzed.

  • raw_data_id (str) – ID of the raw data object that was analyzed.

  • data_gen_id (str) – ID of the DataGeneration object that generated the raw data.

  • processed_data_id (str) – ID of the processed data resulting from the analysis.

  • parameter_data_id (str) – ID of the parameter data object used for the analysis.

  • processing_institution (str) – Name of the institution where the analysis was performed.

  • CLIENT_ID (str) – The client ID for the NMDC API.

  • CLIENT_SECRET (str) – The client secret for the NMDC API.

  • calibration_id (str, optional) – ID of the calibration information used for the analysis. Default is None, indicating no calibration information.

  • incremeneted_id (str, optional) – An optional incremented ID for the MetabolomicsAnalysis object. If not provided, a new NMDC ID will be minted.

  • metabolite_identifications (List[nmdc.MetaboliteIdentification], optional) – List of MetaboliteIdentification objects associated with the analysis. Default is None, which indicates no metabolite identifications.

  • type (str, optional) – The type of the analysis. Default is NmdcTypes.MetabolomicsAnalysis.

Returns:

An NMDC MetabolomicsAnalysis instance with the provided metadata.

Return type:

nmdc.MetabolomicsAnalysis

Notes

The ‘started_at_time’ and ‘ended_at_time’ fields are initialized with placeholder values and should be updated with actual timestamps later when the processed files are iterated over in the run method.

handle_biosample(row: Series) tuple[source]

Process biosample information from metadata row.

Checks if a biosample ID exists in the row. If it does, returns the existing biosample information. If not, generates a new biosample.

Parameters:

row (pd.Series) – A row from the metadata DataFrame containing biosample information

Returns:

A tuple containing:

  • emsl_metadata (Dict) – Parsed metadata from the input CSV row.

  • biosample_id (str) – The ID of the biosample (existing or newly generated).

Return type:

tuple

load_bio_credentials(config_file: str = None) str[source]

Load bio ontology API key from the environment or a configuration file.

Parameters:

config_file (str, optional) – The path to the configuration file.

Returns:

The bio ontology API key.

Return type:

str

Raises:
  • FileNotFoundError – If the configuration file is not found, and the API key is not set in the environment.

  • ValueError – If the configuration file is not valid or does not contain the API key.
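The environment-then-file fallback and its two error cases can be sketched as follows. Several details are assumptions made for illustration: the lookup order, the JSON file format, the "BIO_API_KEY" field name inside the file, and the extra environ parameter (added here only so the function can be exercised without changing real environment variables).

```python
import json
import os
from pathlib import Path


def load_bio_credentials(config_file=None, environ=os.environ) -> str:
    """Return the bio ontology API key from the environment or a config file."""
    key = environ.get("BIO_API_KEY")
    if key:
        return key
    # No key in the environment: a readable configuration file is now required.
    if config_file is None or not Path(config_file).is_file():
        raise FileNotFoundError("BIO_API_KEY is not set and no configuration file was found")
    try:
        config = json.loads(Path(config_file).read_text())
    except json.JSONDecodeError as exc:
        raise ValueError("configuration file is not valid") from exc
    if not isinstance(config, dict) or not config.get("BIO_API_KEY"):
        raise ValueError("configuration file does not contain the API key")
    return config["BIO_API_KEY"]
```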

load_credentials(config_file: str = None) tuple[source]

Load the client ID and secret from the environment or a configuration file.

Parameters:

config_file (str, optional) – The path to the configuration file.

Returns:

A tuple containing the client ID and client secret.

Return type:

tuple

load_metadata() DataFrame[source]

Load and group workflow metadata from a CSV file.

This method reads the metadata CSV file, checks for uniqueness in specified columns, checks that biosamples exist, and groups the data by biosample ID.

Returns:

A DataFrame containing the loaded and grouped metadata.

Return type:

pd.core.frame.DataFrame

Raises:
  • FileNotFoundError – If the metadata_file does not exist.

  • ValueError – If values in the columns ‘Raw Data File’ and ‘Processed Data Directory’ are not unique.

Notes

See example_metadata_file.csv in this directory for an example of the expected input file format.
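The uniqueness check can be illustrated with the standard library alone. The real method returns a pandas DataFrame, so this is a stand-in for the validation step only; the function name check_unique_columns and the error message are hypothetical.

```python
import csv
import io
from collections import Counter


def check_unique_columns(csv_text: str, unique_columns: list) -> None:
    """Raise ValueError if any listed column contains a repeated value."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for column in unique_columns:
        counts = Counter(row[column] for row in rows)
        duplicates = sorted(value for value, n in counts.items() if n > 1)
        if duplicates:
            raise ValueError(f"Duplicate values in column '{column}': {duplicates}")
```

Calling it with ['Raw Data File', 'Processed Data Directory'] mirrors the two columns named in the Raises section.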

start_nmdc_database() Database[source]

Initialize and return a new NMDC Database instance.

Returns:

A new instance of an NMDC Database.

Return type:

nmdc.Database

Notes

This method simply creates and returns a new instance of the NMDC Database. It does not perform any additional initialization or configuration.

update_outputs(analysis_obj: object, raw_data_obj_id: str, parameter_data_id: str, processed_data_id_list: list, mass_spec_obj: object = None, rerun: bool = False) None[source]

Update output references for Mass Spectrometry and Workflow Analysis objects.

This method assigns the output references for a Mass Spectrometry object and a Workflow Execution Analysis object. It sets mass_spec_obj.has_output to the ID of raw_data_obj and analysis_obj.has_output to a list of processed data IDs.

Parameters:
  • analysis_obj (object) – The Workflow Execution Analysis object to update (e.g., MetabolomicsAnalysis).

  • raw_data_obj_id (str) – ID of the raw data object associated with the Mass Spectrometry.

  • parameter_data_id (str) – ID of the data object representing the parameter data used for the analysis.

  • processed_data_id_list (list) – List of IDs representing processed data objects associated with the Workflow Execution.

  • mass_spec_obj (object , optional) – The Mass Spectrometry object to update. Optional for rerun cases.

  • rerun (bool, optional) – If True, this indicates the run is a rerun, and the method will not set mass_spec_obj.has_output because there is not one. Default is False.

Return type:

None

Notes

  • Sets mass_spec_obj.has_output to [raw_data_obj.id].

  • Sets analysis_obj.has_output to processed_data_id_list.
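The two assignments listed in the notes can be sketched with simple stand-in objects. The parameter_data_id argument is left out because this page does not say how it is used, and SimpleNamespace stands in for the NMDC objects; both are simplifications made for illustration.

```python
from types import SimpleNamespace


def update_outputs(analysis_obj, raw_data_obj_id, processed_data_id_list,
                   mass_spec_obj=None, rerun=False) -> None:
    """Point each object at the data it produced."""
    if not rerun:
        # On a first run the mass spectrometry object produced the raw data.
        mass_spec_obj.has_output = [raw_data_obj_id]
    # The analysis always produces the processed data objects.
    analysis_obj.has_output = processed_data_id_list


# Stand-ins for the NMDC objects; the IDs are placeholders.
mass_spec = SimpleNamespace(has_output=None)
analysis = SimpleNamespace(has_output=None)
update_outputs(analysis, "raw-1", ["proc-1", "proc-2"], mass_spec_obj=mass_spec)
```

In a rerun, mass_spec_obj can be omitted and only the analysis object is updated.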

LC/MS Metadata Generator Base Class

class src.lcms_metadata_generator.LCMSMetadataGenerator(metadata_file: str, database_dump_json_path: str, raw_data_url: str, process_data_url: str)[source]

Bases: NMDCMetadataGenerator

A class for generating NMDC metadata objects using provided metadata files and configuration for LC-MS data.

This class processes input metadata files, generates various NMDC objects, and produces a database dump in JSON format.

create_workflow_metadata(row: dict[str, str]) LCMSLipidWorkflowMetadata[source]

Create a LCMSLipidWorkflowMetadata object from a dictionary of workflow metadata.

Parameters:

row (dict[str, str]) – Dictionary containing metadata for a workflow. This is typically a row from the input metadata CSV file.

Returns:

A LCMSLipidWorkflowMetadata object populated with data from the input dictionary.

Return type:

LCMSLipidWorkflowMetadata

Notes

The input dictionary is expected to contain the following keys: ‘Processed Data Directory’, ‘Raw Data File’, ‘Raw Data Object Alt Id’, ‘mass spec configuration name’, ‘lc config name’, ‘instrument used’, ‘instrument analysis start date’, ‘instrument analysis end date’, ‘execution resource’.
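The mapping from CSV column names to data class fields might look like the sketch below. The data class here is a local stand-in that mirrors the fields documented for LCMSLipidWorkflowMetadata on this page; the exact key-to-field pairing is an assumption, and ‘Raw Data Object Alt Id’ is omitted because the documented data class has no matching field.

```python
from dataclasses import dataclass


@dataclass
class LCMSLipidWorkflowMetadata:
    """Local stand-in mirroring the documented fields."""
    processed_data_dir: str
    raw_data_file: str
    mass_spec_config_name: str
    lc_config_name: str
    instrument_used: str
    instrument_analysis_start_date: str
    instrument_analysis_end_date: str
    execution_resource: float


def create_workflow_metadata(row: dict) -> LCMSLipidWorkflowMetadata:
    """Build the workflow metadata object from one row of the input CSV."""
    return LCMSLipidWorkflowMetadata(
        processed_data_dir=row["Processed Data Directory"],
        raw_data_file=row["Raw Data File"],
        mass_spec_config_name=row["mass spec configuration name"],
        lc_config_name=row["lc config name"],
        instrument_used=row["instrument used"],
        instrument_analysis_start_date=row["instrument analysis start date"],
        instrument_analysis_end_date=row["instrument analysis end date"],
        execution_resource=row["execution resource"],
    )
```

A missing column surfaces as a KeyError in this sketch, which makes an incomplete metadata file fail early.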

rerun() None[source]

Execute a rerun of the metadata generation process for metabolomics data.

This method performs the following steps:

  1. Initialize an NMDC Database instance.

  2. Load and process metadata to create NMDC objects.

  3. Generate Metabolomics Analysis and Processed Data objects.

  4. Update outputs for the Metabolomics Analysis object.

  5. Append generated objects to the NMDC Database.

  6. Dump the NMDC Database to a JSON file.

  7. Validate the JSON file using the NMDC API.

Return type:

None

Raises:
  • FileNotFoundError – If the processed data directory is empty or not found.

  • ValueError – If the number of files in the processed data directory is not as expected.

Notes

This method uses tqdm to display progress bars for the processing of biosamples and mass spectrometry metadata.

run() None[source]

Execute the metadata generation process for lipidomics data.

This method performs the following steps:

  1. Initialize an NMDC Database instance.

  2. Load and process metadata to create NMDC objects.

  3. Generate Mass Spectrometry, Raw Data, Metabolomics Analysis, and Processed Data objects.

  4. Update outputs for Mass Spectrometry and Metabolomics Analysis objects.

  5. Append generated objects to the NMDC Database.

  6. Dump the NMDC Database to a JSON file.

  7. Validate the JSON file using the NMDC API.

Return type:

None

Raises:
  • FileNotFoundError – If the processed data directory is empty or not found.

  • ValueError – If the number of files in the processed data directory is not as expected.

Notes

This method uses tqdm to display progress bars for the processing of biosamples and mass spectrometry metadata.

Data Class

class src.data_classes.NmdcTypes(Biosample: str = 'nmdc:Biosample', MassSpectrometry: str = 'nmdc:MassSpectrometry', MetabolomicsAnalysis: str = 'nmdc:MetabolomicsAnalysis', DataObject: str = 'nmdc:DataObject', CalibrationInformation: str = 'nmdc:CalibrationInformation', MetaboliteIdentification: str = 'nmdc:MetaboliteIdentification', NomAnalysis: str = 'nmdc:NomAnalysis', OntologyClass: str = 'nmdc:OntologyClass', ControlledIdentifiedTermValue: str = 'nmdc:ControlledIdentifiedTermValue', TextValue: str = 'nmdc:TextValue', GeolocationValue: str = 'nmdc:GeolocationValue', TimeStampValue: str = 'nmdc:TimestampValue', QuantityValue: str = 'nmdc:QuantityValue')[source]

Bases: object

Data class holding NMDC type constants.

Biosample

NMDC type for Biosample.

Type:

str

MassSpectrometry

NMDC type for Mass Spectrometry.

Type:

str

MetabolomicsAnalysis

NMDC type for Metabolomics Analysis.

Type:

str

DataObject

NMDC type for Data Object.

Type:

str

CalibrationInformation

NMDC type for Calibration Information.

Type:

str

MetaboliteIdentification

NMDC type for Metabolite Identification.

Type:

str

NomAnalysis

NMDC type for NOM Analysis.

Type:

str

OntologyClass

NMDC type for Ontology Class.

Type:

str

ControlledIdentifiedTermValue

NMDC type for Controlled Identified Term Value.

Type:

str

TextValue

NMDC type for Text Value.

Type:

str

GeolocationValue

NMDC type for Geolocation Value.

Type:

str

TimeStampValue

NMDC type for Timestamp Value.

Type:

str

QuantityValue

NMDC type for Quantity Value.

Type:

str

Biosample: str = 'nmdc:Biosample'
CalibrationInformation: str = 'nmdc:CalibrationInformation'
ControlledIdentifiedTermValue: str = 'nmdc:ControlledIdentifiedTermValue'
DataObject: str = 'nmdc:DataObject'
GeolocationValue: str = 'nmdc:GeolocationValue'
MassSpectrometry: str = 'nmdc:MassSpectrometry'
MetaboliteIdentification: str = 'nmdc:MetaboliteIdentification'
MetabolomicsAnalysis: str = 'nmdc:MetabolomicsAnalysis'
NomAnalysis: str = 'nmdc:NomAnalysis'
OntologyClass: str = 'nmdc:OntologyClass'
QuantityValue: str = 'nmdc:QuantityValue'
TextValue: str = 'nmdc:TextValue'
TimeStampValue: str = 'nmdc:TimestampValue'
class src.data_classes.GCMSMetabWorkflowMetadata(biosample_id: str, nmdc_study: str, processing_institution: str, processed_data_file: str, raw_data_file: str, mass_spec_config_name: str, chromat_config_name: str, instrument_used: str, instrument_analysis_start_date: str, instrument_analysis_end_date: str, execution_resource: float, calibration_id: str)[source]

Bases: object

Data class for holding GCMS metabolomic workflow metadata information.

biosample_id

Identifier for the biosample.

Type:

str

nmdc_study

Identifier for the NMDC study.

Type:

str

processing_institution

Name of the institution processing the data.

Type:

str

processed_data_file

Path or name of the processed data file.

Type:

str

raw_data_file

Path or name of the raw data file.

Type:

str

mass_spec_config_name

Name of the mass spectrometry configuration used.

Type:

str

chromat_config_name

Name of the chromatography configuration used.

Type:

str

instrument_used

Name of the instrument used for analysis.

Type:

str

instrument_analysis_start_date

Start date of the instrument analysis.

Type:

str

instrument_analysis_end_date

End date of the instrument analysis.

Type:

str

execution_resource

Identifier for the execution resource.

Type:

float

calibration_id

Identifier for the calibration information used.

Type:

str

biosample_id: str
calibration_id: str
chromat_config_name: str
execution_resource: float
instrument_analysis_end_date: str
instrument_analysis_start_date: str
instrument_used: str
mass_spec_config_name: str
nmdc_study: str
processed_data_file: str
processing_institution: str
raw_data_file: str
class src.data_classes.LCMSLipidWorkflowMetadata(processed_data_dir: str, raw_data_file: str, mass_spec_config_name: str, lc_config_name: str, instrument_used: str, instrument_analysis_start_date: str, instrument_analysis_end_date: str, execution_resource: float)[source]

Bases: object

Data class for holding LC-MS lipidomics workflow metadata information.

processed_data_dir

Directory containing processed data files.

Type:

str

raw_data_file

Path or name of the raw data file.

Type:

str

mass_spec_config_name

Name of the mass spectrometry configuration used.

Type:

str

lc_config_name

Name of the liquid chromatography configuration used.

Type:

str

instrument_used

Name of the instrument used for analysis.

Type:

str

instrument_analysis_start_date

Start date of the instrument analysis.

Type:

str

instrument_analysis_end_date

End date of the instrument analysis.

Type:

str

execution_resource

Identifier for the execution resource.

Type:

float

execution_resource: float
instrument_analysis_end_date: str
instrument_analysis_start_date: str
instrument_used: str
lc_config_name: str
mass_spec_config_name: str
processed_data_dir: str
raw_data_file: str

LC/MS Lipidomics Metadata Generator Subclass

class src.lcms_lipid_metadata_generator.LCMSLipidomicsMetadataGenerator(metadata_file: str, database_dump_json_path: str, raw_data_url: str, process_data_url: str, minting_config_creds: str = None)[source]

Bases: LCMSMetadataGenerator

A class for generating NMDC metadata objects using provided metadata files and configuration for LC-MS lipidomics data.

This class processes input metadata files, generates various NMDC objects, and produces a database dump in JSON format.

Parameters:
  • metadata_file (str) – Path to the input CSV metadata file.

  • database_dump_json_path (str) – Path where the output database dump JSON file will be saved.

  • raw_data_url (str) – Base URL for the raw data files.

  • process_data_url (str) – Base URL for the processed data files.

  • minting_config_creds (str, optional) – Path to the configuration file containing the client ID and client secret for minting NMDC IDs. It can also include the bio ontology API key if generating biosample ids is needed. If not provided, the CLIENT_ID, CLIENT_SECRET, and BIO_API_KEY environment variables will be used.

unique_columns

List of unique columns in the metadata file.

Type:

List[str]

mass_spec_desc

Description of the mass spectrometry analysis.

Type:

str

mass_spec_eluent_intro

Eluent introduction category for mass spectrometry.

Type:

str

analyte_category

Category of the analyte.

Type:

str

raw_data_obj_type

Type of the raw data object.

Type:

str

raw_data_obj_desc

Description of the raw data object.

Type:

str

workflow_analysis_name

Name of the workflow analysis.

Type:

str

workflow_description

Description of the workflow.

Type:

str

workflow_git_url

URL of the workflow’s Git repository.

Type:

str

workflow_version

Version of the workflow.

Type:

str

workflow_category

Category of the workflow.

Type:

str

wf_config_process_data_category

Category of the workflow configuration process data.

Type:

str

wf_config_process_data_obj_type

Type of the workflow configuration process data object.

Type:

str

wf_config_process_data_description

Description of the workflow configuration process data.

Type:

str

no_config_process_data_category

Category for processed data without configuration.

Type:

str

no_config_process_data_obj_type

Type of processed data object without configuration.

Type:

str

csv_process_data_description

Description of CSV processed data.

Type:

str

hdf5_process_data_obj_type

Type of HDF5 processed data object.

Type:

str

hdf5_process_data_description

Description of HDF5 processed data.

Type:

str

analyte_category: str = 'lipidome'
csv_process_data_description: str = 'Lipid annotations as a result of a lipidomics workflow activity.'
hdf5_process_data_description: str = 'CoreMS hdf5 file representing a lipidomics data file including annotations.'
hdf5_process_data_obj_type: str = 'LC-MS Lipidomics Processed Data'
mass_spec_desc: str = 'Generation of mass spectrometry data for the analysis of lipids.'
mass_spec_eluent_intro: str = 'liquid_chromatography'
no_config_process_data_category: str = 'processed_data'
no_config_process_data_obj_type: str = 'LC-MS Lipidomics Results'
raw_data_obj_desc: str = 'LC-DDA-MS/MS raw data for lipidomics data acquisition.'
raw_data_obj_type: str = 'LC-DDA-MS/MS Raw Data'
rerun()[source]

Execute a rerun of the metadata generation process for metabolomics data.

This method performs the following steps:

  1. Initialize an NMDC Database instance.

  2. Load and process metadata to create NMDC objects.

  3. Generate Metabolomics Analysis and Processed Data objects.

  4. Update outputs for the Metabolomics Analysis object.

  5. Append generated objects to the NMDC Database.

  6. Dump the NMDC Database to a JSON file.

  7. Validate the JSON file using the NMDC API.

Return type:

None

Raises:
  • FileNotFoundError – If the processed data directory is empty or not found.

  • ValueError – If the number of files in the processed data directory is not as expected.

Notes

This method uses tqdm to display progress bars for the processing of biosamples and mass spectrometry metadata.

run()[source]

Execute the metadata generation process for lipidomics data.

This method performs the following steps:

  1. Initialize an NMDC Database instance.

  2. Load and process metadata to create NMDC objects.

  3. Generate Mass Spectrometry, Raw Data, Metabolomics Analysis, and Processed Data objects.

  4. Update outputs for Mass Spectrometry and Metabolomics Analysis objects.

  5. Append generated objects to the NMDC Database.

  6. Dump the NMDC Database to a JSON file.

  7. Validate the JSON file using the NMDC API.

Return type:

None

Raises:
  • FileNotFoundError – If the processed data directory is empty or not found.

  • ValueError – If the number of files in the processed data directory is not as expected.

Notes

This method uses tqdm to display progress bars for the processing of biosamples and mass spectrometry metadata.

unique_columns: List[str] = ['raw_data_file', 'processed_data_directory']
wf_config_process_data_category: str = 'workflow_parameter_data'
wf_config_process_data_description: str = 'CoreMS parameters used for Lipidomics workflow.'
wf_config_process_data_obj_type: str = 'Configuration toml'
workflow_analysis_name: str = 'Lipidomics analysis'
workflow_category: str = 'lc_ms_lipidomics'
workflow_description: str = 'Analysis of raw mass spectrometry data for the annotation of lipids.'
workflow_git_url: str = 'https://github.com/microbiomedata/metaMS/wdl/metaMS_lipidomics.wdl'
workflow_version: str = '1.0.0'

LC/MS Metabolomics Metadata Generator Subclass

class src.lcms_metab_metadata_generator.LCMSMetabolomicsMetadataGenerator(metadata_file: str, database_dump_json_path: str, raw_data_url: str, process_data_url: str, minting_config_creds: str = None)[source]

Bases: LCMSMetadataGenerator

A class for generating NMDC metadata objects using provided metadata files and configuration for LC-MS metabolomics data.

This class processes input metadata files, generates various NMDC objects, and produces a database dump in JSON format.

Parameters:
  • metadata_file (str) – Path to the input CSV metadata file.

  • database_dump_json_path (str) – Path where the output database dump JSON file will be saved.

  • raw_data_url (str) – Base URL for the raw data files.

  • process_data_url (str) – Base URL for the processed data files.

  • minting_config_creds (str, optional) – Path to the configuration file containing the client ID and client secret for minting NMDC IDs. It can also include the BioOntology API key if biosample IDs need to be generated. If not provided, the CLIENT_ID, CLIENT_SECRET, and BIO_API_KEY environment variables will be used.
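
The fallback described for minting_config_creds can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper name resolve_credentials, the lower-case configuration keys, and the choice to treat only the two minting credentials as required are all assumptions made for the example.

```python
import os
from typing import Mapping, Optional

CREDENTIAL_NAMES = ("CLIENT_ID", "CLIENT_SECRET", "BIO_API_KEY")


def resolve_credentials(
    config: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Hypothetical helper: use a parsed config if given, else the environment."""
    env = os.environ if env is None else env
    if config is not None:
        # Assumption: the config stores lower-case versions of the names.
        resolved = {name: config.get(name.lower()) for name in CREDENTIAL_NAMES}
    else:
        resolved = {name: env.get(name) for name in CREDENTIAL_NAMES}
    # Assumption: the BioOntology key is only needed when biosample IDs are
    # generated, so only the minting credentials are required here.
    missing = [n for n in ("CLIENT_ID", "CLIENT_SECRET") if not resolved[n]]
    if missing:
        raise ValueError(f"Missing credentials: {', '.join(missing)}")
    return resolved


# Usage with an explicit mapping standing in for the environment.
example = resolve_credentials(env={"CLIENT_ID": "id", "CLIENT_SECRET": "secret"})
print(example["BIO_API_KEY"])  # None, because the key was not supplied
```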

unique_columns

List of unique columns in the metadata file.

Type:

list[str]

mass_spec_desc

Description of the mass spectrometry analysis.

Type:

str

mass_spec_eluent_intro

Eluent introduction category for mass spectrometry.

Type:

str

analyte_category

Category of the analyte.

Type:

str

raw_data_obj_type

Type of the raw data object.

Type:

str

raw_data_obj_desc

Description of the raw data object.

Type:

str

workflow_analysis_name

Name of the workflow analysis.

Type:

str

workflow_description

Description of the workflow.

Type:

str

workflow_git_url

URL of the workflow’s Git repository.

Type:

str

workflow_version

Version of the workflow.

Type:

str

workflow_category

Category of the workflow.

Type:

str

wf_config_process_data_category

Category of the workflow configuration process data.

Type:

str

wf_config_process_data_obj_type

Type of the workflow configuration process data object.

Type:

str

wf_config_process_data_description

Description of the workflow configuration process data.

Type:

str

no_config_process_data_category

Category for processed data without configuration.

Type:

str

no_config_process_data_obj_type

Type of processed data object without configuration.

Type:

str

csv_process_data_description

Description of CSV processed data.

Type:

str

hdf5_process_data_obj_type

Type of HDF5 processed data object.

Type:

str

hdf5_process_data_description

Description of HDF5 processed data.

Type:

str

analyte_category: str = 'metabolome'
csv_process_data_description: str = 'Metabolite annotations as a result of a metabolomics workflow activity.'
hdf5_process_data_description: str = 'CoreMS hdf5 file representing a metabolomics data file including annotations.'
hdf5_process_data_obj_type: str = 'LC-MS Metabolomics Processed Data'
mass_spec_desc: str = 'Generation of mass spectrometry data for the analysis of metabolomics using liquid chromatography.'
mass_spec_eluent_intro: str = 'liquid_chromatography'
no_config_process_data_category: str = 'processed_data'
no_config_process_data_obj_type: str = 'LC-MS Metabolomics Results'
raw_data_obj_desc: str = 'LC-DDA-MS/MS raw data for metabolomics data acquisition.'
raw_data_obj_type: str = 'LC-DDA-MS/MS Raw Data'
rerun()[source]

Execute a rerun of the metadata generation process for metabolomics data.

This method performs the following steps:

  1. Initialize an NMDC Database instance.
  2. Load and process metadata to create NMDC objects.
  3. Generate Metabolomics Analysis and Processed Data objects.
  4. Update outputs for the Metabolomics Analysis object.
  5. Append generated objects to the NMDC Database.
  6. Dump the NMDC Database to a JSON file.
  7. Validate the JSON file using the NMDC API.

Return type:

None

Raises:
  • FileNotFoundError – If the processed data directory is empty or not found.

  • ValueError – If the number of files in the processed data directory is not as expected.

Notes

This method uses tqdm to display progress bars for the processing of biosamples and mass spectrometry metadata.

run()[source]

Execute the metadata generation process for metabolomics data.

This method performs the following steps:

  1. Initialize an NMDC Database instance.
  2. Load and process metadata to create NMDC objects.
  3. Generate Mass Spectrometry, Raw Data, Metabolomics Analysis, and Processed Data objects.
  4. Update outputs for Mass Spectrometry and Metabolomics Analysis objects.
  5. Append generated objects to the NMDC Database.
  6. Dump the NMDC Database to a JSON file.
  7. Validate the JSON file using the NMDC API.

Return type:

None

Raises:
  • FileNotFoundError – If the processed data directory is empty or not found.

  • ValueError – If the number of files in the processed data directory is not as expected.

Notes

This method uses tqdm to display progress bars for the processing of biosamples and mass spectrometry metadata.

unique_columns: list[str] = ['raw_data_file', 'processed_data_directory']
wf_config_process_data_category: str = 'workflow_parameter_data'
wf_config_process_data_description: str = 'CoreMS parameters used for metabolomics workflow.'
wf_config_process_data_obj_type: str = 'Configuration toml'
workflow_analysis_name: str = 'Metabolomics analysis'
workflow_category: str = 'lc_ms_metabolomics'
workflow_description: str = 'Analysis of raw mass spectrometry data for the annotation of metabolites.'
workflow_git_url: str = 'https://github.com/microbiomedata/metaMS/wdl/metaMS_lcms_metabolomics.wdl'
workflow_version: str = '1.0.0'

GC/MS Metabolomics Metadata Generator Subclass

class src.gcms_metab_metadata_generator.GCMSMetabolomicsMetadataGenerator(metadata_file: str, database_dump_json_path: str, raw_data_url: str, process_data_url: str, minting_config_creds: str = None, calibration_standard: str = 'fames', configuration_file_name: str = 'emsl_gcms_corems_params.toml')[source]

Bases: NMDCMetadataGenerator

A class for generating NMDC metadata objects related to GC/MS metabolomics data.

This class processes input metadata files, generates various NMDC objects, and produces a database dump in JSON format.

Parameters:
  • metadata_file (str) – Path to the metadata CSV file.

  • database_dump_json_path (str) – Path to the output JSON file for the NMDC database dump.

  • raw_data_url (str) – Base URL for the raw data files.

  • process_data_url (str) – Base URL for the processed data files.

  • minting_config_creds (str, optional) – Path to the minting configuration credentials file. Default is None.

  • calibration_standard (str, optional) – Calibration standard used for the data. Default is “fames”.

  • configuration_file_name (str, optional) – Name of the configuration file. Default is “emsl_gcms_corems_params.toml”.

unique_columns

List of columns used to check for uniqueness in the metadata before processing.

Type:

List[str]

mass_spec_desc

Description of the mass spectrometry analysis.

Type:

str

mass_spec_eluent_intro

Eluent introduction category for mass spectrometry.

Type:

str

analyte_category

Category of the analyte.

Type:

str

raw_data_obj_type

Type of the raw data object.

Type:

str

raw_data_obj_desc

Description of the raw data object.

Type:

str

workflow_analysis_name

Name of the workflow analysis.

Type:

str

workflow_description

Description of the workflow.

Type:

str

workflow_git_url

URL of the workflow’s Git repository.

Type:

str

workflow_version

Version of the workflow.

Type:

str

workflow_category

Category of the workflow.

Type:

str

processed_data_category

Category of the processed data.

Type:

str

processed_data_object_type

Type of the processed data object.

Type:

str

processed_data_object_description

Description of the processed data object.

Type:

str

analyte_category: str = 'metabolome'
create_workflow_metadata(row: dict[str, str]) GCMSMetabWorkflowMetadata[source]

Create a GCMSMetabWorkflowMetadata object from a dictionary of workflow metadata.

Parameters:

row (dict[str, str]) – Dictionary containing metadata for a workflow. This is typically a row from the input metadata CSV file.

Returns:

A GCMSMetabWorkflowMetadata object populated with data from the input dictionary.

Return type:

GCMSMetabWorkflowMetadata

Notes

The input dictionary is expected to contain the following keys: ‘Processed Data Directory’, ‘Raw Data File’, ‘Raw Data Object Alt Id’, ‘mass spec configuration name’, ‘lc config name’, ‘instrument used’, ‘instrument analysis start date’, ‘instrument analysis end date’, ‘execution resource’.
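
A row can be checked against the keys listed above before it is passed to create_workflow_metadata. The sketch below is illustrative only: the helper missing_workflow_keys is hypothetical and is not part of the package, and the key names are copied from the note above.

```python
# Keys the note above says the input dictionary is expected to contain.
REQUIRED_KEYS = (
    "Processed Data Directory",
    "Raw Data File",
    "Raw Data Object Alt Id",
    "mass spec configuration name",
    "lc config name",
    "instrument used",
    "instrument analysis start date",
    "instrument analysis end date",
    "execution resource",
)


def missing_workflow_keys(row: dict) -> list:
    """Hypothetical helper: return the expected keys that are absent from row."""
    return [key for key in REQUIRED_KEYS if key not in row]


# Usage: a row that lacks one of the expected keys.
row = {key: "placeholder" for key in REQUIRED_KEYS}
del row["execution resource"]
print(missing_workflow_keys(row))  # ['execution resource']
```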

generate_calibration(calibration_object: dict, CLIENT_ID: str, CLIENT_SECRET: str, fames: bool = True, internal: bool = False) CalibrationInformation[source]

Generate a CalibrationInformation object for the NMDC Database.

Parameters:
  • calibration_object (dict) – The calibration data object.

  • CLIENT_ID (str) – The client ID for the NMDC API.

  • CLIENT_SECRET (str) – The client secret for the NMDC API.

  • fames (bool, optional) – Whether the calibration is for FAMES. Default is True.

  • internal (bool, optional) – Whether the calibration is internal. Default is False.

Returns:

A CalibrationInformation object for the NMDC Database.

Return type:

nmdc.CalibrationInformation

Notes

This method generates a CalibrationInformation object based on the calibration data object and the calibration type.

Raises:

ValueError – If the calibration type is not supported.

generate_calibration_id(metadata_df: DataFrame, nmdc_database_inst: Database, CLIENT_ID: str, CLIENT_SECRET: str) None[source]

Generate calibration information and data objects for each calibration file.

Parameters:
  • metadata_df (pd.DataFrame) – The metadata DataFrame.

  • nmdc_database_inst (nmdc.Database) – The NMDC Database instance.

  • CLIENT_ID (str) – The client ID for the NMDC API.

  • CLIENT_SECRET (str) – The client secret for the NMDC API.

Return type:

None

generate_metab_identifications(processed_data_file: str) List[MetaboliteIdentification][source]

Generate MetaboliteIdentification objects from processed data file.

Parameters:

processed_data_file (str) – Path to the processed data file.

Returns:

List of MetaboliteIdentification objects generated from the processed data file.

Return type:

List[nmdc.MetaboliteIdentification]

Notes

This method reads in the processed data file and generates MetaboliteIdentification objects, pulling out the best hit for each peak based on the highest “Similarity Score”.
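
The best-hit selection described above can be illustrated with the standard library. This is a sketch under stated assumptions, not the method's actual code: the function best_hits, the "Peak Index" and "Compound Name" column names, and the sample rows are invented for the example; only the "Similarity Score" column comes from the note.

```python
import csv
import io


def best_hits(rows, peak_column="Peak Index", score_column="Similarity Score"):
    """Hypothetical helper: keep the highest-scoring row for each peak."""
    best = {}
    for row in rows:
        peak = row[peak_column]
        score = float(row[score_column])
        # Replace the stored row only when this candidate scores higher.
        if peak not in best or score > float(best[peak][score_column]):
            best[peak] = row
    return list(best.values())


# Invented sample data with two candidate hits per peak.
sample = (
    "Peak Index,Compound Name,Similarity Score\n"
    "0,alanine,0.91\n"
    "0,glycine,0.72\n"
    "1,glucose,0.65\n"
    "1,fructose,0.88\n"
)
hits = best_hits(csv.DictReader(io.StringIO(sample)))
print([hit["Compound Name"] for hit in hits])  # ['alanine', 'fructose']
```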

mass_spec_desc: str = 'Generation of mass spectrometry data by GC/MS for the analysis of metabolites.'
mass_spec_eluent_intro: str = 'gas_chromatography'
processed_data_category: str = 'processed_data'
processed_data_object_description: str = 'Metabolomics annotations as a result of a GC/MS metabolomics workflow activity.'
processed_data_object_type: str = 'GC-MS Metabolomics Results'
raw_data_obj_desc: str = 'GC/MS low resolution raw data for metabolomics data acquisition.'
raw_data_obj_type: str = 'GC-MS Raw Data'
rerun() None[source]

Execute a rerun of the metadata generation process for GC/MS metabolomics data.

This method performs the following steps:

  1. Initialize an NMDC Database instance.
  2. Load and process metadata to create NMDC objects.
  3. Generate Metabolomics Analysis and Processed Data objects.
  4. Update outputs for the Metabolomics Analysis object.
  5. Append generated objects to the NMDC Database.
  6. Dump the NMDC Database to a JSON file.
  7. Validate the JSON file using the NMDC API.

Return type:

None

Raises:

FileNotFoundError – If the metadata file is not found.

Notes

This method uses tqdm to display progress bars for the processing of calibration information and mass spectrometry metadata.

run() None[source]

Execute the metadata generation process for GC/MS metabolomics data.

This method performs the following steps:

  1. Initialize an NMDC Database instance.
  2. Generate calibration information and data objects for each calibration file.
  3. Load and process metadata to create NMDC objects.
  4. Generate Mass Spectrometry (including metabolite identifications), Raw Data, Metabolomics Analysis, and Processed Data objects.
  5. Update outputs for Mass Spectrometry and Metabolomics Analysis objects.
  6. Append generated objects to the NMDC Database.
  7. Dump the NMDC Database to a JSON file.
  8. Validate the JSON file using the NMDC API.

Return type:

None

Raises:

ValueError – If the calibration standard is not supported.

Notes

This method uses tqdm to display progress bars for the processing of calibration information and mass spectrometry metadata.

unique_columns: List[str] = ['raw_data_file', 'processed_data_file']
workflow_analysis_name: str = 'GC/MS Metabolomics analysis'
workflow_category: str = 'gc_ms_metabolomics'
workflow_description: str = 'Analysis of raw mass spectrometry data for the annotation of metabolites.'
workflow_git_url: str = 'https://github.com/microbiomedata/metaMS/wdl/metaMS_gcms.wdl'
workflow_version: str = '3.0.0'
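
The unique_columns attribute of this class is described as the list of columns used to check for uniqueness in the metadata before processing. A minimal sketch of such a check follows. The helper find_duplicates is hypothetical, the file names are invented, and checking each column on its own (rather than the combination of columns) is an assumption.

```python
from collections import Counter


def find_duplicates(rows, unique_columns):
    """Hypothetical helper: map each column to the values that occur more than once."""
    duplicates = {}
    for column in unique_columns:
        counts = Counter(row[column] for row in rows)
        repeated = sorted(value for value, n in counts.items() if n > 1)
        if repeated:
            duplicates[column] = repeated
    return duplicates


# Invented metadata rows in which two raw files point at one processed file.
rows = [
    {"raw_data_file": "a.cdf", "processed_data_file": "a.csv"},
    {"raw_data_file": "b.cdf", "processed_data_file": "a.csv"},
]
print(find_duplicates(rows, ["raw_data_file", "processed_data_file"]))
# {'processed_data_file': ['a.csv']}
```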

NOM Metadata Generator Subclass

class src.nom_metadata_generator.NOMMetadataGenerator(metadata_file: str, database_dump_json_path: str, raw_data_url: str, process_data_url: str, minting_config_creds: str = None)[source]

Bases: NMDCMetadataGenerator

A class for generating NMDC metadata objects using provided metadata files and configuration for Natural Organic Matter (NOM) data.

Parameters:
  • metadata_file (str) – Path to the input CSV metadata file.

  • database_dump_json_path (str) – Path where the output database dump JSON file will be saved.

  • raw_data_url (str) – Base URL for the raw data files.

  • process_data_url (str) – Base URL for the processed data files.

  • minting_config_creds (str, optional) – Path to the configuration file containing the client ID and client secret for minting NMDC IDs. It can also include the BioOntology API key if biosample IDs need to be generated. If not provided, the CLIENT_ID, CLIENT_SECRET, and BIO_API_KEY environment variables will be used.

raw_data_object_type

The type of the raw data object.

Type:

str

processed_data_object_type

The type of the processed data object.

Type:

str

processed_data_category

The category of the processed data.

Type:

str

execution_resource

The execution resource for the workflow.

Type:

str

analyte_category

The category of the analyte.

Type:

str

workflow_analysis_name

The name of the workflow analysis.

Type:

str

workflow_description

The description of the workflow.

Type:

str

workflow_param_data_category

The category of the workflow parameter data.

Type:

str

workflow_param_data_object_type

The type of the workflow parameter data object.

Type:

str

unique_columns

List of unique columns in the metadata file.

Type:

list[str]

mass_spec_desc

The description of the mass spectrometry data.

Type:

str

mass_spec_eluent_intro

The eluent introduction category for mass spectrometry.

Type:

str

processing_institution

The institution responsible for processing the data.

Type:

str

workflow_git_url

The URL of the workflow Git repository.

Type:

str

workflow_version

The version of the workflow.

Type:

str

analyte_category: str = 'nom'
execution_resource: str = 'EMSL-RZR'
generate_nom_analysis(file_path: Path, raw_data_id: str, data_gen_id: str, processed_data_id: str, CLIENT_ID: str, CLIENT_SECRET: str, calibration_id: str = None, incremented_id: str = None) NomAnalysis[source]

Generate a NOM analysis object from the provided file information.

Parameters:
  • file_path (Path) – The file path of the NOM analysis data file.

  • raw_data_id (str) – The ID of the raw data associated with the analysis.

  • data_gen_id (str) – The ID of the data generation process that informed this analysis.

  • processed_data_id (str) – The ID of the processed data resulting from this analysis.

  • CLIENT_ID (str) – The client ID for the NMDC API.

  • CLIENT_SECRET (str) – The client secret for the NMDC API.

  • calibration_id (str, optional) – The ID of the calibration object used in the analysis. If None, no calibration is used.

  • incremented_id (str, optional) – The incremented ID for the NOM analysis. If None, a new ID will be minted.

Returns:

The generated NOM analysis object.

Return type:

nmdc.NomAnalysis

get_calibration_id(calibration_path: str) str[source]

Get the calibration ID from the NMDC API using the md5 checksum of the calibration file.

Parameters:

calibration_path (str) – The file path of the calibration file.

Returns:

The calibration ID if found, otherwise None.

Return type:

str
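
The lookup above is keyed on the md5 checksum of the calibration file. Computing that checksum can be sketched with hashlib as shown below. The helper md5_checksum and the sample file are illustrative assumptions, and the NMDC API query that would consume the checksum is not shown.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_checksum(path, chunk_size=1 << 20):
    """Hypothetical helper: return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage on a throwaway file standing in for a calibration file.
with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "calibration.ref"
    sample_file.write_bytes(b"hello")
    checksum = md5_checksum(sample_file)
print(checksum)  # 5d41402abc4b2a76b9719d911017c592
```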

mass_spec_desc: str = 'ultra high resolution mass spectrum'
mass_spec_eluent_intro: str = 'direct_infusion_autosampler'
processed_data_category: str = 'processed_data'
processed_data_object_type: str = 'FT ICR-MS Analysis Results'
processing_institution: str = 'EMSL'
raw_data_object_type: str = 'Direct Infusion FT ICR-MS Raw Data'
rerun()[source]

Execute a rerun of the metadata generation process.

This method processes the metadata file, generates biosamples (if needed) and metadata, and manages the workflow for generating NOM analysis data.

This method assumes that the NOM raw data are stored on MinIO and that the raw data object URL field is populated.

run()[source]

Execute the metadata generation process.

This method processes the metadata file, generates biosamples (if needed) and metadata, and manages the workflow for generating NOM analysis data.

unique_columns: list[str] = ['raw_data_file', 'processed_data_directory']
workflow_analysis_name: str = 'NOM Analysis'
workflow_description: str = 'Natural Organic Matter analysis of raw mass spectrometry data.'
workflow_git_url: str = 'https://github.com/microbiomedata/enviroMS'
workflow_param_data_category: str = 'workflow_parameter_data'
workflow_param_data_object_type: str = 'Analysis Tool Parameter File'
workflow_version: str = '4.3.1'

Main CLI Class