This page provides documentation for the functions in the package.
Bio Ontology API Class
- class src.bio_ontology_api.BioOntologyInfoRetriever(bio_api_key: str)[source]
Bases:
object
Client for retrieving ENVO term information from BioPortal API.
A class to handle authentication and retrieval of Environmental Ontology (ENVO) terms using the BioPortal REST API service.
- Parameters:
bio_api_key (str) – The BioPortal BioOntology API key for authentication.
Notes
The configuration file should contain an ‘api_key’ field with a valid BioPortal API key.
Examples
>>> retriever = BioOntologyInfoRetriever('config.yaml')
>>> envo_terms = retriever.get_envo_terms('ENVO:00002042')
>>> print(envo_terms)
{'ENVO:00002042': 'surface water'}
- get_envo_terms(envo_id: dict) dict [source]
Look up an ENVO term label using BioPortal API.
- Parameters:
envo_id (dict) – The ENVO identifier to look up (e.g., ‘ENVO:00002042’)
- Returns:
Dictionary with envo_id as key and term label as value. Example: {'ENVO:00002042': 'surface water'}
- Return type:
dict
Notes
Makes an authenticated request to BioPortal API to retrieve the preferred label (prefLabel) for the given ENVO term.
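A minimal sketch of the kind of request described above, assuming the public BioPortal REST endpoint at data.bioontology.org, API-key authentication through the Authorization header, and OBO-style class IRIs. The helper name build_envo_request and both constants are hypothetical and are not part of this package; the package's own request code may differ.

```python
from urllib.parse import quote

# Assumed values: the public BioPortal REST base URL and the OBO PURL prefix.
BIOPORTAL_BASE = "https://data.bioontology.org"
OBO_PREFIX = "http://purl.obolibrary.org/obo/"


def build_envo_request(envo_id: str, api_key: str) -> tuple[str, dict]:
    """Build the URL and headers for a BioPortal class lookup (hypothetical helper)."""
    # 'ENVO:00002042' becomes 'http://purl.obolibrary.org/obo/ENVO_00002042'.
    iri = OBO_PREFIX + envo_id.replace(":", "_")
    # The class IRI is percent-encoded so it can be used as a single path segment.
    url = f"{BIOPORTAL_BASE}/ontologies/ENVO/classes/{quote(iri, safe='')}"
    headers = {"Authorization": f"apikey token={api_key}"}
    return url, headers


url, headers = build_envo_request("ENVO:00002042", "YOUR_API_KEY")
print(url)
```

The JSON body returned for a class is expected to carry a prefLabel field, which is the value the method reports as the term label.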
Metadata Parser Base Class
- class src.metadata_parser.MetadataParser[source]
Bases:
object
Parses metadata from the input metadata spreadsheet.
- create_controlled_identified_term_value(row_value: str, slot_enum_dict: dict) dict [source]
Create a controlled identified term value.
- Parameters:
row_value (str) – The raw value to be converted.
slot_enum_dict (dict) – A dictionary mapping the raw value to its corresponding term.
- Returns:
A dictionary representing the controlled identified term.
- Return type:
dict
- create_geo_loc_value(raw_value: str) dict [source]
Create a geolocation value representation.
- Parameters:
raw_value (str) – The raw value associated with geolocation.
- Returns:
A dictionary representing the geolocation value.
- Return type:
dict
- create_quantity_value(value_dict: dict = None) dict [source]
Create a quantity value representation. Because the attributes are passed in as a dictionary, any entries whose value is None are removed. The QuantityValue type is also added to the result.
- Parameters:
value_dict (dict) –
A dictionary containing the raw value and other attributes gathered from the metadata. This is a dict of the form:
{
"has_numeric_value": float,
"has_minimum_numeric_value": float,
"has_maximum_numeric_value": float,
"has_unit": str,
"has_raw_value": str
}
The keys in the dictionary are the attributes of the QuantityValue class. They may be passed in as None if they are not present in the metadata.
- Returns:
A dictionary representing the quantity value.
- Return type:
dict
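The None-filtering and type-tagging behavior described for create_quantity_value can be sketched as follows. This is an illustration only: the function name quantity_value_sketch and the "type" key used for the tag are assumptions, and the type string is the documented value of NmdcTypes.QuantityValue.

```python
from typing import Optional

QUANTITY_VALUE_TYPE = "nmdc:QuantityValue"  # documented value of NmdcTypes.QuantityValue


def quantity_value_sketch(value_dict: Optional[dict] = None) -> dict:
    """Drop None-valued attributes and tag the result as a QuantityValue."""
    # Treat a missing dictionary as empty, then keep only the populated attributes.
    cleaned = {k: v for k, v in (value_dict or {}).items() if v is not None}
    # Assumed key name for the type tag; the real method may name it differently.
    cleaned["type"] = QUANTITY_VALUE_TYPE
    return cleaned


result = quantity_value_sketch(
    {
        "has_numeric_value": 7.5,
        "has_minimum_numeric_value": None,
        "has_maximum_numeric_value": None,
        "has_unit": "m",
        "has_raw_value": "7.5 m",
    }
)
print(result)
```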
- create_text_value(row_value: str, is_list: bool) dict [source]
Create a text value representation.
- Parameters:
row_value (str) – The raw value to convert.
is_list (bool) – Whether to treat the value as a list.
- Returns:
A dictionary representing the text value.
- Return type:
dict
- create_timestamp_value(raw_value: str) dict [source]
Create a timestamp value representation.
- Parameters:
raw_value (str) – The raw value to convert to a timestamp.
- Returns:
A dictionary representing the timestamp value.
- Return type:
dict
- dynam_parse_biosample_metadata(row: Series, bio_api_key: str) dict [source]
Function to parse the metadata row if it includes biosample information. This pulls the most recent version of the ontology terms from the API and compares them to the values in the given row. Different parsing is done on different types of fields, such as lists, controlled identified terms, and text values to ensure the correct format is used.
- Parameters:
row (pd.Series) – A row from the DataFrame containing metadata.
bio_api_key (str) – The API key to access the Bio Ontology API
- Returns:
metadata – The metadata dictionary.
- Return type:
dict
- generate_example_biosample_csv(file_path: str = 'example_biosample_metadata.csv')[source]
Function to generate an example csv file from available NMDCSchema Biosample fields. Saves the file to the given path.
- Parameters:
file_path (str) – The path to save the example CSV file. Default is “example_biosample_metadata.csv”.
- Return type:
None
- get_value(row: Series, key: str, default: str = None) str [source]
Retrieve a value from a row, handling missing or NaN values.
- Parameters:
row (pd.Series) – A row from the DataFrame.
key (str) – The key to retrieve the value for.
default (str, optional) – Default value to return if the key does not exist or is NaN.
- Returns:
The value associated with the key, or default if not found.
- Return type:
str
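A rough sketch of the missing-or-NaN handling that get_value describes. A plain dict stands in for the pd.Series row so the example needs no third-party imports; the function name get_value_sketch is hypothetical, and the real method may detect missing values differently (for example with pandas' own NaN checks).

```python
import math
from typing import Any, Mapping, Optional


def get_value_sketch(row: Mapping[str, Any], key: str, default: Optional[str] = None) -> Any:
    """Return row[key], or default when the key is absent, None, or NaN."""
    value = row.get(key)
    if value is None:
        return default
    # Empty spreadsheet cells typically surface as float NaN.
    if isinstance(value, float) and math.isnan(value):
        return default
    return value


row = {"depth": "0.5 m", "ph": float("nan")}
print(get_value_sketch(row, "depth"))
print(get_value_sketch(row, "ph", default="n/a"))
print(get_value_sketch(row, "salinity"))
```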
Metadata Generator Base Class
- class src.metadata_generator.NMDCMetadataGenerator(metadata_file: str, database_dump_json_path: str, raw_data_url: str, process_data_url: str)[source]
Bases:
ABC
Abstract class for generating NMDC metadata objects using provided metadata files and configuration.
- Parameters:
metadata_file (str) – Path to the input CSV metadata file.
database_dump_json_path (str) – Path where the output database dump JSON file will be saved.
raw_data_url (str) – Base URL for the raw data files.
process_data_url (str) – Base URL for the processed data files.
- check_doj_urls(metadata_df: DataFrame, url_columns: List) None [source]
Check if the URLs in the input list already exist in the database.
- Parameters:
metadata_df (pd.DataFrame) – The DataFrame containing the metadata information.
url_columns (List) – The list of columns in the DataFrame that contain URLs to check.
- Return type:
None
- Raises:
ValueError – If any URL in the metadata DataFrame is invalid or inaccessible.
FileNotFoundError – If no files are found in the specified directory columns.
- check_for_biosamples(metadata_df: DataFrame, nmdc_database_inst: Database, CLIENT_ID: str, CLIENT_SECRET: str) None [source]
Verify the presence of 'biosample_id' in the provided metadata DataFrame. The method loops over each row, so some rows may already have a biosample_id while others need one generated. If 'biosample_id' is missing, it checks for the columns required to generate a new biosample_id using the NMDC API. If they are all present, it calls the dynam_parse_biosample_metadata method of the MetadataParser class to create the JSON for the biosample. If the required columns are missing and there is no biosample_id, it raises a ValueError. After the biosample_id is generated, it updates the DataFrame row with the newly minted biosample_id and adds the new biosample JSON to the NMDC database instance.
- Parameters:
metadata_df (pd.DataFrame) – The DataFrame containing the metadata information.
nmdc_database_inst (nmdc.Database) – The NMDC Database instance to add the biosample to if one needs to be generated.
CLIENT_ID (str) – The client ID for the NMDC API. Used to mint a biosample ID if one does not exist.
CLIENT_SECRET (str) – The client secret for the NMDC API. Used to mint a biosample ID if one does not exist.
- Return type:
None
- Raises:
ValueError – If the ‘biosample.name’ column is missing and ‘biosample_id’ is empty. If any required columns for biosample generation are missing.
- clean_dict(dict: Dict) Dict [source]
Clean the dictionary by removing keys with empty or None values.
- Parameters:
dict (Dict) – The dictionary to be cleaned.
- Returns:
A new dictionary with keys removed where the values are None, an empty string, or a string with only whitespace.
- Return type:
Dict
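The cleaning rule stated above (drop None, empty strings, and whitespace-only strings) can be sketched as below. The name clean_dict_sketch is hypothetical; note that under this rule falsy values such as 0 are kept.

```python
from typing import Any, Dict


def clean_dict_sketch(data: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict without None, empty, or whitespace-only string values."""
    return {
        key: value
        for key, value in data.items()
        if value is not None and not (isinstance(value, str) and value.strip() == "")
    }


cleaned = clean_dict_sketch(
    {"name": "sample 1", "depth": None, "notes": "", "site": "   ", "count": 0}
)
print(cleaned)
```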
- dump_nmdc_database(nmdc_database: Database) None [source]
Dump the NMDC database to a JSON file.
This method serializes the NMDC Database instance to a JSON file at the specified path.
- Parameters:
nmdc_database (nmdc.Database) – The NMDC Database instance to dump.
- Returns:
None
Side Effects
Writes the database content to the file specified by self.database_dump_json_path.
- generate_biosample(biosamp_metadata: dict, CLIENT_ID: str, CLIENT_SECRET: str) Biosample [source]
Mint a biosample id from the given metadata and create a biosample instance.
- Parameters:
biosamp_metadata (dict) – The metadata object containing biosample information.
CLIENT_ID (str) – The client ID for the NMDC API.
CLIENT_SECRET (str) – The client secret for the NMDC API.
- Returns:
The generated biosample instance.
- Return type:
nmdc.Biosample
- generate_data_object(file_path: Path, data_category: str, data_object_type: str, description: str, base_url: str, CLIENT_ID: str, CLIENT_SECRET: str, was_generated_by: str = None, alternative_id: str = None) DataObject [source]
Create an NMDC DataObject with metadata from the specified file and details.
This method generates an NMDC DataObject and assigns it a unique NMDC ID. The DataObject is populated with metadata derived from the provided file and input parameters.
- Parameters:
file_path (Path) – Path to the file representing the data object. The file’s name is used as the name attribute.
data_category (str) – Category of the data object (e.g., ‘instrument_data’).
data_object_type (str) – Type of the data object (e.g., ‘LC-DDA-MS/MS Raw Data’).
description (str) – Description of the data object.
base_url (str) – Base URL for accessing the data object, to which the file name is appended to form the complete URL.
CLIENT_ID (str) – The client ID for the NMDC API.
CLIENT_SECRET (str) – The client secret for the NMDC API.
was_generated_by (str, optional) – ID of the process or entity that generated the data object (e.g., the DataGeneration id or the MetabolomicsAnalysis id).
alternative_id (str, optional) – An optional alternative identifier for the data object.
- Returns:
An NMDC DataObject instance with the specified metadata.
- Return type:
nmdc.DataObject
Notes
This method calculates the MD5 checksum of the file, which may be time-consuming for large files.
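To illustrate why the checksum step can be slow and how it is usually kept memory-safe, here is a sketch that hashes a file in fixed-size chunks and builds the object URL by appending the file name to the base URL. The helper names md5_of_file and data_object_url, the chunk size, and the example.org URL are assumptions for illustration; the package may compute these differently.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(file_path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with file_path.open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def data_object_url(base_url: str, file_path: Path) -> str:
    """Append the file name to the base URL (assumes base_url ends with '/')."""
    return base_url + file_path.name


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.raw"
    sample.write_bytes(b"hello")
    checksum = md5_of_file(sample)
    url = data_object_url("https://example.org/raw/", sample)

print(checksum, url)
```

The cost grows linearly with file size, so multi-gigabyte raw files will dominate the run time of this method.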
- generate_mass_spectrometry(file_path: Path, instrument_name: str, sample_id: str, raw_data_id: str, study_id: str, processing_institution: str, mass_spec_config_name: str, start_date: str, end_date: str, CLIENT_ID: str, CLIENT_SECRET: str, lc_config_name: str = None, calibration_id: str = None) DataGeneration [source]
Create an NMDC DataGeneration object for mass spectrometry and mint an NMDC ID.
- Parameters:
file_path (Path) – File path of the mass spectrometry data.
instrument_name (str) – Name of the instrument used for data generation.
sample_id (str) – ID of the input sample associated with the data generation.
raw_data_id (str) – ID of the raw data object associated with the data generation.
study_id (str) – ID of the study associated with the data generation.
processing_institution (str) – Name of the processing institution.
mass_spec_config_name (str) – Name of the mass spectrometry configuration.
start_date (str) – Start date of the data generation.
end_date (str) – End date of the data generation.
CLIENT_ID (str) – The client ID for the NMDC API.
CLIENT_SECRET (str) – The client secret for the NMDC API.
lc_config_name (str, optional) – Name of the liquid chromatography configuration. Default is None.
calibration_id (str, optional) – ID of the calibration information generated with the data. Default is None, indicating no calibration information.
- Returns:
An NMDC DataGeneration object with the provided metadata.
- Return type:
nmdc.DataGeneration
Notes
This method uses the nmdc_api_utilities package to fetch IDs for the instrument and configurations. It also mints a new NMDC ID for the DataGeneration object.
- generate_metabolomics_analysis(cluster_name: str, raw_data_name: str, raw_data_id: str, data_gen_id: str, processed_data_id: str, parameter_data_id: str, processing_institution: str, CLIENT_ID: str, CLIENT_SECRET: str, calibration_id: str = None, incremeneted_id: str = None, metabolite_identifications: List[MetaboliteIdentification] = None, type: str = 'nmdc:MetabolomicsAnalysis') MetabolomicsAnalysis [source]
Create an NMDC MetabolomicsAnalysis object with metadata for a workflow analysis.
This method generates an NMDC MetabolomicsAnalysis object, including details about the analysis, the processing institution, and relevant workflow information.
- Parameters:
cluster_name (str) – Name of the cluster or computing resource used for the analysis.
raw_data_name (str) – Name of the raw data file that was analyzed.
raw_data_id (str) – ID of the raw data object that was analyzed.
data_gen_id (str) – ID of the DataGeneration object that generated the raw data.
processed_data_id (str) – ID of the processed data resulting from the analysis.
parameter_data_id (str) – ID of the parameter data object used for the analysis.
processing_institution (str) – Name of the institution where the analysis was performed.
CLIENT_ID (str) – The client ID for the NMDC API.
CLIENT_SECRET (str) – The client secret for the NMDC API.
calibration_id (str, optional) – ID of the calibration information used for the analysis. Default is None, indicating no calibration information.
incremeneted_id (str, optional) – An optional incremented ID for the MetabolomicsAnalysis object. If not provided, a new NMDC ID will be minted.
metabolite_identifications (List[nmdc.MetaboliteIdentification], optional) – List of MetaboliteIdentification objects associated with the analysis. Default is None, which indicates no metabolite identifications.
type (str, optional) – The type of the analysis. Default is NmdcTypes.MetabolomicsAnalysis.
- Returns:
An NMDC MetabolomicsAnalysis instance with the provided metadata.
- Return type:
nmdc.MetabolomicsAnalysis
Notes
The ‘started_at_time’ and ‘ended_at_time’ fields are initialized with placeholder values and should be updated with actual timestamps later when the processed files are iterated over in the run method.
- handle_biosample(row: Series) tuple [source]
Process biosample information from metadata row.
Checks if a biosample ID exists in the row. If it does, returns the existing biosample information. If not, generates a new biosample.
- Parameters:
row (pd.Series) – A row from the metadata DataFrame containing biosample information
- Returns:
A tuple containing:
- emsl_metadata (Dict) – Parsed metadata from the input CSV row.
- biosample_id (str) – The ID of the biosample (existing or newly generated).
- Return type:
tuple
- load_bio_credentials(config_file: str = None) str [source]
Load bio ontology API key from the environment or a configuration file.
- Parameters:
config_file (str, optional) – The path to the configuration file. Default is None.
- Returns:
The bio ontology API key.
- Return type:
str
- Raises:
FileNotFoundError – If the configuration file is not found, and the API key is not set in the environment.
ValueError – If the configuration file is not valid or does not contain the API key.
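One way the lookup and its two error cases could fit together is sketched below. Several details are assumptions, not documented behavior: the configuration file is read as JSON (the real file format is not stated here), the key is stored under 'api_key', the file is tried before the BIO_API_KEY environment variable, and the function name load_bio_key_sketch is hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Optional


def load_bio_key_sketch(config_file: Optional[str] = None) -> str:
    """Return the API key from a config file if present, else from the environment."""
    if config_file and Path(config_file).is_file():
        try:
            # Assumption: JSON is used here only to keep the sketch stdlib-only.
            config = json.loads(Path(config_file).read_text())
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid configuration file: {config_file}") from exc
        if not isinstance(config, dict) or not config.get("api_key"):
            raise ValueError("Configuration file does not contain 'api_key'.")
        return config["api_key"]
    # No usable file: fall back to the environment variable.
    key = os.environ.get("BIO_API_KEY")
    if key:
        return key
    raise FileNotFoundError("No configuration file found and BIO_API_KEY is not set.")


os.environ["BIO_API_KEY"] = "demo-key"  # demo value only
key = load_bio_key_sketch()
print(key)
```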
- load_credentials(config_file: str = None) tuple [source]
Load the client ID and secret from the environment or a configuration file.
- Parameters:
config_file (str, optional) – The path to the configuration file. Default is None.
- Returns:
A tuple containing the client ID and client secret.
- Return type:
tuple
- load_metadata() DataFrame [source]
Load and group workflow metadata from a CSV file.
This method reads the metadata CSV file, checks for uniqueness in specified columns, checks that biosamples exist, and groups the data by biosample ID.
- Returns:
A DataFrame containing the loaded and grouped metadata.
- Return type:
pd.core.frame.DataFrame
- Raises:
FileNotFoundError – If the metadata_file does not exist.
ValueError – If values in the columns ‘Raw Data File’ and ‘Processed Data Directory’ are not unique.
Notes
See example_metadata_file.csv in this directory for an example of the expected input file format.
- start_nmdc_database() Database [source]
Initialize and return a new NMDC Database instance.
- Returns:
A new instance of an NMDC Database.
- Return type:
nmdc.Database
Notes
This method simply creates and returns a new instance of the NMDC Database. It does not perform any additional initialization or configuration.
- update_outputs(analysis_obj: object, raw_data_obj_id: str, parameter_data_id: str, processed_data_id_list: list, mass_spec_obj: object = None, rerun: bool = False) None [source]
Update output references for Mass Spectrometry and Workflow Analysis objects.
This method assigns the output references for a Mass Spectrometry object and a Workflow Execution Analysis object. It sets mass_spec_obj.has_output to the ID of raw_data_obj and analysis_obj.has_output to a list of processed data IDs.
- Parameters:
analysis_obj (object) – The Workflow Execution Analysis object to update (e.g., MetabolomicsAnalysis).
raw_data_obj_id (str) – ID of the Raw Data Object associated with the Mass Spectrometry.
parameter_data_id (str) – ID of the data object representing the parameter data used for the analysis.
processed_data_id_list (list) – List of IDs representing processed data objects associated with the Workflow Execution.
mass_spec_obj (object, optional) – The Mass Spectrometry object to update. Optional for rerun cases.
rerun (bool, optional) – If True, this indicates the run is a rerun, and the method will not set mass_spec_obj.has_output because there is not one. Default is False.
- Return type:
None
Notes
Sets mass_spec_obj.has_output to [raw_data_obj_id].
Sets analysis_obj.has_output to processed_data_id_list.
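The two assignments in the notes above, together with the rerun case, can be sketched with stand-in objects. SimpleNamespace replaces the real NMDC objects, the IDs are placeholders, and the name update_outputs_sketch is hypothetical. The parameter_data_id argument is left out because its handling is not described on this page.

```python
from types import SimpleNamespace
from typing import List


def update_outputs_sketch(
    analysis_obj,
    raw_data_obj_id: str,
    processed_data_id_list: List[str],
    mass_spec_obj=None,
    rerun: bool = False,
) -> None:
    """Attach output IDs to the mass spectrometry and analysis objects."""
    # On a rerun there is no mass spectrometry object to update.
    if not rerun and mass_spec_obj is not None:
        mass_spec_obj.has_output = [raw_data_obj_id]
    analysis_obj.has_output = processed_data_id_list


mass_spec = SimpleNamespace(has_output=None)
analysis = SimpleNamespace(has_output=None)
update_outputs_sketch(
    analysis,
    raw_data_obj_id="nmdc:dobj-raw-example",
    processed_data_id_list=["nmdc:dobj-csv-example", "nmdc:dobj-hdf5-example"],
    mass_spec_obj=mass_spec,
)
print(mass_spec.has_output, analysis.has_output)
```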
LC/MS Metadata Generator Base Class
- class src.lcms_metadata_generator.LCMSMetadataGenerator(metadata_file: str, database_dump_json_path: str, raw_data_url: str, process_data_url: str)[source]
Bases:
NMDCMetadataGenerator
A class for generating NMDC metadata objects using provided metadata files and configuration for LC-MS data.
This class processes input metadata files, generates various NMDC objects, and produces a database dump in JSON format.
- create_workflow_metadata(row: dict[str, str]) LCMSLipidWorkflowMetadata [source]
Create a LCMSLipidWorkflowMetadata object from a dictionary of workflow metadata.
- Parameters:
row (dict[str, str]) – Dictionary containing metadata for a workflow. This is typically a row from the input metadata CSV file.
- Returns:
A LCMSLipidWorkflowMetadata object populated with data from the input dictionary.
- Return type:
LCMSLipidWorkflowMetadata
Notes
The input dictionary is expected to contain the following keys: ‘Processed Data Directory’, ‘Raw Data File’, ‘Raw Data Object Alt Id’, ‘mass spec configuration name’, ‘lc config name’, ‘instrument used’, ‘instrument analysis start date’, ‘instrument analysis end date’, ‘execution resource’.
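A sketch of how the listed CSV keys might map onto the documented LCMSLipidWorkflowMetadata fields. The data class is redeclared locally so the example is self-contained, the function name and all row values are hypothetical, and 'Raw Data Object Alt Id' is not mapped because the documented data class has no matching field.

```python
from dataclasses import dataclass


@dataclass
class LCMSLipidWorkflowMetadata:
    """Local mirror of the documented data class, for illustration only."""

    processed_data_dir: str
    raw_data_file: str
    mass_spec_config_name: str
    lc_config_name: str
    instrument_used: str
    instrument_analysis_start_date: str
    instrument_analysis_end_date: str
    execution_resource: float


def create_workflow_metadata_sketch(row: dict) -> LCMSLipidWorkflowMetadata:
    """Map one metadata CSV row (as a dict) onto the data class fields."""
    return LCMSLipidWorkflowMetadata(
        processed_data_dir=row["Processed Data Directory"],
        raw_data_file=row["Raw Data File"],
        mass_spec_config_name=row["mass spec configuration name"],
        lc_config_name=row["lc config name"],
        instrument_used=row["instrument used"],
        instrument_analysis_start_date=row["instrument analysis start date"],
        instrument_analysis_end_date=row["instrument analysis end date"],
        execution_resource=row["execution resource"],
    )


# Placeholder row; every value here is invented for the example.
example_row = {
    "Processed Data Directory": "processed/sample_01",
    "Raw Data File": "raw/sample_01.raw",
    "Raw Data Object Alt Id": "alt-0001",
    "mass spec configuration name": "example MS config",
    "lc config name": "example LC config",
    "instrument used": "example instrument",
    "instrument analysis start date": "2024-01-01",
    "instrument analysis end date": "2024-01-02",
    "execution resource": "example resource",
}
workflow_metadata = create_workflow_metadata_sketch(example_row)
print(workflow_metadata.raw_data_file)
```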
- rerun() None [source]
Execute a rerun of the metadata generation process for metabolomics data.
This method performs the following steps:
1. Initialize an NMDC Database instance.
2. Load and process metadata to create NMDC objects.
3. Generate Metabolomics Analysis and Processed Data objects.
4. Update outputs for the Metabolomics Analysis object.
5. Append generated objects to the NMDC Database.
6. Dump the NMDC Database to a JSON file.
7. Validate the JSON file using the NMDC API.
- Return type:
None
- Raises:
FileNotFoundError – If the processed data directory is empty or not found.
ValueError – If the number of files in the processed data directory is not as expected.
Notes
This method uses tqdm to display progress bars for the processing of biosamples and mass spectrometry metadata.
- run() None [source]
Execute the metadata generation process for lipidomics data.
This method performs the following steps:
1. Initialize an NMDC Database instance.
2. Load and process metadata to create NMDC objects.
3. Generate Mass Spectrometry, Raw Data, Metabolomics Analysis, and Processed Data objects.
4. Update outputs for Mass Spectrometry and Metabolomics Analysis objects.
5. Append generated objects to the NMDC Database.
6. Dump the NMDC Database to a JSON file.
7. Validate the JSON file using the NMDC API.
- Return type:
None
- Raises:
FileNotFoundError – If the processed data directory is empty or not found.
ValueError – If the number of files in the processed data directory is not as expected.
Notes
This method uses tqdm to display progress bars for the processing of biosamples and mass spectrometry metadata.
Data Class
- class src.data_classes.NmdcTypes(Biosample: str = 'nmdc:Biosample', MassSpectrometry: str = 'nmdc:MassSpectrometry', MetabolomicsAnalysis: str = 'nmdc:MetabolomicsAnalysis', DataObject: str = 'nmdc:DataObject', CalibrationInformation: str = 'nmdc:CalibrationInformation', MetaboliteIdentification: str = 'nmdc:MetaboliteIdentification', NomAnalysis: str = 'nmdc:NomAnalysis', OntologyClass: str = 'nmdc:OntologyClass', ControlledIdentifiedTermValue: str = 'nmdc:ControlledIdentifiedTermValue', TextValue: str = 'nmdc:TextValue', GeolocationValue: str = 'nmdc:GeolocationValue', TimeStampValue: str = 'nmdc:TimestampValue', QuantityValue: str = 'nmdc:QuantityValue')[source]
Bases:
object
Data class holding NMDC type constants.
- Biosample
NMDC type for Biosample.
- Type:
str
- MassSpectrometry
NMDC type for Mass Spectrometry.
- Type:
str
- MetabolomicsAnalysis
NMDC type for Metabolomics Analysis.
- Type:
str
- DataObject
NMDC type for Data Object.
- Type:
str
- CalibrationInformation
NMDC type for Calibration Information.
- Type:
str
- MetaboliteIdentification
NMDC type for Metabolite Identification.
- Type:
str
- NomAnalysis
NMDC type for NOM Analysis.
- Type:
str
- OntologyClass
NMDC type for Ontology Class.
- Type:
str
- ControlledIdentifiedTermValue
NMDC type for Controlled Identified Term Value.
- Type:
str
- TextValue
NMDC type for Text Value.
- Type:
str
- GeolocationValue
NMDC type for Geolocation Value.
- Type:
str
- TimeStampValue
NMDC type for Timestamp Value.
- Type:
str
- QuantityValue
NMDC type for Quantity Value.
- Type:
str
- Biosample: str = 'nmdc:Biosample'
- CalibrationInformation: str = 'nmdc:CalibrationInformation'
- ControlledIdentifiedTermValue: str = 'nmdc:ControlledIdentifiedTermValue'
- DataObject: str = 'nmdc:DataObject'
- GeolocationValue: str = 'nmdc:GeolocationValue'
- MassSpectrometry: str = 'nmdc:MassSpectrometry'
- MetaboliteIdentification: str = 'nmdc:MetaboliteIdentification'
- MetabolomicsAnalysis: str = 'nmdc:MetabolomicsAnalysis'
- NomAnalysis: str = 'nmdc:NomAnalysis'
- OntologyClass: str = 'nmdc:OntologyClass'
- QuantityValue: str = 'nmdc:QuantityValue'
- TextValue: str = 'nmdc:TextValue'
- TimeStampValue: str = 'nmdc:TimestampValue'
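A short sketch of how type constants like these are typically used when building records. The class below is an abbreviated local stand-in with three of the documented values, not the package's NmdcTypes, and the record ID is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class NmdcTypesSketch:
    """Abbreviated stand-in holding three of the documented type strings."""

    Biosample: str = "nmdc:Biosample"
    DataObject: str = "nmdc:DataObject"
    QuantityValue: str = "nmdc:QuantityValue"


# Field defaults are also readable on the class itself, so no instance is needed.
record = {"id": "nmdc:dobj-example", "type": NmdcTypesSketch.DataObject}
print(record)
```

Referring to the constant rather than retyping the string keeps type tags consistent across the generators.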
- class src.data_classes.GCMSMetabWorkflowMetadata(biosample_id: str, nmdc_study: str, processing_institution: str, processed_data_file: str, raw_data_file: str, mass_spec_config_name: str, chromat_config_name: str, instrument_used: str, instrument_analysis_start_date: str, instrument_analysis_end_date: str, execution_resource: float, calibration_id: str)[source]
Bases:
object
Data class for holding GCMS metabolomic workflow metadata information.
- biosample_id
Identifier for the biosample.
- Type:
str
- nmdc_study
Identifier for the NMDC study.
- Type:
str
- processing_institution
Name of the institution processing the data.
- Type:
str
- processed_data_file
Path or name of the processed data file.
- Type:
str
- raw_data_file
Path or name of the raw data file.
- Type:
str
- mass_spec_config_name
Name of the mass spectrometry configuration used.
- Type:
str
- chromat_config_name
Name of the chromatography configuration used.
- Type:
str
- instrument_used
Name of the instrument used for analysis.
- Type:
str
- instrument_analysis_start_date
Start date of the instrument analysis.
- Type:
str
- instrument_analysis_end_date
End date of the instrument analysis.
- Type:
str
- execution_resource
Identifier for the execution resource.
- Type:
float
- calibration_id
Identifier for the calibration information used.
- Type:
str
- biosample_id: str
- calibration_id: str
- chromat_config_name: str
- execution_resource: float
- instrument_analysis_end_date: str
- instrument_analysis_start_date: str
- instrument_used: str
- mass_spec_config_name: str
- nmdc_study: str
- processed_data_file: str
- processing_institution: str
- raw_data_file: str
- class src.data_classes.LCMSLipidWorkflowMetadata(processed_data_dir: str, raw_data_file: str, mass_spec_config_name: str, lc_config_name: str, instrument_used: str, instrument_analysis_start_date: str, instrument_analysis_end_date: str, execution_resource: float)[source]
Bases:
object
Data class for holding LC-MS lipidomics workflow metadata information.
- processed_data_dir
Directory containing processed data files.
- Type:
str
- raw_data_file
Path or name of the raw data file.
- Type:
str
- mass_spec_config_name
Name of the mass spectrometry configuration used.
- Type:
str
- lc_config_name
Name of the liquid chromatography configuration used.
- Type:
str
- instrument_used
Name of the instrument used for analysis.
- Type:
str
- instrument_analysis_start_date
Start date of the instrument analysis.
- Type:
str
- instrument_analysis_end_date
End date of the instrument analysis.
- Type:
str
- execution_resource
Identifier for the execution resource.
- Type:
float
- execution_resource: float
- instrument_analysis_end_date: str
- instrument_analysis_start_date: str
- instrument_used: str
- lc_config_name: str
- mass_spec_config_name: str
- processed_data_dir: str
- raw_data_file: str
LC/MS Lipidomics Metadata Generator Subclass
- class src.lcms_lipid_metadata_generator.LCMSLipidomicsMetadataGenerator(metadata_file: str, database_dump_json_path: str, raw_data_url: str, process_data_url: str, minting_config_creds: str = None)[source]
Bases:
LCMSMetadataGenerator
A class for generating NMDC metadata objects using provided metadata files and configuration for LC-MS lipidomics data.
This class processes input metadata files, generates various NMDC objects, and produces a database dump in JSON format.
- Parameters:
metadata_file (str) – Path to the input CSV metadata file.
database_dump_json_path (str) – Path where the output database dump JSON file will be saved.
raw_data_url (str) – Base URL for the raw data files.
process_data_url (str) – Base URL for the processed data files.
minting_config_creds (str, optional) – Path to the configuration file containing the client ID and client secret for minting NMDC IDs. It can also include the bio ontology API key if generating biosample IDs is needed. If not provided, the CLIENT_ID, CLIENT_SECRET, and BIO_API_KEY environment variables will be used.
- unique_columns
List of unique columns in the metadata file.
- Type:
List[str]
- mass_spec_desc
Description of the mass spectrometry analysis.
- Type:
str
- mass_spec_eluent_intro
Eluent introduction category for mass spectrometry.
- Type:
str
- analyte_category
Category of the analyte.
- Type:
str
- raw_data_obj_type
Type of the raw data object.
- Type:
str
- raw_data_obj_desc
Description of the raw data object.
- Type:
str
- workflow_analysis_name
Name of the workflow analysis.
- Type:
str
- workflow_description
Description of the workflow.
- Type:
str
- workflow_git_url
URL of the workflow’s Git repository.
- Type:
str
- workflow_version
Version of the workflow.
- Type:
str
- workflow_category
Category of the workflow.
- Type:
str
- wf_config_process_data_category
Category of the workflow configuration process data.
- Type:
str
- wf_config_process_data_obj_type
Type of the workflow configuration process data object.
- Type:
str
- wf_config_process_data_description
Description of the workflow configuration process data.
- Type:
str
- no_config_process_data_category
Category for processed data without configuration.
- Type:
str
- no_config_process_data_obj_type
Type of processed data object without configuration.
- Type:
str
- csv_process_data_description
Description of CSV processed data.
- Type:
str
- hdf5_process_data_obj_type
Type of HDF5 processed data object.
- Type:
str
- hdf5_process_data_description
Description of HDF5 processed data.
- Type:
str
- analyte_category: str = 'lipidome'
- csv_process_data_description: str = 'Lipid annotations as a result of a lipidomics workflow activity.'
- hdf5_process_data_description: str = 'CoreMS hdf5 file representing a lipidomics data file including annotations.'
- hdf5_process_data_obj_type: str = 'LC-MS Lipidomics Processed Data'
- mass_spec_desc: str = 'Generation of mass spectrometry data for the analysis of lipids.'
- mass_spec_eluent_intro: str = 'liquid_chromatography'
- no_config_process_data_category: str = 'processed_data'
- no_config_process_data_obj_type: str = 'LC-MS Lipidomics Results'
- raw_data_obj_desc: str = 'LC-DDA-MS/MS raw data for lipidomics data acquisition.'
- raw_data_obj_type: str = 'LC-DDA-MS/MS Raw Data'
- rerun()[source]
Execute a rerun of the metadata generation process for metabolomics data.
This method performs the following steps:
1. Initialize an NMDC Database instance.
2. Load and process metadata to create NMDC objects.
3. Generate Metabolomics Analysis and Processed Data objects.
4. Update outputs for the Metabolomics Analysis object.
5. Append generated objects to the NMDC Database.
6. Dump the NMDC Database to a JSON file.
7. Validate the JSON file using the NMDC API.
- Return type:
None
- Raises:
FileNotFoundError – If the processed data directory is empty or not found.
ValueError – If the number of files in the processed data directory is not as expected.
Notes
This method uses tqdm to display progress bars for the processing of biosamples and mass spectrometry metadata.
- run()[source]
Execute the metadata generation process for lipidomics data.
This method performs the following steps:
1. Initialize an NMDC Database instance.
2. Load and process metadata to create NMDC objects.
3. Generate Mass Spectrometry, Raw Data, Metabolomics Analysis, and Processed Data objects.
4. Update outputs for Mass Spectrometry and Metabolomics Analysis objects.
5. Append generated objects to the NMDC Database.
6. Dump the NMDC Database to a JSON file.
7. Validate the JSON file using the NMDC API.
- Return type:
None
- Raises:
FileNotFoundError – If the processed data directory is empty or not found.
ValueError – If the number of files in the processed data directory is not as expected.
Notes
This method uses tqdm to display progress bars for the processing of biosamples and mass spectrometry metadata.
- unique_columns: List[str] = ['raw_data_file', 'processed_data_directory']
- wf_config_process_data_category: str = 'workflow_parameter_data'
- wf_config_process_data_description: str = 'CoreMS parameters used for Lipidomics workflow.'
- wf_config_process_data_obj_type: str = 'Configuration toml'
- workflow_analysis_name: str = 'Lipidomics analysis'
- workflow_category: str = 'lc_ms_lipidomics'
- workflow_description: str = 'Analysis of raw mass spectrometry data for the annotation of lipids.'
- workflow_git_url: str = 'https://github.com/microbiomedata/metaMS/wdl/metaMS_lipidomics.wdl'
- workflow_version: str = '1.0.0'
LC/MS Metabolomics Metadata Generator Subclass
- class src.lcms_metab_metadata_generator.LCMSMetabolomicsMetadataGenerator(metadata_file: str, database_dump_json_path: str, raw_data_url: str, process_data_url: str, minting_config_creds: str = None)[source]
Bases:
LCMSMetadataGenerator
A class for generating NMDC metadata objects using provided metadata files and configuration for LC-MS metabolomics data.
This class processes input metadata files, generates various NMDC objects, and produces a database dump in JSON format.
- Parameters:
metadata_file (str) – Path to the input CSV metadata file.
database_dump_json_path (str) – Path where the output database dump JSON file will be saved.
raw_data_url (str) – Base URL for the raw data files.
process_data_url (str) – Base URL for the processed data files.
minting_config_creds (str, optional) – Path to the configuration file containing the client ID and client secret for minting NMDC IDs. It can also include the bio ontology API key if generating biosample IDs is needed. If not provided, the CLIENT_ID, CLIENT_SECRET, and BIO_API_KEY environment variables will be used.
- unique_columns
List of unique columns in the metadata file.
- Type:
list[str]
- mass_spec_desc
Description of the mass spectrometry analysis.
- Type:
str
- mass_spec_eluent_intro
Eluent introduction category for mass spectrometry.
- Type:
str
- analyte_category
Category of the analyte.
- Type:
str
- raw_data_obj_type
Type of the raw data object.
- Type:
str
- raw_data_obj_desc
Description of the raw data object.
- Type:
str
- workflow_analysis_name
Name of the workflow analysis.
- Type:
str
- workflow_description
Description of the workflow.
- Type:
str
- workflow_git_url
URL of the workflow’s Git repository.
- Type:
str
- workflow_version
Version of the workflow.
- Type:
str
- workflow_category
Category of the workflow.
- Type:
str
- wf_config_process_data_category
Category of the workflow configuration process data.
- Type:
str
- wf_config_process_data_obj_type
Type of the workflow configuration process data object.
- Type:
str
- wf_config_process_data_description
Description of the workflow configuration process data.
- Type:
str
- no_config_process_data_category
Category for processed data without configuration.
- Type:
str
- no_config_process_data_obj_type
Type of processed data object without configuration.
- Type:
str
- csv_process_data_description
Description of CSV processed data.
- Type:
str
- hdf5_process_data_obj_type
Type of HDF5 processed data object.
- Type:
str
- hdf5_process_data_description
Description of HDF5 processed data.
- Type:
str
- analyte_category: str = 'metabolome'
- csv_process_data_description: str = 'Metabolite annotations as a result of a metabolomics workflow activity.'
- hdf5_process_data_description: str = 'CoreMS hdf5 file representing a metabolomics data file including annotations.'
- hdf5_process_data_obj_type: str = 'LC-MS Metabolomics Processed Data'
- mass_spec_desc: str = 'Generation of mass spectrometry data for the analysis of metabolomics using liquid chromatography.'
- mass_spec_eluent_intro: str = 'liquid_chromatography'
- no_config_process_data_category: str = 'processed_data'
- no_config_process_data_obj_type: str = 'LC-MS Metabolomics Results'
- raw_data_obj_desc: str = 'LC-DDA-MS/MS raw data for metabolomics data acquisition.'
- raw_data_obj_type: str = 'LC-DDA-MS/MS Raw Data'
- rerun()[source]
Execute a rerun of the metadata generation process for metabolomics data.
This method performs the following steps:
1. Initialize an NMDC Database instance.
2. Load and process metadata to create NMDC objects.
3. Generate Metabolomics Analysis and Processed Data objects.
4. Update outputs for the Metabolomics Analysis object.
5. Append generated objects to the NMDC Database.
6. Dump the NMDC Database to a JSON file.
7. Validate the JSON file using the NMDC API.
- Return type:
None
- Raises:
FileNotFoundError – If the processed data directory is empty or not found.
ValueError – If the number of files in the processed data directory is not as expected.
Notes
This method uses tqdm to display progress bars for the processing of biosamples and mass spectrometry metadata.
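The two documented exceptions suggest a directory check along the following lines. This is a sketch under stated assumptions, not the generator's own code: the function name `list_processed_files` and the `expected_count` parameter are hypothetical, and the documentation does not say how many files are expected.

```python
from pathlib import Path
from typing import List


def list_processed_files(directory: str, expected_count: int) -> List[Path]:
    """Hypothetical check that mirrors the documented exceptions."""
    path = Path(directory)
    if not path.is_dir():
        raise FileNotFoundError(f"Processed data directory not found: {directory}")
    # Only regular files are counted; subdirectories are ignored.
    files = sorted(p for p in path.iterdir() if p.is_file())
    if not files:
        raise FileNotFoundError(f"Processed data directory is empty: {directory}")
    if len(files) != expected_count:
        # Assumption: the caller knows how many files a complete run produces.
        raise ValueError(
            f"Expected {expected_count} files in {directory}, found {len(files)}"
        )
    return files
```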
- run()[source]
Execute the metadata generation process for metabolomics data.
This method performs the following steps:
1. Initialize an NMDC Database instance.
2. Load and process metadata to create NMDC objects.
3. Generate Mass Spectrometry, Raw Data, Metabolomics Analysis, and Processed Data objects.
4. Update outputs for Mass Spectrometry and Metabolomics Analysis objects.
5. Append generated objects to the NMDC Database.
6. Dump the NMDC Database to a JSON file.
7. Validate the JSON file using the NMDC API.
- Return type:
None
- Raises:
FileNotFoundError – If the processed data directory is empty or not found.
ValueError – If the number of files in the processed data directory is not as expected.
Notes
This method uses tqdm to display progress bars for the processing of biosamples and mass spectrometry metadata.
- unique_columns: list[str] = ['raw_data_file', 'processed_data_directory']
- wf_config_process_data_category: str = 'workflow_parameter_data'
- wf_config_process_data_description: str = 'CoreMS parameters used for metabolomics workflow.'
- wf_config_process_data_obj_type: str = 'Configuration toml'
- workflow_analysis_name: str = 'Metabolomics analysis'
- workflow_category: str = 'lc_ms_metabolomics'
- workflow_description: str = 'Analysis of raw mass spectrometry data for the annotation of metabolites.'
- workflow_git_url: str = 'https://github.com/microbiomedata/metaMS/wdl/metaMS_lcms_metabolomics.wdl'
- workflow_version: str = '1.0.0'
GC/MS Metabolomics Metadata Generator Subclass
- class src.gcms_metab_metadata_generator.GCMSMetabolomicsMetadataGenerator(metadata_file: str, database_dump_json_path: str, raw_data_url: str, process_data_url: str, minting_config_creds: str = None, calibration_standard: str = 'fames', configuration_file_name: str = 'emsl_gcms_corems_params.toml')[source]
Bases:
NMDCMetadataGenerator
A class for generating NMDC metadata objects related to GC/MS metabolomics data.
This class processes input metadata files, generates various NMDC objects, and produces a database dump in JSON format.
- Parameters:
metadata_file (str) – Path to the metadata CSV file.
database_dump_json_path (str) – Path to the output JSON file for the NMDC database dump.
raw_data_url (str) – Base URL for the raw data files.
process_data_url (str) – Base URL for the processed data files.
minting_config_creds (str) – Path to the minting configuration credentials file.
calibration_standard (str, optional) – Calibration standard used for the data. Default is “fames”.
configuration_file_name (str, optional) – Name of the configuration file. Default is “emsl_gcms_corems_params.toml”.
- unique_columns
List of columns used to check for uniqueness in the metadata before processing.
- Type:
List[str]
- mass_spec_desc
Description of the mass spectrometry analysis.
- Type:
str
- mass_spec_eluent_intro
Eluent introduction category for mass spectrometry.
- Type:
str
- analyte_category
Category of the analyte.
- Type:
str
- raw_data_obj_type
Type of the raw data object.
- Type:
str
- raw_data_obj_desc
Description of the raw data object.
- Type:
str
- workflow_analysis_name
Name of the workflow analysis.
- Type:
str
- workflow_description
Description of the workflow.
- Type:
str
- workflow_git_url
URL of the workflow’s Git repository.
- Type:
str
- workflow_version
Version of the workflow.
- Type:
str
- workflow_category
Category of the workflow.
- Type:
str
- processed_data_category
Category of the processed data.
- Type:
str
- processed_data_object_type
Type of the processed data object.
- Type:
str
- processed_data_object_description
Description of the processed data object.
- Type:
str
- analyte_category: str = 'metabolome'
- create_workflow_metadata(row: dict[str, str]) GCMSMetabWorkflowMetadata [source]
Create a GCMSMetabWorkflowMetadata object from a dictionary of workflow metadata.
- Parameters:
row (dict[str, str]) – Dictionary containing metadata for a workflow. This is typically a row from the input metadata CSV file.
- Returns:
A GCMSMetabWorkflowMetadata object populated with data from the input dictionary.
- Return type:
GCMSMetabWorkflowMetadata
Notes
The input dictionary is expected to contain the following keys: ‘Processed Data Directory’, ‘Raw Data File’, ‘Raw Data Object Alt Id’, ‘mass spec configuration name’, ‘lc config name’, ‘instrument used’, ‘instrument analysis start date’, ‘instrument analysis end date’, ‘execution resource’.
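Because create_workflow_metadata expects a fixed set of keys, a row can be checked before it is passed in. The sketch below is illustrative only: the helper name `missing_workflow_keys` is hypothetical and is not part of the package. The key names are copied from the note above.

```python
from typing import Dict, List

# Keys listed in the Notes for create_workflow_metadata.
REQUIRED_KEYS = (
    "Processed Data Directory",
    "Raw Data File",
    "Raw Data Object Alt Id",
    "mass spec configuration name",
    "lc config name",
    "instrument used",
    "instrument analysis start date",
    "instrument analysis end date",
    "execution resource",
)


def missing_workflow_keys(row: Dict[str, str]) -> List[str]:
    """Hypothetical helper: return the documented keys absent from a row."""
    return [key for key in REQUIRED_KEYS if key not in row]
```

An empty result means the row carries every key named in the note; it does not validate the values themselves.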
- generate_calibration(calibration_object: dict, CLIENT_ID: str, CLIENT_SECRET: str, fames: bool = True, internal: bool = False) CalibrationInformation [source]
Generate a CalibrationInformation object for the NMDC Database.
- Parameters:
calibration_object (dict) – The calibration data object.
CLIENT_ID (str) – The client ID for the NMDC API.
CLIENT_SECRET (str) – The client secret for the NMDC API.
fames (bool, optional) – Whether the calibration is for FAMES. Default is True.
internal (bool, optional) – Whether the calibration is internal. Default is False.
- Returns:
A CalibrationInformation object for the NMDC Database.
- Return type:
nmdc.CalibrationInformation
Notes
This method generates a CalibrationInformation object based on the calibration data object and the calibration type.
- Raises:
ValueError – If the calibration type is not supported.
- generate_calibration_id(metadata_df: DataFrame, nmdc_database_inst: Database, CLIENT_ID: str, CLIENT_SECRET: str) None [source]
Generate calibration information and data objects for each calibration file.
- Parameters:
metadata_df (pd.DataFrame) – The metadata DataFrame.
nmdc_database_inst (nmdc.Database) – The NMDC Database instance.
CLIENT_ID (str) – The client ID for the NMDC API.
CLIENT_SECRET (str) – The client secret for the NMDC API.
- Return type:
None
- generate_metab_identifications(processed_data_file: str) List[MetaboliteIdentification] [source]
Generate MetaboliteIdentification objects from processed data file.
- Parameters:
processed_data_file (str) – Path to the processed data file.
- Returns:
List of MetaboliteIdentification objects generated from the processed data file.
- Return type:
List[nmdc.MetaboliteIdentification]
Notes
This method reads in the processed data file and generates MetaboliteIdentification objects, pulling out the best hit for each peak based on the highest “Similarity Score”.
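The best-hit selection described in the note can be sketched as follows. This is a simplified illustration, not the method's actual code: the function name `best_hits` and the "Peak Index" column used to group rows are assumptions, and the real processed data file may identify peaks differently. Only the "Similarity Score" column name comes from the note.

```python
from typing import Dict, Iterable, List


def best_hits(
    rows: Iterable[Dict[str, str]], peak_key: str = "Peak Index"
) -> List[Dict[str, str]]:
    """Keep the row with the highest "Similarity Score" for each peak.

    Assumption: `peak_key` names the column that identifies a peak.
    """
    best: Dict[str, Dict[str, str]] = {}
    for row in rows:
        peak = row[peak_key]
        current = best.get(peak)
        # Scores are compared numerically; ties keep the first row seen.
        if current is None or float(row["Similarity Score"]) > float(
            current["Similarity Score"]
        ):
            best[peak] = row
    return list(best.values())
```

The rows could come from `csv.DictReader`, which yields one dictionary per line of the processed data file.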
- mass_spec_desc: str = 'Generation of mass spectrometry data by GC/MS for the analysis of metabolites.'
- mass_spec_eluent_intro: str = 'gas_chromatography'
- processed_data_category: str = 'processed_data'
- processed_data_object_description: str = 'Metabolomics annotations as a result of a GC/MS metabolomics workflow activity.'
- processed_data_object_type: str = 'GC-MS Metabolomics Results'
- raw_data_obj_desc: str = 'GC/MS low resolution raw data for metabolomics data acquisition.'
- raw_data_obj_type: str = 'GC-MS Raw Data'
- rerun() None [source]
Execute a rerun of the metadata generation process for GC/MS metabolomics data.
This method performs the following steps:
1. Initialize an NMDC Database instance.
2. Load and process metadata to create NMDC objects.
3. Generate Metabolomics Analysis and Processed Data objects.
4. Update outputs for the Metabolomics Analysis object.
5. Append generated objects to the NMDC Database.
6. Dump the NMDC Database to a JSON file.
7. Validate the JSON file using the NMDC API.
- Return type:
None
- Raises:
FileNotFoundError – If the metadata file is not found.
Notes
This method uses tqdm to display progress bars for the processing of calibration information and mass spectrometry metadata.
- run() None [source]
Execute the metadata generation process for GC/MS metabolomics data.
This method performs the following steps:
1. Initialize an NMDC Database instance.
2. Generate calibration information and data objects for each calibration file.
3. Load and process metadata to create NMDC objects.
4. Generate Mass Spectrometry (including metabolite identifications), Raw Data, Metabolomics Analysis, and Processed Data objects.
5. Update outputs for Mass Spectrometry and Metabolomics Analysis objects.
6. Append generated objects to the NMDC Database.
7. Dump the NMDC Database to a JSON file.
8. Validate the JSON file using the NMDC API.
- Return type:
None
- Raises:
ValueError – If the calibration standard is not supported.
Notes
This method uses tqdm to display progress bars for the processing of calibration information and mass spectrometry metadata.
- unique_columns: List[str] = ['raw_data_file', 'processed_data_file']
- workflow_analysis_name: str = 'GC/MS Metabolomics analysis'
- workflow_category: str = 'gc_ms_metabolomics'
- workflow_description: str = 'Analysis of raw mass spectrometry data for the annotation of metabolites.'
- workflow_git_url: str = 'https://github.com/microbiomedata/metaMS/wdl/metaMS_gcms.wdl'
- workflow_version: str = '3.0.0'
NOM Metadata Generator Subclass
- class src.nom_metadata_generator.NOMMetadataGenerator(metadata_file: str, database_dump_json_path: str, raw_data_url: str, process_data_url: str, minting_config_creds: str = None)[source]
Bases:
NMDCMetadataGenerator
A class for generating NMDC metadata objects using provided metadata files and configuration for Natural Organic Matter (NOM) data.
- Parameters:
metadata_file (str) – Path to the input CSV metadata file.
database_dump_json_path (str) – Path where the output database dump JSON file will be saved.
raw_data_url (str) – Base URL for the raw data files.
process_data_url (str) – Base URL for the processed data files.
minting_config_creds (str, optional) – Path to the configuration file containing the client ID and client secret for minting NMDC IDs. It can also include the bio ontology API key if generating biosample ids is needed. If not provided, the CLIENT_ID, CLIENT_SECRET, and BIO_API_KEY environment variables will be used.
- raw_data_object_type
The type of the raw data object.
- Type:
str
- processed_data_object_type
The type of the processed data object.
- Type:
str
- processed_data_category
The category of the processed data.
- Type:
str
- execution_resource
The execution resource for the workflow.
- Type:
str
- analyte_category
The category of the analyte.
- Type:
str
- workflow_analysis_name
The name of the workflow analysis.
- Type:
str
- workflow_description
The description of the workflow.
- Type:
str
- workflow_param_data_category
The category of the workflow parameter data.
- Type:
str
- workflow_param_data_object_type
The type of the workflow parameter data object.
- Type:
str
- unique_columns
List of unique columns in the metadata file.
- Type:
list[str]
- mass_spec_desc
The description of the mass spectrometry data.
- Type:
str
- mass_spec_eluent_intro
The eluent introduction category for mass spectrometry.
- Type:
str
- processing_institution
The institution responsible for processing the data.
- Type:
str
- workflow_git_url
The URL of the workflow Git repository.
- Type:
str
- workflow_version
The version of the workflow.
- Type:
str
- analyte_category: str = 'nom'
- execution_resource: str = 'EMSL-RZR'
- generate_nom_analysis(file_path: Path, raw_data_id: str, data_gen_id: str, processed_data_id: str, CLIENT_ID: str, CLIENT_SECRET: str, calibration_id: str = None, incremented_id: str = None) NomAnalysis [source]
Generate a NOM analysis object from the provided file information.
- Parameters:
file_path (Path) – The file path of the NOM analysis data file.
raw_data_id (str) – The ID of the raw data associated with the analysis.
data_gen_id (str) – The ID of the data generation process that informed this analysis.
processed_data_id (str) – The ID of the processed data resulting from this analysis.
CLIENT_ID (str) – The client ID for the NMDC API.
CLIENT_SECRET (str) – The client secret for the NMDC API.
calibration_id (str, optional) – The ID of the calibration object used in the analysis. If None, no calibration is used.
incremented_id (str, optional) – The incremented ID for the NOM analysis. If None, a new ID will be minted.
- Returns:
The generated NOM analysis object.
- Return type:
nmdc.NomAnalysis
- get_calibration_id(calibration_path: str) str [source]
Get the calibration ID from the NMDC API using the md5 checksum of the calibration file.
- Parameters:
calibration_path (str) – The file path of the calibration file.
- Returns:
The calibration ID if found, otherwise None.
- Return type:
str
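The lookup above is keyed on the md5 checksum of the calibration file. As a minimal sketch of that first step only, the helper below computes the checksum with the standard library. The function name `md5_checksum` is hypothetical, and the NMDC API request that follows is not shown here because its details are not given in this documentation.

```python
import hashlib


def md5_checksum(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```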
- mass_spec_desc: str = 'ultra high resolution mass spectrum'
- mass_spec_eluent_intro: str = 'direct_infusion_autosampler'
- processed_data_category: str = 'processed_data'
- processed_data_object_type: str = 'FT ICR-MS Analysis Results'
- processing_institution: str = 'EMSL'
- raw_data_object_type: str = 'Direct Infusion FT ICR-MS Raw Data'
- rerun()[source]
Execute a rerun of the metadata generation process.
This method processes the metadata file, generates biosamples (if needed) and metadata, and manages the workflow for generating NOM analysis data.
Assumes that the raw data for NOM are stored on MinIO and that the raw data object URL field is populated.
- run()[source]
Execute the metadata generation process.
This method processes the metadata file, generates biosamples (if needed) and metadata, and manages the workflow for generating NOM analysis data.
- unique_columns: list[str] = ['raw_data_file', 'processed_data_directory']
- workflow_analysis_name: str = 'NOM Analysis'
- workflow_description: str = 'Natural Organic Matter analysis of raw mass spectrometry data.'
- workflow_git_url: str = 'https://github.com/microbiomedata/enviroMS'
- workflow_param_data_category: str = 'workflow_parameter_data'
- workflow_param_data_object_type: str = 'Analysis Tool Parameter File'
- workflow_version: str = '4.3.1'