Data Exploration and Visualization of NOM Data in NMDC (FT ICR-MS)¶

This notebook identifies natural organic matter (NOM) data sets in the National Microbiome Data Collaborative (NMDC), filters those datasets based on quality control metrics, and analyzes the molecular composition of the chosen datasets via heatmaps and Van Krevelen plots. This notebook uses the nmdc_api_utilities package (as of March 2025) to do this exploration. It involves using nmdc_api_utilites objects to make NMDC API requests easier, and using utility function to help us with data processing tasks. More information about the package can be found here

In [3]:
import requests
from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sys
from matplotlib.patches import Patch
import nmdc_api_utilities

Define a function to split a list into chunks¶

Since we will need to use a list of ids to query a new collection in the API, we need to limit the number of ids we put in a query. This function splits a list into chunks of 100. Note that the chunk_size has a default of 100, but can be adjusted.

In [4]:
# Define a function to split ids into chunks
def split_list(input_list, chunk_size=100):
    result = []
    
    for i in range(0, len(input_list), chunk_size):
        result.append(input_list[i:i + chunk_size])
        
    return result

Define a function to get a list of ids from initial results¶

In order to use the identifiers retrieved from an initial API request in another API request, this function is defined to take the initial request results and use the id_name key from the results to create a list of all the ids. The input is the initial result list and the name of the id field.

In [5]:
def get_id_list(result_list: list, id_name: str):
    id_list = []
    for item in result_list:
        if type(item[id_name]) == str:
            id_list.append(item[id_name])
        elif type(item[id_name]) == list:
            for another_item in item[id_name]:
                id_list.append(another_item)

    return id_list

Gather the IDs for processed NOM results in NMDC by filtering for data objects of type "Direct Infusion FT ICR-MS"¶

Use the nmdc_api_utilies DataObjectSearch to gather processed NOM results. The function get_record_by_attribute gets record by input attribute name and value. See the documentation for more details here.

We pass in data_object_type using a keyword search of “Direct Infusion FT ICR-MS Analysis Results”. Extract the fields id (necessary for traversing the NMDC schema), url (necessary for pulling data) and md5_checksum (used to check uniqueness of data set). The fucntion will create the filter and API request from these paramters.

In [6]:
from nmdc_api_utilities.data_object_search import DataObjectSearch
dos_client = DataObjectSearch(env=ENV)
from nmdc_api_utilities.data_processing import DataProcessing
dp_client = DataProcessing()
processed_nom = dos_client.get_record_by_attribute(attribute_name='data_object_type', attribute_value='Direct Infusion FT-ICR MS Analysis Results', max_page_size=100, fields='id,md5_checksum,url', all_pages=True)
# clarify names
for dataobject in processed_nom:
    dataobject["processed_nom_id"] = dataobject.pop("id")
    dataobject["processed_nom_md5_checksum"] = dataobject.pop("md5_checksum")
    dataobject["processed_nom_url"] = dataobject.pop("url")

# convert to df
processed_nom_df = dp_client.convert_to_df(processed_nom)
processed_nom_df
Out[6]:
processed_nom_id processed_nom_md5_checksum processed_nom_url
0 nmdc:dobj-11-002nyy42 5cd810390dc64616e28579161fe15029 https://nmdcdemo.emsl.pnnl.gov/nom/blanchard_1...
1 nmdc:dobj-11-003x7710 0391d9ff37a1926d5cf0eaec126ad4ff https://nmdcdemo.emsl.pnnl.gov/nom/blanchard/r...
2 nmdc:dobj-11-00dewm52 2a532dca15798e470103ebd752a0937f https://nmdcdemo.emsl.pnnl.gov/nom/1000soils/r...
3 nmdc:dobj-11-00hx2g55 774931c22fb4881bf5258c9d203738ac https://nmdcdemo.emsl.pnnl.gov/nom/brodie_11_d...
4 nmdc:dobj-11-00wm3313 3ce562ac512457ea54bdda05a4f01ede https://nmdcdemo.emsl.pnnl.gov/nom/1000soils/r...
... ... ... ...
5450 nmdc:dobj-13-zrp1qw41 98b97b78fff542b66e72f4b3f792d80f https://nmdcdemo.emsl.pnnl.gov/nom/results/SBR...
5451 nmdc:dobj-13-zsqpnm92 3e9e19910edb209d211d9f915e36b8cb https://nmdcdemo.emsl.pnnl.gov/nom/results/Bro...
5452 nmdc:dobj-13-zvnmsp76 aec0521d6a36a440e41052f8eadc0d1d https://nmdcdemo.emsl.pnnl.gov/nom/results/Ung...
5453 nmdc:dobj-13-zvzx2462 9f0d52cc46d247b8d2ba12d5842b9fb6 https://nmdcdemo.emsl.pnnl.gov/nom/results/Bro...
5454 nmdc:dobj-13-zye5fe51 8ddaeeffe93db9c4258a03e881d329cf https://nmdcdemo.emsl.pnnl.gov/nom/results/Bro...

5455 rows × 3 columns

Continue traversing the NMDC schema by using the list of identifiers from the previous API call to query the next collection¶

Find the analysis records that produced these processed nom object IDs by matching object processed_nom_id to the has_output slot with a filter and the WorkflowExecutionSearch object. Also extract the has_input slot as it will be used for the next traversal, grabbing the raw data objects that are used as input to the nom analysis records.

In [7]:
from nmdc_api_utilities.workflow_execution_search import WorkflowExecutionSearch
# since we are querying the WorkflowExecution collection, we need to create an instance of it
we_client = WorkflowExecutionSearch(env=ENV)
# use utility function to get a list of the ids from processed_nom
result_ids = get_id_list(processed_nom, "processed_nom_id")
# create a list of lists of ids to query the WorkflowExecutionSearch object
chunked_list = split_list(result_ids)
analysis_dataobj = []
# loop through the chunked list of ids
for chunk in chunked_list:
    # create the filter - query the WorkflowExecutionSearch object looking for data objects where the has_output field is in the chunk of ids from processed_nom
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"has_output": {{"$in": {filter_list}}}}}'
    # get the results
    analysis_dataobj += we_client.get_record_by_filter(filter=filter, fields="id,has_input,has_output", max_page_size=100, all_pages=True)

# clarify names
for dataobject in analysis_dataobj:
    dataobject["analysis_id"] = dataobject.pop("id")
    dataobject["analysis_has_input"] = dataobject.pop("has_input")
    dataobject["analysis_has_output"] = dataobject.pop("has_output")

# convert to data frame
analysis_dataobj_df = dp_client.convert_to_df(analysis_dataobj)
analysis_dataobj_df
Out[7]:
analysis_id analysis_has_input analysis_has_output
0 nmdc:wfnom-11-twqqjq28.2 [nmdc:dobj-11-vq0zcy81, nmdc:dobj-11-6qh8ae86] [nmdc:dobj-11-mfvjhy66, nmdc:dobj-11-002nyy42]
1 nmdc:wfnom-11-dv3wck24.1 [nmdc:dobj-11-2ysghy98, nmdc:dobj-11-bmeg5b74] [nmdc:dobj-11-003x7710]
2 nmdc:wfnom-11-0mqv1c63.1 [nmdc:dobj-11-3hfzr472] [nmdc:dobj-11-00dewm52]
3 nmdc:wfnom-13-ft8pbk34.2 [nmdc:dobj-11-8dpvb326, nmdc:dobj-13-ga1c7z75] [nmdc:dobj-11-q1h3p187, nmdc:dobj-11-00hx2g55]
4 nmdc:wfnom-11-twkd5a03.1 [nmdc:dobj-11-2mk4qf09] [nmdc:dobj-11-00wm3313]
... ... ... ...
5450 nmdc:wfnom-13-08q56295.1 [nmdc:dobj-13-vg0tgv60] [nmdc:dobj-13-zrp1qw41]
5451 nmdc:wfnom-13-yy5yqx31.1 [nmdc:dobj-13-w15zk074] [nmdc:dobj-13-zsqpnm92]
5452 nmdc:wfnom-13-4xrsd836.1 [nmdc:dobj-13-3rmgcm64] [nmdc:dobj-13-zvnmsp76]
5453 nmdc:wfnom-13-h0r53g59.1 [nmdc:dobj-13-m3zkxg02] [nmdc:dobj-13-zvzx2462]
5454 nmdc:wfnom-13-4vmrb525.1 [nmdc:dobj-13-60z8rt70] [nmdc:dobj-13-zye5fe51]

5455 rows × 3 columns

Find the raw data objects used as input for these analysis records by matching the analysis record's has_input slot to the id slot in the collection data_object_set. Create a DataObjectSearch object to do so.

Workflows can take multiple inputs (eg. configuration or parameter files) in addition to the raw data, so we will filter the results to only keep raw data object IDs.

In [8]:
# use utility function to get a list of the ids from raw_dataobj
result_ids = get_id_list(analysis_dataobj, "analysis_has_input")
# create a list of lists of ids to query the DataObjectSearch object
chunked_list = split_list(result_ids)
raw_dataobj = []
# loop through the chunked list of ids
for chunk in chunked_list:
    # create the filter - query the DataObjectSearch object looking for data objects where the id field is in the chunk of ids from analysis_dataobj
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"id": {{"$in": {filter_list}}}}}'
    # get the results
    raw_dataobj += dos_client.get_record_by_filter(filter=filter, fields="id,name,data_object_type", max_page_size=100, all_pages=True)

# clarify names
for dataobject in raw_dataobj:
    dataobject["raw_id"] = dataobject.pop("id")
    dataobject["raw_name"] = dataobject.pop("name")

# Filter out parameter files (leave NAs in case raw data files are missing a label)
param_dataobj = [file for file in raw_dataobj if 'data_object_type' in file and file['data_object_type'] == 'Analysis Tool Parameter File']
raw_dataobj = [file for file in raw_dataobj if file not in param_dataobj]

raw_df = dp_client.convert_to_df(raw_dataobj)

raw_df
Out[8]:
data_object_type raw_id raw_name
0 Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-18tyaw19 Lybrand_FT_51_C_30Aug19_Alder_Infuse_p05_1_01_...
1 Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-2mk4qf09 1000s_DELA_FTMS_SPE_TOP_2_29Oct22_Mag_300SA_p0...
2 Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-2ysghy98 Blanchard_CHCl3Ext_H-32-AB-M_19Feb18_Alder_Inf...
3 Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-3771q326 1000S_PETF_FTMS_SPE_TOP_1_run2_Fir_22Apr22_300...
4 Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-3gj5k979 1000s_ISNC_FTMS_SPE_TOP_2_01Nov22_Mag_300SA_p0...
... ... ... ...
5433 Direct Infusion FT ICR-MS Raw Data nmdc:dobj-13-yqb8xh13 output: Brodie_186_MeOH_R1_19Mar19_HESI_Neg
5434 Direct Infusion FT ICR-MS Raw Data nmdc:dobj-13-yscv7j32 output: Brodie_203_R2_CHCl3_05May19_Alder_Infu...
5435 Direct Infusion FT ICR-MS Raw Data nmdc:dobj-13-zazrqk87 output: Brodie_185_w_r1_01Feb19_HESI_neg
5436 Direct Infusion FT ICR-MS Raw Data nmdc:dobj-13-zjrg8w43 output: Brodie_184_H2O_11Mar19_R1_HESI_Neg
5437 Direct Infusion FT ICR-MS Raw Data nmdc:dobj-13-zzzyae97 output: SBR_FC_N1_00-10_H2Oext_13Oct15_Leopard...

5438 rows × 3 columns

Find the data generation records capturing mass spectrometry information for these raw data objects by matching the data object's id slot to the has_output slot in the collection data_generation_set. Create a DataGenerationSearch object to do so. Once again extract the has_input slot as it will be used for the next traversal, grabbing the biosample data objects that are used as input to the data generation records.

In [9]:
from nmdc_api_utilities.data_generation_search import DataGenerationSearch
dgs_client = DataGenerationSearch(env=ENV)
# use utility function to get a list of the ids from raw_dataobj
result_ids = get_id_list(raw_dataobj, "raw_id")
# create a list of lists of ids to query the DataGenerationSearch object
chunked_list = split_list(result_ids)
datageneration_dataobj = []
# loop through the chunked list of ids
for chunk in chunked_list:
    # create the filter - query the DataGenerationSearch object looking for data objects that have the raw_id in the has_output field
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"has_output": {{"$in": {filter_list}}}}}'
    # get the results
    datageneration_dataobj += dgs_client.get_record_by_filter(filter=filter, fields="id,has_input,has_output", max_page_size=100, all_pages=True)

# clarify names
for dataobject in datageneration_dataobj:
    dataobject["datageneration_id"] = dataobject.pop("id")
    dataobject["datageneration_has_output"] = dataobject.pop("has_output")
    dataobject["datageneration_has_input"] = dataobject.pop("has_input")
# convert to data frame
datageneration_dataobj_df = dp_client.convert_to_df(datageneration_dataobj)
datageneration_dataobj_df
Out[9]:
datageneration_id datageneration_has_output datageneration_has_input
0 nmdc:omprc-11-adjx8k29 [nmdc:dobj-11-04embv91] [nmdc:procsm-11-xsfbpz40]
1 nmdc:omprc-11-rj3bqn04 [nmdc:dobj-11-09p17z03] [nmdc:procsm-11-ckacrz04]
2 nmdc:omprc-11-tddeh411 [nmdc:dobj-11-0df8t083] [nmdc:procsm-11-fn9kvs80]
3 nmdc:omprc-11-ksv9m073 [nmdc:dobj-11-10madc26] [nmdc:procsm-11-rjan1g16]
4 nmdc:omprc-11-9vatav94 [nmdc:dobj-11-18tyaw19] [nmdc:procsm-11-j9avd879]
... ... ... ...
5415 nmdc:omprc-11-xpccey56 [nmdc:dobj-13-yqb8xh13] [nmdc:procsm-11-x1w09v45]
5416 nmdc:omprc-11-brkhe637 [nmdc:dobj-13-yscv7j32] [nmdc:procsm-11-rex2jg38]
5417 nmdc:omprc-11-5y5txf92 [nmdc:dobj-13-zazrqk87] [nmdc:procsm-11-e8jnyn97]
5418 nmdc:omprc-11-077nww93 [nmdc:dobj-13-zjrg8w43] [nmdc:procsm-11-s1ypy935]
5419 nmdc:omprc-13-xemx6b61 [nmdc:dobj-13-zzzyae97] [nmdc:procsm-11-t7qjg604]

5420 rows × 3 columns

Find the inputs for these data generation records based on whether they are biosamples or processed samples. This can vary depending on if a sample went through any material processing prior to mass spectrometry.

In [10]:
datageneration_dataobj_df['datageneration_has_input'].explode().str.extract('nmdc:(procsm|bsm)')[0].value_counts(dropna=False)
Out[10]:
0
procsm    5131
bsm        289
Name: count, dtype: int64

Find the biosample from which each processed sample originates by creating a MaterialProcessingSet and iterating through material processing steps until the has_input value is a biosample.

In [11]:
from nmdc_api_utilities.material_processing_search import MaterialProcessingSearch
mp_client = MaterialProcessingSearch()

# use utility function to get a unique list of the ids from datageneration_dataobj
result_ids = list(set(get_id_list(datageneration_dataobj, "datageneration_has_input")))

# separate ids into processedsample_ids and biosample_ids
processedsamples = [item for item in result_ids if 'nmdc:procsm' in item]
biosamples = [item for item in result_ids if 'nmdc:bsm' in item]

# dataframe of intermediate processed samples to track
tracking_ps = pd.DataFrame({'intermediate_ps':processedsamples,'dg_input':processedsamples})

# dataframe of tracked biosample inputs mapped to the input given to data generation
tracked = [pd.DataFrame({'biosample_id':biosamples,'dg_input':biosamples})]


while tracking_ps.shape[0]>0:

    # get api calls for these processed samples
    chunked_list = split_list(tracking_ps['intermediate_ps'].tolist())
    mp_dataobj = []
    for chunk in chunked_list:
        filter_list = dp_client._string_mongo_list(chunk)
        filter = f'{{"has_output": {{"$in": {filter_list}}}}}'
        mp_dataobj += mp_client.get_record_by_filter(filter=filter, fields='has_input,has_output', max_page_size=100, all_pages=True)

    mp_dataobj = dp_client.convert_to_df(mp_dataobj).explode('has_input').explode('has_output').merge(tracking_ps,right_on='intermediate_ps',left_on='has_output').drop(['has_output','id'],axis=1)

    # separate steps with biosample vs processed sample inputs
    biosample_input = mp_dataobj[mp_dataobj['has_input'].str.contains('nmdc:bsm',na=False)].drop('intermediate_ps',axis=1).rename(columns={'has_input':'biosample_id'})
    processedsample_input = mp_dataobj[mp_dataobj['has_input'].str.contains('nmdc:procsm',na=False)].drop('intermediate_ps',axis=1).rename(columns={'has_input':'intermediate_ps'}) # new ps input is now intermediate_ps

    # update tracking_ps
    tracking_ps = processedsample_input

    # update tracked to those steps with a biosample input
    if not biosample_input.empty:
        tracked.append(biosample_input)

datageneration_inputupdate_df = pd.concat(tracked, ignore_index=True).merge(datageneration_dataobj_df.explode('datageneration_has_input').explode('datageneration_has_output'),left_on='dg_input',right_on='datageneration_has_input').\
    drop(['dg_input','datageneration_has_input'],axis=1).drop_duplicates()
datageneration_inputupdate_df
Out[11]:
biosample_id datageneration_id datageneration_has_output
0 nmdc:bsm-11-ntfp9891 nmdc:dgms-11-5w6rk541 nmdc:dobj-11-jj1tsc78
1 nmdc:bsm-11-pbt9v005 nmdc:dgms-11-1ckd5n11 nmdc:dobj-11-2jyp5563
2 nmdc:bsm-11-c26daa78 nmdc:dgms-11-4tbhwf96 nmdc:dobj-11-v5rw5j82
3 nmdc:bsm-11-8021r407 nmdc:dgms-11-wrb2dn54 nmdc:dobj-11-gwdsc331
4 nmdc:bsm-11-8021r407 nmdc:dgms-11-jpg2s763 nmdc:dobj-11-wzpa4n90
... ... ... ...
7481 nmdc:bsm-11-br03yr77 nmdc:omprc-11-9cm29n56 nmdc:dobj-11-nyhzf057
7484 nmdc:bsm-11-br03yr77 nmdc:omprc-11-x2j9vv84 nmdc:dobj-11-pjzvrm80
7486 nmdc:bsm-11-yx0xgn55 nmdc:omprc-11-nr1k1h16 nmdc:dobj-11-tx0ajr95
7487 nmdc:bsm-11-yx0xgn55 nmdc:omprc-11-0z8p2r24 nmdc:dobj-11-sdbbsv31
7489 nmdc:bsm-11-yx0xgn55 nmdc:omprc-11-jm27g388 nmdc:dobj-11-vfgbgb25

2872 rows × 3 columns

Find the biosample data objects used as input for these data generation records by matching the processing record's has_input slot to the id slot in the collection biosample_set. Create a BiosampleSearch object to do so. Query all fields in this API call by excluding the fields parameter and utilize informative columns to group biosamples into a type.

In [12]:
from nmdc_api_utilities.biosample_search import BiosampleSearch
bs_client = BiosampleSearch(env=ENV)

# get a list of the biosample ids from datageneration_inputupdate_df
result_ids = datageneration_inputupdate_df["biosample_id"].unique().tolist()
# create a list of lists of ids to query the BiosampleSearch object
chunked_list = split_list(result_ids)
biosample_dataobj = []
# loop through the chunked list of ids
for chunk in chunked_list:
    # create the filter - query the BiosampleSearch object looking for data objects where the id is in the chunk of ids from datageneration_dataobj
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"id": {{"$in": {filter_list}}}}}'
    # get the results
    biosample_dataobj += bs_client.get_record_by_filter(filter=filter, max_page_size=100, all_pages=True)

# clarify names
for dataobject in biosample_dataobj:
    dataobject["biosample_id"] = dataobject.pop("id")

# convert to data frame
biosample_dataobj_df = dp_client.convert_to_df(biosample_dataobj)

biosample_dataobj_df
Out[12]:
type name description associated_studies env_broad_scale env_local_scale env_medium ecosystem ecosystem_category ecosystem_type ... insdc_biosample_identifiers provenance_metadata carb_nitro_ratio experimental_factor tot_org_carb other_treatment gaseous_environment air_temp_regm soil_horizon sample_link
0 nmdc:Biosample WHONDRS_S19S_0093_SED_FIELD_U River sediment microbial communities from Wood... [nmdc:sty-11-5tgfr349] {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... Environmental Aquatic Freshwater ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 nmdc:Biosample WHONDRS_S19S_0031_SED_FIELD_M River sediment microbial communities from McRa... [nmdc:sty-11-5tgfr349] {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... Environmental Aquatic Freshwater ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 nmdc:Biosample WHONDRS_S19S_0087_SED_FIELD_M River sediment microbial communities from Dobs... [nmdc:sty-11-5tgfr349] {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... Environmental Aquatic Freshwater ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 nmdc:Biosample Bulk soil microbial communities from the East ... Bulk soil microbial communities from the East ... [nmdc:sty-11-dcqce727] {'has_raw_value': 'ENVO:01000177', 'term': {'i... {'has_raw_value': 'ENVO:00000292', 'term': {'i... {'has_raw_value': 'ENVO:00005802', 'term': {'i... Environmental Terrestrial Soil ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 nmdc:Biosample WHONDRS_S19S_0003_SED_FIELD_D River sediment microbial communities from Thom... [nmdc:sty-11-5tgfr349] {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... Environmental Aquatic Freshwater ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1141 nmdc:Biosample WHONDRS_S19S_0085_SW Freshwater microbial communities from Mores Cr... [nmdc:sty-11-5tgfr349] {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... Environmental Aquatic Freshwater ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1142 nmdc:Biosample WHONDRS_S19S_0063_SW Freshwater microbial communities from Lower Co... [nmdc:sty-11-5tgfr349] {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... Environmental Aquatic Freshwater ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1143 nmdc:Biosample WHONDRS_S19S_0078_SW Freshwater microbial communities from Little W... [nmdc:sty-11-5tgfr349] {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... Environmental Aquatic Freshwater ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1144 nmdc:Biosample WHONDRS_S19S_0024_SED_FIELD_D River sediment microbial communities from Blue... [nmdc:sty-11-5tgfr349] {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... Environmental Aquatic Freshwater ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1145 nmdc:Biosample WHONDRS_S19S_0021_SW Freshwater microbial communities from Black Wa... [nmdc:sty-11-5tgfr349] {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... {'type': 'nmdc:ControlledIdentifiedTermValue',... Environmental Aquatic Freshwater ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

1146 rows × 53 columns

Assign a general type for each sample by parsing their ENVO IDs. This was done manually by searching ENVO ID's on the ontology search website.

In [13]:
biosample_dataobj_flat=pd.json_normalize(biosample_dataobj)
biosample_dataobj_flat_df=dp_client.convert_to_df(biosample_dataobj_flat)

biosample_dataobj_flat_df['sample_type']=""

biosample_dataobj_flat_df['env_medium.term.id'].drop_duplicates()

biosample_dataobj_flat_df.loc[biosample_dataobj_flat_df['env_medium.term.name'].str.contains('soil', na = False),'sample_type'] = 'soil'

biosample_dataobj_flat_df.loc[biosample_dataobj_flat_df['env_medium.term.id'].isin(["ENVO:00002261","ENVO:00001998","ENVO:00002259",
                                                                                    "ENVO:01001616","ENVO:00005750","ENVO:00005761",
                                                                                    "ENVO:00005760","ENVO:00005773","ENVO:00005802",
                                                                                    "ENVO:00005774"]),'sample_type'] = 'soil'
biosample_dataobj_flat_df.loc[biosample_dataobj_flat_df['env_medium.term.id'].isin(["ENVO:00002042"]),'sample_type'] = 'water'
biosample_dataobj_flat_df.loc[biosample_dataobj_flat_df['env_medium.term.id'].isin(["ENVO:00002007"]),'sample_type'] = 'sediment'
biosample_dataobj_flat_df.loc[biosample_dataobj_flat_df['env_medium.term.id'].isin(["ENVO:01000017"]),'sample_type'] = 'sand'

#filter to desired metadata columns
biosample_dataobj_flat_df=biosample_dataobj_flat_df[['biosample_id','geo_loc_name.has_raw_value','env_medium.term.name','env_medium.term.id','sample_type']]

biosample_dataobj_flat_df
Out[13]:
biosample_id geo_loc_name.has_raw_value env_medium.term.name env_medium.term.id sample_type
0 nmdc:bsm-11-04jj9q59 UK: Staffordshire, Stafford sediment ENVO:00002007 sediment
1 nmdc:bsm-11-0db9vd55 USA: Oregon, Blue River sediment ENVO:00002007 sediment
2 nmdc:bsm-11-0ebbhn67 USA: Idaho, Boise sediment ENVO:00002007 sediment
3 nmdc:bsm-11-0htjhf95 USA: Colorado NaN ENVO:00005802 soil
4 nmdc:bsm-11-0p5q3s54 Canada: British Columbia, Kamloops sediment ENVO:00002007 sediment
... ... ... ... ... ...
1141 nmdc:bsm-11-xvtbj202 USA: Idaho, Idaho City surface water ENVO:00002042 water
1142 nmdc:bsm-11-yfm9g713 USA: Texas, Austin surface water ENVO:00002042 water
1143 nmdc:bsm-11-yx0xgn55 USA: Oregon, Tyee surface water ENVO:00002042 water
1144 nmdc:bsm-11-z198km94 USA: Oklahoma, Connerville sediment ENVO:00002007 sediment
1145 nmdc:bsm-11-zjscaw68 USA: Alabama, Demopolis surface water ENVO:00002042 water

1146 rows × 5 columns

Create final data frame of relevant metadata and NMDC schema information for each NOM processed data object¶

Create merged dataframe with results from schema traversal and metadata using the DataProcessing object created earlier.

In [14]:
#match all processed nom objects (via processed_nom_id) to analysis objects (via analysis_has_output) and expand lists has_input and has_output
processed_obj_to_analysis_df=dp_client.merge_df(processed_nom_df,analysis_dataobj_df,"processed_nom_id","analysis_has_output")

#match raw data objects (via raw_id) to all_analysis_df (via analysis_has_input)
processed_obj_to_raw_df=dp_client.merge_df(raw_df,processed_obj_to_analysis_df,"raw_id","analysis_has_input")

#match processed_obj_to_raw_df (via raw_id) to data generation objects (via datageneration_has_output) and expand lists has_input and has_output
processed_obj_to_datagen_df=dp_client.merge_df(processed_obj_to_raw_df,datageneration_inputupdate_df,"raw_id","datageneration_has_output")

#match biosample objects (via biosample_id) to processed_obj_to_datagen_df (via datageneration_has_input)
merged_df=dp_client.merge_df(biosample_dataobj_flat_df,processed_obj_to_datagen_df,"biosample_id","biosample_id")

merged_df
Out[14]:
biosample_id geo_loc_name.has_raw_value env_medium.term.name env_medium.term.id sample_type data_object_type raw_id raw_name processed_nom_id processed_nom_md5_checksum processed_nom_url analysis_id analysis_has_input analysis_has_output datageneration_id datageneration_has_output
0 nmdc:bsm-11-04jj9q59 UK: Staffordshire, Stafford sediment ENVO:00002007 sediment Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-xec92s10 WHONDRS_S19S_R35_14Sept2020_Alder_Infuse_p15_1... nmdc:dobj-11-13ae7d36 fb9f5d3e6ce66d358f5116e2bc7940c1 https://nmdcdemo.emsl.pnnl.gov/nom/whondrs_5tg... nmdc:wfnom-11-bg1ag945.1 nmdc:dobj-11-xec92s10 nmdc:dobj-11-13ae7d36 nmdc:dgms-11-6b2a7q11 nmdc:dobj-11-xec92s10
1 nmdc:bsm-11-0db9vd55 USA: Oregon, Blue River sediment ENVO:00002007 sediment Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-wknyhh44 WHONDRS_S19S_096_17Mar20_Alder_Infuse_P2_1_01_... nmdc:dobj-11-k6747326 eba7c845fc24f52edb6718b38de26982 https://nmdcdemo.emsl.pnnl.gov/nom/whondrs_5tg... nmdc:wfnom-11-9qchfy79.1 nmdc:dobj-11-wknyhh44 nmdc:dobj-11-k6747326 nmdc:dgms-11-9hv52m48 nmdc:dobj-11-wknyhh44
2 nmdc:bsm-11-0ebbhn67 USA: Idaho, Boise sediment ENVO:00002007 sediment Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-ez3mzs91 WHONDRS_S19S_R56_14Sept2020_Alder_Infuse_p15_1... nmdc:dobj-11-rs3dwk50 69b0f48f41a450cd74fdc66d14afe34e https://nmdcdemo.emsl.pnnl.gov/nom/whondrs_5tg... nmdc:wfnom-11-q181fp49.1 nmdc:dobj-11-ez3mzs91 nmdc:dobj-11-rs3dwk50 nmdc:dgms-11-wvx6vj56 nmdc:dobj-11-ez3mzs91
3 nmdc:bsm-11-0htjhf95 USA: Colorado NaN ENVO:00005802 soil Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-3xj75s41 Brodie_147_H2O_1_13Dec18_IATp15_1_01_38568.zip nmdc:dobj-11-kgb9cq70 3db4bd11f1b280268977f9b2f862ab70 https://nmdcdemo.emsl.pnnl.gov/nom/brodie_11_d... nmdc:wfnom-11-92t9dw88.1 nmdc:dobj-11-3xj75s41 nmdc:dobj-11-kgb9cq70 nmdc:dgms-11-mbhp8f53 nmdc:dobj-11-3xj75s41
4 nmdc:bsm-11-0htjhf95 USA: Colorado NaN ENVO:00005802 soil Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-r4eazd74 Brodie_147A_MeOH_15Oct18_IAT_p1_1_01_35914.zip nmdc:dobj-11-msee3h51 88c3b4e891eac5b6c08f50de5b6a9103 https://nmdcdemo.emsl.pnnl.gov/nom/brodie_11_d... nmdc:wfnom-11-3rk4g078.1 nmdc:dobj-11-r4eazd74 nmdc:dobj-11-msee3h51 nmdc:dgms-11-mpr71g34 nmdc:dobj-11-r4eazd74
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5450 nmdc:bsm-11-zjscaw68 USA: Alabama, Demopolis surface water ENVO:00002042 water Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-pyffj610 WHONDRS_S19S_0021_ICR_1_131_Alder_Inf_13Sept19... nmdc:dobj-11-vs9f3q31 e630b2ba8de1bda330779434206e30dd https://nmdcdemo.emsl.pnnl.gov/nom/grow/result... nmdc:wfnom-11-e3atqj14.1 nmdc:dobj-11-pyffj610 nmdc:dobj-11-vs9f3q31 nmdc:omprc-11-9w980y09 nmdc:dobj-11-pyffj610
5451 nmdc:bsm-11-zjscaw68 USA: Alabama, Demopolis surface water ENVO:00002042 water Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-ht23xr87 WHONDRS_S19S_0021_ICR_3_45_Alder_Inf_13Sept19_... nmdc:dobj-11-j64p3v67 32b591015816104bf8031284766fc2c7 https://nmdcdemo.emsl.pnnl.gov/nom/grow/result... nmdc:wfnom-11-rw4y8534.1 nmdc:dobj-11-ht23xr87 nmdc:dobj-11-j64p3v67 nmdc:omprc-11-28tfb697 nmdc:dobj-11-ht23xr87
5452 nmdc:bsm-11-zjscaw68 USA: Alabama, Demopolis surface water ENVO:00002042 water Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-ht23xr87 WHONDRS_S19S_0021_ICR_3_45_Alder_Inf_13Sept19_... nmdc:dobj-11-t47bn545 d9a5ef66791e1b434c0df5b6fa8a8f9b https://nmdcdemo.emsl.pnnl.gov/nom/whondrs_5tg... nmdc:wfnom-11-rw4y8534.2 nmdc:dobj-11-ht23xr87 nmdc:dobj-11-t47bn545 nmdc:omprc-11-28tfb697 nmdc:dobj-11-ht23xr87
5453 nmdc:bsm-11-zjscaw68 USA: Alabama, Demopolis surface water ENVO:00002042 water Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-5zhs5711 WHONDRS_S19S_0021_ICR_2_149_Alder_Inf_13Sept19... nmdc:dobj-11-r0r31h32 df2fbc6a84fecbb2cc0a32b5b52d5b8c https://nmdcdemo.emsl.pnnl.gov/nom/grow/result... nmdc:wfnom-11-hqdbev94.1 nmdc:dobj-11-5zhs5711 nmdc:dobj-11-r0r31h32 nmdc:omprc-11-kba8wx72 nmdc:dobj-11-5zhs5711
5454 nmdc:bsm-11-zjscaw68 USA: Alabama, Demopolis surface water ENVO:00002042 water Direct Infusion FT ICR-MS Raw Data nmdc:dobj-11-5zhs5711 WHONDRS_S19S_0021_ICR_2_149_Alder_Inf_13Sept19... nmdc:dobj-11-vh2gbc64 7acf844d494002dedf238cf2ea70b048 https://nmdcdemo.emsl.pnnl.gov/nom/whondrs_5tg... nmdc:wfnom-11-hqdbev94.2 nmdc:dobj-11-5zhs5711 nmdc:dobj-11-vh2gbc64 nmdc:omprc-11-kba8wx72 nmdc:dobj-11-5zhs5711

5455 rows × 16 columns

Use the md5_checksum to check that each row/processed NOM object has an associated url that is unique

In [15]:
#are there any md5_checksum values that occur more than once (i.e. are associated with more than one processed nom id)
len(merged_df[merged_df.duplicated('processed_nom_md5_checksum')]['processed_nom_md5_checksum'].unique())==0
Out[15]:
False

Clean up final dataframe, removing unneeded/intermediate identifier columns from schema traversal.

In [16]:
column_list = merged_df.columns.tolist()
columns_to_keep = ["biosample_id","processed_nom_id","sample_type","processed_nom_url"]
columns_to_remove = list(set(column_list).difference(columns_to_keep))

# Drop unnecessary columns
merged_df_cleaned = merged_df.drop(columns=columns_to_remove)

# remove duplicate rows when keeping only necessary columns (should remove none)
final_df=merged_df_cleaned.drop_duplicates(keep="first").sort_values(by='processed_nom_id').reset_index(drop=True)

final_df
Out[16]:
biosample_id sample_type processed_nom_id processed_nom_url
0 nmdc:bsm-11-nq2tfm26 soil nmdc:dobj-11-002nyy42 https://nmdcdemo.emsl.pnnl.gov/nom/blanchard_1...
1 nmdc:bsm-11-cwey4y35 soil nmdc:dobj-11-003x7710 https://nmdcdemo.emsl.pnnl.gov/nom/blanchard/r...
2 nmdc:bsm-11-sk8vv550 soil nmdc:dobj-11-00dewm52 https://nmdcdemo.emsl.pnnl.gov/nom/1000soils/r...
3 nmdc:bsm-11-qpwvgx11 soil nmdc:dobj-11-00hx2g55 https://nmdcdemo.emsl.pnnl.gov/nom/brodie_11_d...
4 nmdc:bsm-11-yzgc1096 soil nmdc:dobj-11-00wm3313 https://nmdcdemo.emsl.pnnl.gov/nom/1000soils/r...
... ... ... ... ...
5450 nmdc:bsm-13-tr7n0581 sand nmdc:dobj-13-zrp1qw41 https://nmdcdemo.emsl.pnnl.gov/nom/results/SBR...
5451 nmdc:bsm-11-hf7a3g98 soil nmdc:dobj-13-zsqpnm92 https://nmdcdemo.emsl.pnnl.gov/nom/results/Bro...
5452 nmdc:bsm-13-4pbbvj63 sand nmdc:dobj-13-zvnmsp76 https://nmdcdemo.emsl.pnnl.gov/nom/results/Ung...
5453 nmdc:bsm-11-mbnqn650 soil nmdc:dobj-13-zvzx2462 https://nmdcdemo.emsl.pnnl.gov/nom/results/Bro...
5454 nmdc:bsm-11-k05a2416 soil nmdc:dobj-13-zye5fe51 https://nmdcdemo.emsl.pnnl.gov/nom/results/Bro...

5455 rows × 4 columns

Randomly select datasets from each sample_type on which to calculate quality control statistics

In [17]:
dataset_counts = final_df[['processed_nom_id','sample_type']].drop_duplicates().value_counts('sample_type')
min_n = min(pd.DataFrame(dataset_counts)['count'])

#set sampling size to 50 or the smallest sample count, whichever value is smaller
if 50 < min_n:
    n = 50
else:
    n = min_n

#list the different types
list_type=dataset_counts.index.tolist()

#for each type, randomly sample n data sets and save them into list
random_subset=[]
for type in list_type:
    #each processed ID and sample type
    sample_type=final_df[['processed_nom_id','sample_type']].drop_duplicates()
    #filter to current sample type
    sample_type=sample_type[sample_type['sample_type']==type]
    #randomly sample n processed IDs in current sample type
    sample_type=sample_type.sample(n=n, random_state=2)
    #save
    random_subset.append(sample_type)

#resave list as dataframe
random_subset=pd.concat(random_subset).reset_index(drop=True)

#remerge rest of the data for the sampled data sets
final_df=random_subset.merge(final_df,on=['processed_nom_id','sample_type'],how="left")

final_df
Out[17]:
processed_nom_id sample_type biosample_id processed_nom_url
0 nmdc:dobj-11-25zsst16 soil nmdc:bsm-11-xgaz2a33 https://nmdcdemo.emsl.pnnl.gov/nom/1000soils/r...
1 nmdc:dobj-11-e4qryz20 soil nmdc:bsm-11-wy4w4f21 https://nmdcdemo.emsl.pnnl.gov/nom/grow/result...
2 nmdc:dobj-11-r53c1311 soil nmdc:bsm-11-y9gtp379 https://nmdcdemo.emsl.pnnl.gov/nom/grow/result...
3 nmdc:dobj-11-y34zyh53 soil nmdc:bsm-11-hf7a3g98 https://nmdcdemo.emsl.pnnl.gov/nom/brodie_11_d...
4 nmdc:dobj-11-5cqn4d45 soil nmdc:bsm-11-hf7a3g98 https://nmdcdemo.emsl.pnnl.gov/nom/brodie_11_d...
... ... ... ... ...
195 nmdc:dobj-13-6vb25j63 sand nmdc:bsm-13-rrg91187 https://nmdcdemo.emsl.pnnl.gov/nom/results/SBR...
196 nmdc:dobj-13-8dr0kz67 sand nmdc:bsm-13-z9wten89 https://nmdcdemo.emsl.pnnl.gov/nom/results/Ung...
197 nmdc:dobj-11-w9nkvy37 sand nmdc:bsm-13-7tkyk513 https://nmdcdemo.emsl.pnnl.gov/nom/stegen_11_a...
198 nmdc:dobj-13-zrp1qw41 sand nmdc:bsm-13-tr7n0581 https://nmdcdemo.emsl.pnnl.gov/nom/results/SBR...
199 nmdc:dobj-13-1hqwq310 sand nmdc:bsm-13-4aw9ra87 https://nmdcdemo.emsl.pnnl.gov/nom/results/SBR...

200 rows × 4 columns

Calculate quality control statistics on the sampled processed nom results (this can take a while to run)¶

Explore what the files associated with these NOM data objects look like.

In [18]:
#example file
url=final_df.iloc[0]["processed_nom_url"]

#pull data as csv using url
response = requests.get(url)
csv_data = StringIO(response.text)
csv_df = pd.read_csv(csv_data)
csv_data.close()
csv_df
Out[18]:
Index m/z Calibrated m/z Calculated m/z Peak Height Peak Area Resolving Power S/N Ion Charge m/z Error (ppm) ... Molecular Formula C H O N S 13C 18O 33S 34S
0 4 121.029452 121.029533 121.029503 5.869179e+06 398.937198 1.489880e+06 26.502687 -1 0.245327 ... C7 H6 O2 7.0 6.0 2.0 NaN NaN NaN NaN NaN NaN
1 8 127.040007 127.040089 127.040068 7.016208e+06 630.361015 1.419620e+06 31.682174 -1 0.168567 ... C6 H8 O3 6.0 8.0 3.0 NaN NaN NaN NaN NaN NaN
2 9 127.076389 127.076471 127.076453 4.888778e+06 446.012307 1.418984e+06 22.075617 -1 0.141783 ... C7 H12 O2 7.0 12.0 2.0 NaN NaN NaN NaN NaN NaN
3 10 128.035256 128.035338 128.035317 5.945552e+06 542.894244 1.408585e+06 26.847553 -1 0.164661 ... C5 H7 O3 N1 5.0 7.0 3.0 1.0 NaN NaN NaN NaN NaN
4 11 129.055659 129.055741 129.055718 5.196441e+07 5155.360460 1.746668e+06 234.648878 -1 0.180467 ... C6 H10 O3 6.0 10.0 3.0 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8280 8222 650.326685 650.326798 NaN 7.774036e+06 17156.811887 3.465519e+05 35.104198 -1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8281 8236 658.261950 658.262063 NaN 6.078099e+06 14393.541103 2.739325e+05 27.446079 -1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8282 8251 666.178294 666.178407 NaN 7.061803e+06 18903.737253 2.706773e+05 31.888060 -1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8283 8265 678.098997 678.099110 NaN 5.869571e+06 14308.530092 2.659190e+05 26.504457 -1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8284 8276 850.203647 850.203741 NaN 6.068290e+06 18905.512162 2.651333e+05 27.401784 -1 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

8285 rows × 30 columns

Calculate % of peaks assigned and extract molecular formulas, H/C and O/C values from the processed NOM data sets. If there are multiple matches for a m/z peak, filter to the match with the highest confidence score.

In [19]:
errors = {}
iteration_counter = 0
mol_dict=[]
multi_peakmatch_file_counter=0
multi_peakmatch_peak_counter=0

for index, row in final_df.iterrows():

    iteration_counter += 1

    # print an update for every 50 iterations
    if iteration_counter % 50 == 0:
        print(f"Processed {iteration_counter} rows")
    
    #save data set level information
    url = row["processed_nom_url"]
    processed=row['processed_nom_id']
    sample_type=row['sample_type']

    try:
        
        # get CSV data using URL
        response = requests.get(url)
        csv_data = StringIO(response.text)
        csv_df = pd.read_csv(csv_data)
        # Ensure 'Molecular Formula' is string type
        if 'Molecular Formula' in csv_df.columns:
            csv_df['Molecular Formula'] = csv_df['Molecular Formula'].astype(str)
        csv_data.close()

        #calculate assigned peak percent
        unassigned=csv_df['Molecular Formula'].isnull().sum()
        assigned=csv_df['Molecular Formula'].count()
        assigned_perc=assigned/(unassigned+assigned)

        # If there are no rows, don't include this csv, skip to the next one
        if assigned + unassigned == 0:
            print(f"No peaks found in {processed}, skipping file")
            continue
        
        # If there are some rows, and no assigned peaks, the data might fail the shape assertions and break the notebook
        # which is not informative since it would get quality filtered out anyway. Skip to the next one
        elif unassigned > 0 and assigned == 0:
            print(f"No assigned peaks found in {processed}, skipping file")
            continue

        # If there are some rows, and at least one assigned peak, assert that the important columns are correct
        # This we DO want to break if it's wrong
        elif assigned > 0:
            
            # Workflow Output Tests
            # assert that all four columns are found in the file
            assert set(['Molecular Formula','H/C','O/C','Confidence Score']).issubset(csv_df.columns), "Workflow output does not contain the required columns. Check for workflow changes"

            # assert that H/C and O/C and Confidence Score are numeric
            assert pd.api.types.is_numeric_dtype(csv_df['H/C']), "Column 'H/C' is not numeric. Check for workflow changes"
            assert pd.api.types.is_numeric_dtype(csv_df['O/C']), "Column 'O/C' is not numeric. Check for workflow changes"
            assert pd.api.types.is_numeric_dtype(csv_df['Confidence Score']), "Column 'Confidence Score' is not numeric. Check for workflow changes"
            assert pd.api.types.is_string_dtype(csv_df['Molecular Formula']), "Column 'Molecular Formula' is not a string. Check for workflow changes"
        


        #check if any peaks (m/z) have multiple matches. if so take highest scoring match

        #list of peaks
        peak_matches=csv_df["m/z"]
        #list of duplicated peaks in csv_df
        dup_peaks=peak_matches[peak_matches.duplicated()]
        multi_peakmatch_peak_counter+=len(dup_peaks)
        #if there are duplicate peaks:
        if len(dup_peaks) > 0:
            multi_peakmatch_file_counter+=1
            #group cv_df by m/z and filter to max confidence score
            idx = csv_df.groupby('m/z')['Confidence Score'].max()
            #merge back with original df (that has all columns) to filter duplicate peak entries
            csv_df=csv_df.merge(idx,on=['m/z','Confidence Score'])


        #make dictionary
        #columns to be made into lists
        mol_df=csv_df[['Molecular Formula','H/C','O/C','Confidence Score']]

        #drop unassigned peaks
        mol_df=mol_df.dropna(subset='Molecular Formula')

        #append dictionary
        mol_dict.append({'processed':processed,
                         'sample_type':sample_type,
                         'assigned_peak_count':assigned,
                         'assigned_perc':assigned_perc,
                         'mol_form':mol_df['Molecular Formula'].to_list(),
                         'H/C':mol_df['H/C'].to_list(),
                         'O/C':mol_df['O/C'].to_list(),
                         'Confidence Score':mol_df['Confidence Score'].to_list()
                        })
    
    #if error print info and raise error to stop notebook
    except AssertionError as ae:
        print(f"Assertion error occurred: {ae}")
        errors["processed_id"] = processed
        errors["url"] = url
        raise RuntimeError(f"Assertion error occurred: {ae}. Check for workflow changes. Processed ID: {processed}, URL: {url}")

    #if error print info and raise error to stop notebook
    except Exception as e:
        print(f"An error occurred: {e}")
        errors["processed_id"] = processed
        errors["url"] = url
        raise RuntimeError(f"An error occurred: {e}. Processed ID: {processed}, URL: {url}")

#turn dictionaries into dataframes
nom_summary_df=pd.DataFrame(mol_dict)

nom_summary_df
Processed 50 rows
Processed 100 rows
Processed 150 rows
Processed 200 rows
Out[19]:
processed sample_type assigned_peak_count assigned_perc mol_form H/C O/C Confidence Score
0 nmdc:dobj-11-25zsst16 soil 5752 0.694267 [C7 H6 O2, C6 H8 O3, C7 H12 O2, C5 H7 O3 N1, C... [0.8571428571428571, 1.3333333333333333, 1.714... [0.2857142857142857, 0.5, 0.2857142857142857, ... [0.5934833388337128, 0.596914443718794, 0.5978...
1 nmdc:dobj-11-e4qryz20 soil 1028 0.560523 [C44 H40 O8, C30 H56 O16, C38 H54 O8, C31 H37 ... [0.9090909090909092, 1.8666666666666667, 1.421... [0.1818181818181818, 0.5333333333333333, 0.210... [0.5851409597274598, 0.543097918522988, 0.5202...
2 nmdc:dobj-11-r53c1311 soil 421 0.306405 [C58 H106 O6, C57 H102 O7, C59 H108 O3 S1, C58... [1.827586206896552, 1.7894736842105263, 1.8305... [0.1034482758620689, 0.1228070175438596, 0.050... [0.5273508470525398, 0.5925717524521122, 0.927...
3 nmdc:dobj-11-y34zyh53 soil 3413 0.544512 [C8 H16 O2, C5 H9 O4 N1, C8 H8 O3, C7 H6 O4, C... [2.0, 1.8, 1.0, 0.8571428571428571, 1.0, 1.75,... [0.25, 0.8, 0.375, 0.5714285714285714, 0.83333... [0.5247820644831742, 0.5543848480703312, 0.570...
4 nmdc:dobj-11-5cqn4d45 soil 881 0.260882 [C6 H5 O3 N1, C10 H12 O2 S1, C12 H9 O2 N1, C10... [0.8333333333333334, 1.2, 0.75, 0.8, 1.7142857... [0.5, 0.2, 0.1666666666666666, 0.5, 1.14285714... [0.5834609245135588, 0.5296422518163069, 0.598...
... ... ... ... ... ... ... ... ...
195 nmdc:dobj-13-6vb25j63 sand 16 0.216216 [C47 H32 O13, C52 H30 O1 S1, C22 H34 O20, C25 ... [0.6808510638297872, 0.5769230769230769, 1.545... [0.2765957446808511, 0.0192307692307692, 0.909... [0.130944996607923, 0.0366332148635344, 0.2117...
196 nmdc:dobj-13-8dr0kz67 sand 239 0.726444 [C64 H87 O14 S1 N1, C60 H117 O15 S1 N1, C67 H1... [1.359375, 1.95, 1.9402985074626864, 1.5538461... [0.21875, 0.25, 0.1194029850746268, 0.15384615... [0.4579424406180491, 0.0674971930615655, 0.235...
197 nmdc:dobj-11-w9nkvy37 sand 281 0.420659 [C8 H13 O3 N1, C23 H11 O1 N1, C22 H13 O3 N1, C... [1.625, 0.4782608695652174, 0.5909090909090909... [0.375, 0.0434782608695652, 0.1363636363636363... [0.5613362123214602, 0.5849666907425796, 0.582...
198 nmdc:dobj-13-zrp1qw41 sand 10 0.108696 [C36 H40 O13 S1, C31 H38 O16, C30 H38 O16 13C1... [1.1111111111111112, 1.2258064516129032, 1.266... [0.3611111111111111, 0.5161290322580645, 0.533... [0.2974966222382019, 0.9157497661487416, 0.875...
199 nmdc:dobj-13-1hqwq310 sand 49 0.441441 [C37 H38 O19, C35 H40 O19, C32 H43 O17 N1, C44... [1.027027027027027, 1.1428571428571428, 1.3437... [0.5135135135135135, 0.5428571428571428, 0.531... [0.1219599911079808, 0.3260896155543082, 0.298...

200 rows × 8 columns

Perform quality control filtering of data sets using the chosen metadata information¶

Assess the number and percentage of peaks identified across files in each sample type. This will help set thresholds for data set filtering.

In [20]:
peak_count_plot=sns.FacetGrid(nom_summary_df,col="sample_type",hue="sample_type",sharex=False,sharey=False)
peak_count_plot.map(sns.histplot,'assigned_peak_count')
peak_count_plot.set_xlabels("number of identified peaks")
peak_count_plot.set_ylabels("data set count")
peak_count_plot.set_titles(col_template="{col_name}")

peak_perc_plot=sns.FacetGrid(nom_summary_df,col="sample_type",hue="sample_type",sharex=False,sharey=False)
peak_perc_plot.map(sns.histplot,'assigned_perc')
peak_perc_plot.set_xlabels("percent of peaks identified")
peak_perc_plot.set_ylabels("data set count")
peak_perc_plot.set_titles(col_template="{col_name}")
Out[20]:
<seaborn.axisgrid.FacetGrid at 0x7f5f02ccbe90>
No description has been provided for this image
No description has been provided for this image

Apply filters to obtain high quality processed NOM data sets without removing all the files from any one sample type. Based on the figures above, requiring files to have at least 250 identified peaks that account for at least 30% of their total peak count will maintain a healthy number of data sets across sample types.

In [21]:
#filter data sets according to stats on peak assignment
nom_filt=nom_summary_df[nom_summary_df['assigned_peak_count']>=250]
nom_filt=nom_filt[nom_filt['assigned_perc']>=0.25]

#expand listed columns in molecular formula df
nom_filt_expanded=nom_filt.explode(['mol_form','H/C','O/C','Confidence Score'])

#resave expanded columns as numeric
nom_filt_expanded['O/C']=pd.to_numeric(nom_filt_expanded['O/C'])
nom_filt_expanded['H/C']=pd.to_numeric(nom_filt_expanded['H/C'])
nom_filt_expanded['Confidence Score']=pd.to_numeric(nom_filt_expanded['Confidence Score'])

#metadata group column
grouping_column='sample_type'

#count number of datasets in each type
count_type=nom_filt_expanded[['processed',grouping_column]].drop_duplicates().value_counts(grouping_column)

count_type
Out[21]:
sample_type
soil        44
sediment    32
water       29
sand        23
Name: count, dtype: int64

Randomly sample a minimum number of data sets to visualize from each sample type

In [22]:
#determine sampling size based on counts above
n=min(pd.DataFrame(count_type)['count'])

#list the different types
list_type=count_type.index.tolist()

#for each type, randomly sample n data sets and save them into list
nom_sampled=[]
for type in list_type:
    #each processed ID and sample type
    nom_type=nom_filt_expanded[['processed',grouping_column]].drop_duplicates()
    #filter to current sample type
    nom_type=nom_type[nom_type[grouping_column]==type]
    #randomly sample n processed IDs in current sample type
    nom_type=nom_type.sample(n=n, random_state=2)
    #save
    nom_sampled.append(nom_type)

#resave list as dataframe
nom_sampled=pd.concat(nom_sampled)

#remerge rest of the data for the sampled data sets
nom_sampled=nom_sampled.merge(nom_filt_expanded,on=['processed',grouping_column],how="left")

nom_sampled
Out[22]:
processed sample_type assigned_peak_count assigned_perc mol_form H/C O/C Confidence Score
0 nmdc:dobj-11-bqqqrq34 soil 4134 0.841784 C7 H6 O2 0.857143 0.285714 0.599848
1 nmdc:dobj-11-bqqqrq34 soil 4134 0.841784 C7 H8 O2 1.142857 0.285714 0.599905
2 nmdc:dobj-11-bqqqrq34 soil 4134 0.841784 C6 H6 O3 1.000000 0.500000 0.599662
3 nmdc:dobj-11-bqqqrq34 soil 4134 0.841784 C6 H10 O3 1.666667 0.500000 0.599523
4 nmdc:dobj-11-bqqqrq34 soil 4134 0.841784 C6 H12 O3 2.000000 0.500000 0.599556
... ... ... ... ... ... ... ... ...
262420 nmdc:dobj-13-95ygfg27 sand 310 0.873239 C15 H30 O2 2.000000 0.133333 0.597332
262421 nmdc:dobj-13-95ygfg27 sand 310 0.873239 C14 H10 O4 0.714286 0.285714 0.598825
262422 nmdc:dobj-13-95ygfg27 sand 310 0.873239 C15 H28 O2 1.866667 0.133333 0.534896
262423 nmdc:dobj-13-95ygfg27 sand 310 0.873239 C8 H16 O6 S1 2.000000 0.750000 0.078921
262424 nmdc:dobj-13-95ygfg27 sand 310 0.873239 C7 H12 O6 S1 1.714286 0.857143 0.058991

262425 rows × 8 columns

Filter to high confidence peaks (score greater than 0.3) and high frequency molecular formulas (present in more than 5 data sets). This will leave us with the most informative data.

In [23]:
#histogram of confidence scores for randomly selected samples
nom_sampled.hist("Confidence Score",bins=np.arange(0,1,0.1))
plt.title('Confidence Scores from Sampled Data Sets')
plt.xlabel("Confidence Score")
plt.xticks(np.arange(0,1,0.1))
plt.ylabel("Frequency")
plt.show()

#peak filtering by confidence score
conf_filt=nom_sampled[nom_sampled['Confidence Score']>=0.3]

#count for each molecular formula
mol_counts=conf_filt.value_counts('mol_form').to_frame().reset_index()

#histogram of molecular formula counts
mol_counts.hist("count",bins=np.arange(0,50,5))
plt.locator_params(axis='x')
plt.title('Molecular Formula Counts Across Sampled Data Sets')
plt.xlabel("Number of Occurences Across Data Sets")
plt.xticks(np.arange(0,50,5))
plt.ylabel("Number of Molecular Formula")
plt.show()

#based on this histogram, filter to formulas in more than 5 data sets
mol_counts=mol_counts[mol_counts['count']>=5]
mol_filter=mol_counts.merge(conf_filt,on=['mol_form'],how="left")
No description has been provided for this image
No description has been provided for this image

Assess patterns in the molecular formulas from different sample types¶

Create a clustermap of the processed NOM data sets (x axis), indicating the presence (black) or absence (white) of molecular formulas (y axis). The color bar will indicate sample type and help visualize the molecular similarity of data sets both within and between sample types.

In [24]:
#sample type colors

#set colors for each sample type
type_col=pd.DataFrame({grouping_column:mol_filter[grouping_column].unique(),'color':['#8B4513','olivedrab','blue','#B8860B']})

#setup color legend for sample type
type_col_dict = dict(zip(type_col[grouping_column].unique(), ['#8B4513','olivedrab','blue','#B8860B']))
handles = [Patch(facecolor=type_col_dict[name]) for name in type_col_dict]

#map graph colors based on sample type to processed IDs
sample_type=mol_filter[[grouping_column,'processed']].drop_duplicates()
sample_type_col=sample_type.merge(type_col,how='left',on=grouping_column).set_index('processed').drop(grouping_column,axis=1).rename(columns={'color':grouping_column})

#presence/absence

#set colors for presence/absence as cmap
colors=['white','black']
custom_palette = sns.color_palette(colors)
custom_cmap = sns.color_palette(custom_palette, as_cmap=True)

#add column indicating presence in that processed nom id (1 is present)
mol_filter['presence']=1

#create presence/absence matrix. replace NA with 0 (0 is absent)
formula_matrix=mol_filter[['mol_form','processed','presence']].pivot_table('presence', index='mol_form', columns='processed').fillna(0).astype(int)

#heatmap (1 is present, zero is absent)
g=sns.clustermap(data=formula_matrix,col_colors=sample_type_col,tree_kws={"linewidths": 0.},xticklabels=False,yticklabels=False,cmap=custom_cmap)
g.figure.suptitle("Presence of Molecular Formulas Across NOM Data Sets")
g.ax_heatmap.set_xlabel("Processed NOM Data Sets")
g.ax_heatmap.set_ylabel("Molecular Formulas")

#adjust plot and legend locations
g.figure.subplots_adjust(top=1.15,right=0.8)
g.ax_cbar.set_position((0.85, 0.45, .05, .05)) #x axis,y axis, width, height

#adjust cbar legend to indicate presence/absence
g.ax_cbar.set_yticks([0.25,0.75])
g.ax_cbar.set_yticklabels(["absent","present"])

#add sample type legend
plt.legend(handles, type_col_dict, title=None,
           bbox_to_anchor=(0.95, 1.05), bbox_transform=plt.gcf().transFigure)
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance.
  warnings.warn(msg)
Out[24]:
<matplotlib.legend.Legend at 0x7f5f02090b50>
No description has been provided for this image

Create Van Krevelen diagrams to visually assess the atomic composition of each sample type. Van Krevelen diagrams plot the hydrogen to carbon ratio against the oxygen to carbon ratio, historically to assess petroleum samples.

In [25]:
# the same molecular formula will have the same H/C and O/C value in every data set, so dots are only unique per sample type
vankrev_data=mol_filter[[grouping_column,'mol_form','H/C','O/C']].drop_duplicates()

#make van krevlen plot
g=sns.FacetGrid(vankrev_data,col=grouping_column,hue=grouping_column,palette=type_col_dict)
g.map(sns.scatterplot,'O/C','H/C',s=1.5)
g.set_titles(col_template="{col_name}")
g.figure.suptitle("Van Krevelen Plots By General Sample Type")
g.figure.subplots_adjust(top=0.8)
No description has been provided for this image

Create marginal density plot to assess how molecular formulas unique to each sample type compare to those shared by all four sample types.

In [26]:
#count the number of sample types in which each molecular formula is found
density_counts=vankrev_data['mol_form'].value_counts()

#map counts back to H/C and O/C data
vankrev_data['common_to'] = vankrev_data['mol_form'].map(density_counts)

#filter molecular formual present in either all four sample types or in only one
vankrev_data=vankrev_data[vankrev_data['common_to'].isin([1,4])]

#reformat so sample type is 'all' when molecular formula was in all
vankrev_data.loc[vankrev_data['common_to']==4,'sample_type']='all types'

#add 'all' to sample type color dictionary
type_col_dict.update({'all types':'black'})

#make marginal density plot
sns.jointplot(data=vankrev_data, x="O/C", y="H/C", kind="scatter", hue=grouping_column,palette=type_col_dict,s=5,alpha=0.5)
plt.legend(markerscale=3,title="unique to")
plt.suptitle("Marginal Density Plot of Molecular Formulas by Sample Type",y=1.02)
plt.show()
No description has been provided for this image