How does the taxonomic distribution of contigs differ by soil layer (mineral vs organic) in Colorado?¶

This notebook uses the nmdc_api_utilities package (as of March 2025) to explore how the taxononomic distribution of contigs differ by the mineral and organic soil layers in Colorado. It involves using nmdc_api_utilites objects to make NMDC API requests to reach the scaffold lineage TSV data objects in order to analyze the taxanomic distribution. Iterating through the TSV files includes 350+ API calls to get the necessary taxonomic counts and is time consuming.

In [4]:
import requests
import pandas as pd
from io import StringIO
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import nmdc_api_utilities
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/kaleido/_sync_server.py:11: UserWarning:



Warning: You have Plotly version 5.18.0, which is not compatible with this version of Kaleido (1.2.0).

This means that static image generation (e.g. `fig.write_image()`) will not work.

Please upgrade Plotly to version 6.1.1 or greater, or downgrade Kaleido to version 0.2.1.


1. Get all biosamples where soil_horizon exists and the geo_loc_name has "Colorado" in the name¶

The first step in answering how the taxonomic distribution of contigs differ by soil layer is to get a list of all the biosamples that have metadata for soil_horizon and a string matching "Colorado, Rocky Moutains" for the geo_loc_name. Using the Python package 'nmdc_api_utilities', we can use the get_record_by_filter function to do this. We first need create a BiosampleSearch object to search across the "biosample_set" collections. More information regarding the nmdc_api_utilities package can be found here. We then create a mongo-like filter of {"soil_horizon":{"$exists": true}, "geo_loc_name.has_raw_value": {"$regex": "Colorado"}}, a maximum page size of 100, and specifying that we want three fields returned id, soil_horizon, and geo_loc_name. Note that id is returned no matter what. Since we will be joining the results of multiple API requests with a field of id for different collections, we can change the name of the id key to be more explicit - calling it biosample_id instead. Finally, we convert the biosample results to a dataframe called biosample_df. Note that about 517 biosamples are returned.

In [5]:
from nmdc_api_utilities.biosample_search import BiosampleSearch
from nmdc_api_utilities.data_processing import DataProcessing
# Create a BiosampleSearch object
bs_client = BiosampleSearch(env=ENV)
# create a DataProcessing object
dp_client = DataProcessing()
# define the filter
filter = '{"soil_horizon":{"$exists": true}, "geo_loc_name.has_raw_value": {"$regex": "Colorado"}}'
# get the results
bs_results = bs_client.get_record_by_filter(filter=filter, fields="id,soil_horizon,geo_loc_name", max_page_size=100, all_pages=True)
# clarify names
for biosample in bs_results:
    biosample["biosample_id"] = biosample.pop("id")

# convert to df
biosample_df = dp_client.convert_to_df(bs_results)

# Adjust geo_loc_name to not be a dictionary
biosample_df["geo_loc_name"] = biosample_df["geo_loc_name"].apply(lambda x: x.get("has_raw_value"))
biosample_df
Out[5]:
soil_horizon geo_loc_name biosample_id
0 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-00m15h97
1 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-06ta8e31
2 O horizon USA: Colorado, Rocky Mountains nmdc:bsm-11-06tgpb52
3 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-0asn5d63
4 M horizon USA: Colorado, North Sterling nmdc:bsm-11-0djp2e45
... ... ... ...
513 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-zhrzwh12
514 M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zhzner35
515 O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zjsrkd21
516 O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zk6h3328
517 M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-znvc3c66

518 rows × 3 columns

Define a function to split a list into chunks¶

Since we will need to use a list of ids in filter to query a new collection in the API, we need to limit the number of ids we put in a filter. This function splits a list into chunks of 100. Note that the chunk_size has a default of 100, but can be adjusted.

In [6]:
# Define a function to split ids into chunks
def split_list(input_list, chunk_size=100):
    result = []
    
    for i in range(0, len(input_list), chunk_size):
        result.append(input_list[i:i + chunk_size])
        
    return result

Define a function to get a list of ids from initial results¶

This function creates a list of identifiers from the input list of responses from the nmdc_api_utilities functions. It use id_name key from the results to create a list of all the ids. The input is the initial result list and the name of the id field.

In [7]:
def get_id_list(result_list: list, id_name: str):
    id_list = []
    for item in result_list:
        if type(item[id_name]) == str:
            id_list.append(item[id_name])
        elif type(item[id_name]) == list:
            for another_item in item[id_name]:
                id_list.append(another_item)

    return id_list

2. Get all Pooling results where the Pooling has_input are the biosample ids¶

We want to query the material processing collection, so we create a MaterialProcessingSearch object. We use the get_record_by_filter function from this object get a list of all pooling results whose field, has_input are the biosample_ids we retrieved in step 1. As touched on earlier, we also want to ensure we are not passing too many ids into a filter at once, so we utilize the get_id_list and split_list functions to create chunks and iterate over. We will return the results for has_output as well and clean up the names so it is clear which collection the results are from. We also create a filter to query records where type is nmdc:Pooling. Finally, the pooling results are converted to a data frame.

In [8]:
from nmdc_api_utilities.material_processing_search import MaterialProcessingSearch

# create a MaterialProcessingSearch object
mp_client = MaterialProcessingSearch(env=ENV)
# create a DataProcessing object
dp_client = DataProcessing()
# process the biosamples in chunks
result_ids = get_id_list(bs_results, "biosample_id")
chunked_list = split_list(result_ids)
pooling = []
for chunk in chunked_list:
    # create the filter - query the material_processing_set collection looking for data objects that have the biosample_id in the has_input field and are of type nmdc:Pooling
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:Pooling", "has_input": {{"$in": {filter_list}}}}}'
    # get the results
    pooling += mp_client.get_record_by_filter(filter=filter, fields="id,has_input,has_output", max_page_size=100, all_pages=True)
# clarify names/keys/identifiers
for pool in pooling:
    pool["pooling_has_input"] = pool.pop("has_input")
    pool["pooling_has_output"] = pool.pop("has_output")
    pool["pooling_id"] = pool.pop("id")

pooling_df = dp_client.convert_to_df(pooling)
pooling_df
Out[8]:
pooling_has_input pooling_has_output pooling_id
0 [nmdc:bsm-11-5228zz06, nmdc:bsm-11-1frj0t76, n... [nmdc:procsm-11-49bwy122] nmdc:poolp-11-a1nnyd94
1 [nmdc:bsm-11-e0qcsb54, nmdc:bsm-11-3admsx52, n... [nmdc:procsm-11-cnz65b78] nmdc:poolp-11-gc19j338
2 [nmdc:bsm-11-ex491068, nmdc:bsm-11-1byjjh32, n... [nmdc:procsm-11-kngzyt90] nmdc:poolp-11-sj9jpg87
3 [nmdc:bsm-11-ehyv5z41, nmdc:bsm-11-48nzey88, n... [nmdc:procsm-11-9th0yt69] nmdc:poolp-11-rx280a54
4 [nmdc:bsm-11-2744k638, nmdc:bsm-11-85vfjq03, n... [nmdc:procsm-11-mdcbpc97] nmdc:poolp-11-w8b7cv95
... ... ... ...
398 [nmdc:bsm-11-znvc3c66, nmdc:bsm-11-wsr4vx16, n... [nmdc:procsm-11-dvq1cx16] nmdc:poolp-11-b13j8g68
399 [nmdc:bsm-11-4k0jmb52, nmdc:bsm-11-ydtfff55, n... [nmdc:procsm-11-mpcvhx03] nmdc:poolp-11-1rp6ns28
400 [nmdc:bsm-11-sgtk2z38, nmdc:bsm-11-xqtg8327, n... [nmdc:procsm-11-f6kc8b10] nmdc:poolp-11-ykrp9878
401 [nmdc:bsm-11-yzpe6s26, nmdc:bsm-11-zfvcsy45, n... [nmdc:procsm-11-wm0mqq15] nmdc:poolp-11-bsnbr836
402 [nmdc:bsm-11-zk6h3328, nmdc:bsm-11-kft4w435, n... [nmdc:procsm-11-e015da88] nmdc:poolp-11-658v9v07

403 rows × 3 columns

2.5 Merge biosample and pooling results¶

We utilize the DataProcessing object's merge_df function to merge the newly acquired pooling results with the original biosample results obtained from the package in step 1. We use the pooling_has_input and biosample_id from the two data frames as key names to merge on.

In [9]:
merged_df1 = dp_client.merge_df(pooling_df, biosample_df, "pooling_has_input", "biosample_id")
merged_df1
Out[9]:
pooling_has_input pooling_has_output pooling_id soil_horizon geo_loc_name biosample_id
0 nmdc:bsm-11-5228zz06 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-5228zz06
1 nmdc:bsm-11-1frj0t76 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-1frj0t76
2 nmdc:bsm-11-nyxsx333 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-nyxsx333
3 nmdc:bsm-11-e0qcsb54 nmdc:procsm-11-cnz65b78 nmdc:poolp-11-gc19j338 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-e0qcsb54
4 nmdc:bsm-11-3admsx52 nmdc:procsm-11-cnz65b78 nmdc:poolp-11-gc19j338 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-3admsx52
... ... ... ... ... ... ...
1087 nmdc:bsm-11-xqtg8327 nmdc:procsm-11-f6kc8b10 nmdc:poolp-11-ykrp9878 M horizon USA: Colorado, North Sterling nmdc:bsm-11-xqtg8327
1088 nmdc:bsm-11-z5cmyh06 nmdc:procsm-11-f6kc8b10 nmdc:poolp-11-ykrp9878 M horizon USA: Colorado, North Sterling nmdc:bsm-11-z5cmyh06
1107 nmdc:bsm-11-stjpwh75 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-stjpwh75
1108 nmdc:bsm-11-xngp2r34 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-xngp2r34
1109 nmdc:bsm-11-yehx2807 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-yehx2807

505 rows × 6 columns

3. Get processed sample results where the the processed sample ids are the pooling_has_output field¶

Since we want to query the processed sample collection, we create a ProcessedSampleSearch object and use the get_record_by_filter function through this object. This provides a list of the processed sample results whose field, pooling_has_output are the processed sample ids. We will return the results only for the processed sample id field and clean up the names so it is clear that they are the identifiers from the processed_sample_set. Finally, the results are converted to a data frame.

In [10]:
from nmdc_api_utilities.processed_sample_search import ProcessedSampleSearch

# create a ProcessedSampleSearch object
ps_client = ProcessedSampleSearch(env=ENV)
# process the pooling in chunks
result_ids = get_id_list(pooling, "pooling_has_output")
chunked_list = split_list(result_ids)
process_set1 = []
for chunk in chunked_list:
    # create the filter - query the material_processing_set collection looking for data objects that have the biosample_id in the has_input field and are of type nmdc:Pooling
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:ProcessedSample", "id": {{"$in": {filter_list}}}}}'
    # get the results
    process_set1 += ps_client.get_record_by_filter(filter=filter, fields="id", max_page_size=100, all_pages=True)

# clarify names
for processed_sample in process_set1:
    processed_sample["processed_sample1"] = processed_sample.pop("id")

ps1_df = dp_client.convert_to_df(process_set1)
ps1_df
Out[10]:
processed_sample1
0 nmdc:procsm-11-1sr06083
1 nmdc:procsm-11-258vbz70
2 nmdc:procsm-11-2fxf0e98
3 nmdc:procsm-11-2xvsb693
4 nmdc:procsm-11-33n4p085
... ...
364 nmdc:procsm-11-ztam2998
365 nmdc:procsm-11-zw2k5d74
366 nmdc:procsm-11-e015da88
367 nmdc:procsm-11-f6kc8b10
368 nmdc:procsm-11-wm0mqq15

369 rows × 1 columns

3.5 Merge processed sample results with the previously merged results¶

The merge_df function is used, once again, to merge the pooling and processed sample results on the pooling_has_output and processed_sample1 keys for the two data frames.

In [11]:
merged_df2 = dp_client.merge_df(merged_df1, ps1_df, "pooling_has_output", "processed_sample1")
merged_df2
Out[11]:
pooling_has_input pooling_has_output pooling_id soil_horizon geo_loc_name biosample_id processed_sample1
0 nmdc:bsm-11-5228zz06 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-5228zz06 nmdc:procsm-11-49bwy122
2 nmdc:bsm-11-1frj0t76 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-1frj0t76 nmdc:procsm-11-49bwy122
4 nmdc:bsm-11-nyxsx333 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-nyxsx333 nmdc:procsm-11-49bwy122
6 nmdc:bsm-11-e0qcsb54 nmdc:procsm-11-cnz65b78 nmdc:poolp-11-gc19j338 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-e0qcsb54 nmdc:procsm-11-cnz65b78
9 nmdc:bsm-11-3admsx52 nmdc:procsm-11-cnz65b78 nmdc:poolp-11-gc19j338 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-3admsx52 nmdc:procsm-11-cnz65b78
... ... ... ... ... ... ... ...
1069 nmdc:bsm-11-xqtg8327 nmdc:procsm-11-f6kc8b10 nmdc:poolp-11-ykrp9878 M horizon USA: Colorado, North Sterling nmdc:bsm-11-xqtg8327 nmdc:procsm-11-f6kc8b10
1071 nmdc:bsm-11-z5cmyh06 nmdc:procsm-11-f6kc8b10 nmdc:poolp-11-ykrp9878 M horizon USA: Colorado, North Sterling nmdc:bsm-11-z5cmyh06 nmdc:procsm-11-f6kc8b10
1073 nmdc:bsm-11-stjpwh75 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-stjpwh75 nmdc:procsm-11-dkr9k079
1074 nmdc:bsm-11-xngp2r34 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-xngp2r34 nmdc:procsm-11-dkr9k079
1075 nmdc:bsm-11-yehx2807 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-yehx2807 nmdc:procsm-11-dkr9k079

505 rows × 7 columns

4 Get extraction results where processed_sample1 identifier is the has_input to the material_processing_set for Extractions¶

We want to query the material_processing_set, so we use the MaterialProcessing object created earlier, along with the get_record_by_filter function, again (you can see the pattern), but this time we filter where type is nmdc:Extraction, using the processed_sample1 identifier as the has_input for the collection. The names of the fields in the results are adjusted to make it clear which set the inputs, outputs, and ids are from.

In [12]:
# process the processed samples in chunks
result_ids = get_id_list(process_set1, "processed_sample1")
chunked_list = split_list(result_ids)
extraction_set = []
for chunk in chunked_list:
    # create the filter - query the material_processing_set collection looking for data objects that have the biosample_id in the has_input field and are of type nmdc:Pooling
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:Extraction", "has_input": {{"$in": {filter_list}}}}}'
    # get the results
    extraction_set += mp_client.get_record_by_filter(filter=filter, fields="id,has_input,has_output", max_page_size=100, all_pages=True)
# clarify names
for extraction in extraction_set:
    extraction["extract_has_input"] = extraction.pop("has_input")
    extraction["extract_has_output"] = extraction.pop("has_output")
    extraction["extract_id"] = extraction.pop("id")

# convert to data frame
extract_df = dp_client.convert_to_df(extraction_set)
extract_df
Out[12]:
extract_has_input extract_has_output extract_id
0 [nmdc:procsm-11-49bwy122] [nmdc:procsm-11-kwaaah42] nmdc:extrp-11-fsv8td81
1 [nmdc:procsm-11-s61wwe09] [nmdc:procsm-11-hnd2nm64] nmdc:extrp-11-8q3xp262
2 [nmdc:procsm-11-cnz65b78] [nmdc:procsm-11-sxnqtz74] nmdc:extrp-11-3334yj37
3 [nmdc:procsm-11-kngzyt90] [nmdc:procsm-11-h9s7h174] nmdc:extrp-11-v25scb12
4 [nmdc:procsm-11-fyx7js23] [nmdc:procsm-11-4yevrf17] nmdc:extrp-11-4frcnb65
... ... ... ...
349 [nmdc:procsm-11-w13fqp71] [nmdc:procsm-11-xbhs5x61] nmdc:extrp-11-7km6zh80
350 [nmdc:procsm-11-eecpt338] [nmdc:procsm-11-gxvm5r54] nmdc:extrp-11-73jns979
351 [nmdc:procsm-11-kxs8m249] [nmdc:procsm-11-kee8xv47] nmdc:extrp-11-g1cazp42
352 [nmdc:procsm-11-rbfspv43] [nmdc:procsm-11-6fat7f34] nmdc:extrp-11-b7kcx022
353 [nmdc:procsm-11-e015da88] [nmdc:procsm-11-878yka43] nmdc:extrp-11-ganvz782

354 rows × 3 columns

4.5 Merge extraction results with the previously merged results¶

The extraction results obtained above are merged with the previously merged results (from step 3.5) using the processed_sample1 field in the previously merged data frame with the extract_has_input from the new extraction results.

In [13]:
merged_df3 = dp_client.merge_df(extract_df, merged_df2, "extract_has_input", "processed_sample1")
merged_df3
Out[13]:
extract_has_input extract_has_output extract_id pooling_has_input pooling_has_output pooling_id soil_horizon geo_loc_name biosample_id processed_sample1
0 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42 nmdc:extrp-11-fsv8td81 nmdc:bsm-11-5228zz06 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-5228zz06 nmdc:procsm-11-49bwy122
1 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42 nmdc:extrp-11-fsv8td81 nmdc:bsm-11-1frj0t76 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-1frj0t76 nmdc:procsm-11-49bwy122
2 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42 nmdc:extrp-11-fsv8td81 nmdc:bsm-11-nyxsx333 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-nyxsx333 nmdc:procsm-11-49bwy122
3 nmdc:procsm-11-s61wwe09 nmdc:procsm-11-hnd2nm64 nmdc:extrp-11-8q3xp262 nmdc:bsm-11-pd429a61 nmdc:procsm-11-s61wwe09 nmdc:poolp-11-t5n1et05 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-pd429a61 nmdc:procsm-11-s61wwe09
4 nmdc:procsm-11-s61wwe09 nmdc:procsm-11-hnd2nm64 nmdc:extrp-11-8q3xp262 nmdc:bsm-11-9yn2fq77 nmdc:procsm-11-s61wwe09 nmdc:poolp-11-t5n1et05 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-9yn2fq77 nmdc:procsm-11-s61wwe09
... ... ... ... ... ... ... ... ... ... ...
981 nmdc:procsm-11-f6kc8b10 nmdc:procsm-11-64ksxw87 nmdc:extrp-11-vg3vzm96 nmdc:bsm-11-xqtg8327 nmdc:procsm-11-f6kc8b10 nmdc:poolp-11-ykrp9878 M horizon USA: Colorado, North Sterling nmdc:bsm-11-xqtg8327 nmdc:procsm-11-f6kc8b10
982 nmdc:procsm-11-f6kc8b10 nmdc:procsm-11-64ksxw87 nmdc:extrp-11-vg3vzm96 nmdc:bsm-11-z5cmyh06 nmdc:procsm-11-f6kc8b10 nmdc:poolp-11-ykrp9878 M horizon USA: Colorado, North Sterling nmdc:bsm-11-z5cmyh06 nmdc:procsm-11-f6kc8b10
1001 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:extrp-11-k86nz804 nmdc:bsm-11-stjpwh75 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-stjpwh75 nmdc:procsm-11-dkr9k079
1002 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:extrp-11-k86nz804 nmdc:bsm-11-xngp2r34 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-xngp2r34 nmdc:procsm-11-dkr9k079
1003 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:extrp-11-k86nz804 nmdc:bsm-11-yehx2807 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-yehx2807 nmdc:procsm-11-dkr9k079

505 rows × 10 columns

5 Get processed sample results from the output of the extraction results¶

We utilize the ProcessedSampleSearch object again, but this time using the extract_has_output ids to query the set. We only need to return the processed_sample_set identifiers.

In [14]:
# process the processed samples in chunks
result_ids = get_id_list(extraction_set, "extract_has_output")
chunked_list = split_list(result_ids)
process_set2 = []
for chunk in chunked_list:
    # create the filter - query the processed_sample_set collection looking for data objects that have the biosample_id in the has_input field and are of type nmdc:Pooling
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:ProcessedSample", "id": {{"$in": {filter_list}}}}}'
    # get the results
    process_set2 += ps_client.get_record_by_filter(filter=filter, fields="id", max_page_size=100, all_pages=True)
# clarify names
for samp in process_set2:
    samp["processed_sample2"] = samp.pop("id")

# convert to data frame
ps2_df = dp_client.convert_to_df(process_set2)
ps2_df
Out[14]:
processed_sample2
0 nmdc:procsm-11-0qx90z87
1 nmdc:procsm-11-0wxpzf07
2 nmdc:procsm-11-1bzpzq15
3 nmdc:procsm-11-1qfgdd16
4 nmdc:procsm-11-1qgqxz62
... ...
343 nmdc:procsm-11-xq1t3650
344 nmdc:procsm-11-yav08109
345 nmdc:procsm-11-ydtgc517
346 nmdc:procsm-11-ze0gdq03
347 nmdc:procsm-11-zr4x7712

348 rows × 1 columns

5.5 Merge the second processed_set results with the previous merged results¶

Using the merge_df function again, the processed_sample2 results are merged with the previously merged set (output of step 4.5) using the processed_sample2 identifiers and the extract_has_output identifiers from the merged set.

In [15]:
merged_df4 = dp_client.merge_df(merged_df3, ps2_df, "extract_has_output", "processed_sample2")
merged_df4
Out[15]:
extract_has_input extract_has_output extract_id pooling_has_input pooling_has_output pooling_id soil_horizon geo_loc_name biosample_id processed_sample1 processed_sample2
0 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42 nmdc:extrp-11-fsv8td81 nmdc:bsm-11-5228zz06 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-5228zz06 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42
1 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42 nmdc:extrp-11-fsv8td81 nmdc:bsm-11-1frj0t76 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-1frj0t76 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42
2 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42 nmdc:extrp-11-fsv8td81 nmdc:bsm-11-nyxsx333 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-nyxsx333 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42
3 nmdc:procsm-11-s61wwe09 nmdc:procsm-11-hnd2nm64 nmdc:extrp-11-8q3xp262 nmdc:bsm-11-pd429a61 nmdc:procsm-11-s61wwe09 nmdc:poolp-11-t5n1et05 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-pd429a61 nmdc:procsm-11-s61wwe09 nmdc:procsm-11-hnd2nm64
5 nmdc:procsm-11-s61wwe09 nmdc:procsm-11-hnd2nm64 nmdc:extrp-11-8q3xp262 nmdc:bsm-11-9yn2fq77 nmdc:procsm-11-s61wwe09 nmdc:poolp-11-t5n1et05 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-9yn2fq77 nmdc:procsm-11-s61wwe09 nmdc:procsm-11-hnd2nm64
... ... ... ... ... ... ... ... ... ... ... ...
1008 nmdc:procsm-11-f6kc8b10 nmdc:procsm-11-64ksxw87 nmdc:extrp-11-vg3vzm96 nmdc:bsm-11-xqtg8327 nmdc:procsm-11-f6kc8b10 nmdc:poolp-11-ykrp9878 M horizon USA: Colorado, North Sterling nmdc:bsm-11-xqtg8327 nmdc:procsm-11-f6kc8b10 nmdc:procsm-11-64ksxw87
1009 nmdc:procsm-11-f6kc8b10 nmdc:procsm-11-64ksxw87 nmdc:extrp-11-vg3vzm96 nmdc:bsm-11-z5cmyh06 nmdc:procsm-11-f6kc8b10 nmdc:poolp-11-ykrp9878 M horizon USA: Colorado, North Sterling nmdc:bsm-11-z5cmyh06 nmdc:procsm-11-f6kc8b10 nmdc:procsm-11-64ksxw87
1010 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:extrp-11-k86nz804 nmdc:bsm-11-stjpwh75 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-stjpwh75 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47
1011 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:extrp-11-k86nz804 nmdc:bsm-11-xngp2r34 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-xngp2r34 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47
1012 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:extrp-11-k86nz804 nmdc:bsm-11-yehx2807 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-yehx2807 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47

505 rows × 11 columns

6. Use MaterialProcessingSearch to get records where type is nmdc:LibraryPreparation¶

Using the processed_sample2 identifiers from the last query as the has_input to filter where type is nmdc:LibraryPreparation, we get a new batch of results, returning the library preparation identifiers, inputs and outputs. The field names are clarified to demonstrate they are from the MaterialProcessingSearch object where the type is nmdc:LibraryPreparation.

In [16]:
# process the material_processing in chunks
result_ids = get_id_list(process_set2, "processed_sample2")
chunked_list = split_list(result_ids)
library_prep_set = []
for chunk in chunked_list:
    # create the filter - query the material_processing_set collection looking for data objects that have the biosample_id in the has_input field and are of type nmdc:Pooling
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:LibraryPreparation", "has_input": {{"$in": {filter_list}}}}}'
    # get the results
    library_prep_set += mp_client.get_record_by_filter(filter=filter, fields="id,has_input,has_output", max_page_size=100, all_pages=True)
# clarify names
for prep in library_prep_set:
    prep["lp_has_input"] = prep.pop("has_input")
    prep["lp_has_output"] = prep.pop("has_output")
    prep["lp_id"] = prep.pop("id")

# convert to data frame
lp_df = dp_client.convert_to_df(library_prep_set)
lp_df
Out[16]:
lp_has_input lp_has_output lp_id
0 [nmdc:procsm-11-nay11727] [nmdc:procsm-11-9pdkj890] nmdc:libprp-11-2tnjjj55
1 [nmdc:procsm-11-kwaaah42] [nmdc:procsm-11-kfkbxp22] nmdc:libprp-11-k5j44e20
2 [nmdc:procsm-11-hnd2nm64] [nmdc:procsm-11-as6w8f18] nmdc:libprp-11-h2hy8z17
3 [nmdc:procsm-11-7qy2y664] [nmdc:procsm-11-wd4s5f38] nmdc:libprp-11-wv6p0032
4 [nmdc:procsm-11-sxnqtz74] [nmdc:procsm-11-f06scg15] nmdc:libprp-11-ctwynj07
... ... ... ...
340 [nmdc:procsm-11-kpw8j244] [nmdc:procsm-11-1v407908] nmdc:libprp-11-8ra09y76
341 [nmdc:procsm-11-zr4x7712] [nmdc:procsm-11-vhfb5c18] nmdc:libprp-11-12ph5n93
342 [nmdc:procsm-11-kee8xv47] [nmdc:procsm-11-1eg4r286] nmdc:libprp-11-x8nqhq06
343 [nmdc:procsm-11-6fat7f34] [nmdc:procsm-11-gm915e24] nmdc:libprp-11-874cdm88
344 [nmdc:procsm-11-878yka43] [nmdc:procsm-11-t66cxk50] nmdc:libprp-11-4g7pfm95

345 rows × 3 columns

6.5 Merge library preparation results with previously merged results¶

The library preparation results are merged with the previous results (from step 5.5) using the lp_has_input and the processed_sample2 fields.

In [17]:
merged_df5 = dp_client.merge_df(lp_df, merged_df4, "lp_has_input", "processed_sample2")
merged_df5
Out[17]:
lp_has_input lp_has_output lp_id extract_has_input extract_has_output extract_id pooling_has_input pooling_has_output pooling_id soil_horizon geo_loc_name biosample_id processed_sample1 processed_sample2
0 nmdc:procsm-11-nay11727 nmdc:procsm-11-9pdkj890 nmdc:libprp-11-2tnjjj55 nmdc:procsm-11-8ec7zx31 nmdc:procsm-11-nay11727 nmdc:extrp-11-574dws05 nmdc:bsm-11-w43vsm21 nmdc:procsm-11-8ec7zx31 nmdc:poolp-11-0ak13p40 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-w43vsm21 nmdc:procsm-11-8ec7zx31 nmdc:procsm-11-nay11727
1 nmdc:procsm-11-nay11727 nmdc:procsm-11-9pdkj890 nmdc:libprp-11-2tnjjj55 nmdc:procsm-11-8ec7zx31 nmdc:procsm-11-nay11727 nmdc:extrp-11-574dws05 nmdc:bsm-11-dbavm335 nmdc:procsm-11-8ec7zx31 nmdc:poolp-11-0ak13p40 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-dbavm335 nmdc:procsm-11-8ec7zx31 nmdc:procsm-11-nay11727
2 nmdc:procsm-11-nay11727 nmdc:procsm-11-9pdkj890 nmdc:libprp-11-2tnjjj55 nmdc:procsm-11-8ec7zx31 nmdc:procsm-11-nay11727 nmdc:extrp-11-574dws05 nmdc:bsm-11-4c6er508 nmdc:procsm-11-8ec7zx31 nmdc:poolp-11-0ak13p40 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-4c6er508 nmdc:procsm-11-8ec7zx31 nmdc:procsm-11-nay11727
3 nmdc:procsm-11-kwaaah42 nmdc:procsm-11-kfkbxp22 nmdc:libprp-11-k5j44e20 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42 nmdc:extrp-11-fsv8td81 nmdc:bsm-11-5228zz06 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-5228zz06 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42
4 nmdc:procsm-11-kwaaah42 nmdc:procsm-11-kfkbxp22 nmdc:libprp-11-k5j44e20 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42 nmdc:extrp-11-fsv8td81 nmdc:bsm-11-1frj0t76 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-1frj0t76 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
961 nmdc:procsm-11-64ksxw87 nmdc:procsm-11-bqe26091 nmdc:libprp-11-4ebzbm49 nmdc:procsm-11-f6kc8b10 nmdc:procsm-11-64ksxw87 nmdc:extrp-11-vg3vzm96 nmdc:bsm-11-xqtg8327 nmdc:procsm-11-f6kc8b10 nmdc:poolp-11-ykrp9878 M horizon USA: Colorado, North Sterling nmdc:bsm-11-xqtg8327 nmdc:procsm-11-f6kc8b10 nmdc:procsm-11-64ksxw87
962 nmdc:procsm-11-64ksxw87 nmdc:procsm-11-bqe26091 nmdc:libprp-11-4ebzbm49 nmdc:procsm-11-f6kc8b10 nmdc:procsm-11-64ksxw87 nmdc:extrp-11-vg3vzm96 nmdc:bsm-11-z5cmyh06 nmdc:procsm-11-f6kc8b10 nmdc:poolp-11-ykrp9878 M horizon USA: Colorado, North Sterling nmdc:bsm-11-z5cmyh06 nmdc:procsm-11-f6kc8b10 nmdc:procsm-11-64ksxw87
972 nmdc:procsm-11-s8m02r47 nmdc:procsm-11-68j9y310 nmdc:libprp-11-rz4mr176 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:extrp-11-k86nz804 nmdc:bsm-11-stjpwh75 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-stjpwh75 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47
973 nmdc:procsm-11-s8m02r47 nmdc:procsm-11-68j9y310 nmdc:libprp-11-rz4mr176 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:extrp-11-k86nz804 nmdc:bsm-11-xngp2r34 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-xngp2r34 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47
974 nmdc:procsm-11-s8m02r47 nmdc:procsm-11-68j9y310 nmdc:libprp-11-rz4mr176 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:extrp-11-k86nz804 nmdc:bsm-11-yehx2807 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-yehx2807 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47

505 rows × 14 columns

7 Get third set of proccessed samples from the library preparation output¶

For a third, and last time, we use the ProcessedSampleSearch object, creating the filter using the lp_has_output identifiers. We only return the id field (as processed_sample3)

In [18]:
# process the processed_sample_set in chunks
result_ids = get_id_list(library_prep_set, "lp_has_output")
chunked_list = split_list(result_ids)
process_set3 = []
for chunk in chunked_list:
    # create the filter - query the processed_sample_set collection looking for data objects that have the biosample_id in the has_input field and are of type nmdc:Pooling
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:ProcessedSample", "id": {{"$in": {filter_list}}}}}'
    # get the results
    process_set3 += ps_client.get_record_by_filter(filter=filter, fields="id", max_page_size=100, all_pages=True)
# clarify keys
for samp in process_set3:
    samp["processed_sample3"] = samp.pop("id")

# convert to data frame
ps3_df = dp_client.convert_to_df(process_set3)
ps3_df
Out[18]:
processed_sample3
0 nmdc:procsm-11-01k85106
1 nmdc:procsm-11-0tkf2q02
2 nmdc:procsm-11-12hw2r66
3 nmdc:procsm-11-1kf9fn36
4 nmdc:procsm-11-1v407908
... ...
336 nmdc:procsm-11-x27qy119
337 nmdc:procsm-11-xbva4x23
338 nmdc:procsm-11-yqtwwk98
339 nmdc:procsm-11-za57ra10
340 nmdc:procsm-11-zqw3wv67

341 rows × 1 columns

7.5 Merge the third batch of processed samples with the merged data frame¶

The last batch of processed samples are merged with the previously merged data frame (output of step 6.5) using the lp_has_output field and the processed_sample3 field.

In [19]:
merged_df6 = dp_client.merge_df(merged_df5, ps3_df, "lp_has_output", "processed_sample3")
merged_df6
Out[19]:
lp_has_input lp_has_output lp_id extract_has_input extract_has_output extract_id pooling_has_input pooling_has_output pooling_id soil_horizon geo_loc_name biosample_id processed_sample1 processed_sample2 processed_sample3
0 nmdc:procsm-11-nay11727 nmdc:procsm-11-9pdkj890 nmdc:libprp-11-2tnjjj55 nmdc:procsm-11-8ec7zx31 nmdc:procsm-11-nay11727 nmdc:extrp-11-574dws05 nmdc:bsm-11-w43vsm21 nmdc:procsm-11-8ec7zx31 nmdc:poolp-11-0ak13p40 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-w43vsm21 nmdc:procsm-11-8ec7zx31 nmdc:procsm-11-nay11727 nmdc:procsm-11-9pdkj890
2 nmdc:procsm-11-nay11727 nmdc:procsm-11-9pdkj890 nmdc:libprp-11-2tnjjj55 nmdc:procsm-11-8ec7zx31 nmdc:procsm-11-nay11727 nmdc:extrp-11-574dws05 nmdc:bsm-11-dbavm335 nmdc:procsm-11-8ec7zx31 nmdc:poolp-11-0ak13p40 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-dbavm335 nmdc:procsm-11-8ec7zx31 nmdc:procsm-11-nay11727 nmdc:procsm-11-9pdkj890
4 nmdc:procsm-11-nay11727 nmdc:procsm-11-9pdkj890 nmdc:libprp-11-2tnjjj55 nmdc:procsm-11-8ec7zx31 nmdc:procsm-11-nay11727 nmdc:extrp-11-574dws05 nmdc:bsm-11-4c6er508 nmdc:procsm-11-8ec7zx31 nmdc:poolp-11-0ak13p40 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-4c6er508 nmdc:procsm-11-8ec7zx31 nmdc:procsm-11-nay11727 nmdc:procsm-11-9pdkj890
6 nmdc:procsm-11-kwaaah42 nmdc:procsm-11-kfkbxp22 nmdc:libprp-11-k5j44e20 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42 nmdc:extrp-11-fsv8td81 nmdc:bsm-11-5228zz06 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-5228zz06 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42 nmdc:procsm-11-kfkbxp22
7 nmdc:procsm-11-kwaaah42 nmdc:procsm-11-kfkbxp22 nmdc:libprp-11-k5j44e20 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42 nmdc:extrp-11-fsv8td81 nmdc:bsm-11-1frj0t76 nmdc:procsm-11-49bwy122 nmdc:poolp-11-a1nnyd94 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-1frj0t76 nmdc:procsm-11-49bwy122 nmdc:procsm-11-kwaaah42 nmdc:procsm-11-kfkbxp22
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
988 nmdc:procsm-11-64ksxw87 nmdc:procsm-11-bqe26091 nmdc:libprp-11-4ebzbm49 nmdc:procsm-11-f6kc8b10 nmdc:procsm-11-64ksxw87 nmdc:extrp-11-vg3vzm96 nmdc:bsm-11-xqtg8327 nmdc:procsm-11-f6kc8b10 nmdc:poolp-11-ykrp9878 M horizon USA: Colorado, North Sterling nmdc:bsm-11-xqtg8327 nmdc:procsm-11-f6kc8b10 nmdc:procsm-11-64ksxw87 nmdc:procsm-11-bqe26091
989 nmdc:procsm-11-64ksxw87 nmdc:procsm-11-bqe26091 nmdc:libprp-11-4ebzbm49 nmdc:procsm-11-f6kc8b10 nmdc:procsm-11-64ksxw87 nmdc:extrp-11-vg3vzm96 nmdc:bsm-11-z5cmyh06 nmdc:procsm-11-f6kc8b10 nmdc:poolp-11-ykrp9878 M horizon USA: Colorado, North Sterling nmdc:bsm-11-z5cmyh06 nmdc:procsm-11-f6kc8b10 nmdc:procsm-11-64ksxw87 nmdc:procsm-11-bqe26091
990 nmdc:procsm-11-s8m02r47 nmdc:procsm-11-68j9y310 nmdc:libprp-11-rz4mr176 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:extrp-11-k86nz804 nmdc:bsm-11-stjpwh75 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-stjpwh75 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:procsm-11-68j9y310
991 nmdc:procsm-11-s8m02r47 nmdc:procsm-11-68j9y310 nmdc:libprp-11-rz4mr176 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:extrp-11-k86nz804 nmdc:bsm-11-xngp2r34 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-xngp2r34 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:procsm-11-68j9y310
992 nmdc:procsm-11-s8m02r47 nmdc:procsm-11-68j9y310 nmdc:libprp-11-rz4mr176 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:extrp-11-k86nz804 nmdc:bsm-11-yehx2807 nmdc:procsm-11-dkr9k079 nmdc:poolp-11-57e94274 M horizon USA: Colorado, North Sterling nmdc:bsm-11-yehx2807 nmdc:procsm-11-dkr9k079 nmdc:procsm-11-s8m02r47 nmdc:procsm-11-68j9y310

505 rows × 15 columns

8 Get data_generation results from the processed sample identifiers¶

Using the third batch of processed sample identifiers, we create a DataGenerationSearch object to utilize the get_record_by_filter function. The filter is built to query on the has_input field. The id and has_input field names are changed to specify that they came from the DataGenerationSearch object.

In [20]:
from nmdc_api_utilities.data_generation_search import DataGenerationSearch
# create a DataGenerationSearch object
dg_client = DataGenerationSearch(env=ENV)
result_ids = get_id_list(process_set3, "processed_sample3")
chunked_list = split_list(result_ids)
data_generation_set = []
for chunk in chunked_list:
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:NucleotideSequencing", "has_input": {{"$in": {filter_list}}}}}'
    # get the results
    data_generation_set += dg_client.get_record_by_filter(filter=filter, fields="has_input,id", max_page_size=100, all_pages=True)

# clarify keys
for dg in data_generation_set:
    dg["dg_has_input"] = dg.pop("has_input")
    dg["dg_id"] = dg.pop("id")

# convert to data frame
dg_df = dp_client.convert_to_df(data_generation_set)
dg_df
Out[20]:
dg_has_input dg_id
0 [nmdc:procsm-11-01k85106] nmdc:dgns-11-wxbab669
1 [nmdc:procsm-11-0tkf2q02] nmdc:omprc-11-2mw7h339
2 [nmdc:procsm-11-12hw2r66] nmdc:dgns-11-8amfa663
3 [nmdc:procsm-11-1kf9fn36] nmdc:dgns-11-see99855
4 [nmdc:procsm-11-1v407908] nmdc:dgns-11-syr2vn62
... ... ...
321 [nmdc:procsm-11-x27qy119] nmdc:omprc-11-hv686d67
322 [nmdc:procsm-11-xbva4x23] nmdc:omprc-11-sz2d4412
323 [nmdc:procsm-11-yqtwwk98] nmdc:omprc-11-2g0n6985
324 [nmdc:procsm-11-za57ra10] nmdc:omprc-11-z00cqd70
325 [nmdc:procsm-11-zqw3wv67] nmdc:dgns-11-48ydpp30

326 rows × 2 columns

8.5 Merge the data_generation_set with the rest of the results¶

The results from querying data generation above are merged with the previously merged results (from step 7.5) using the dg_has_input field and the processed_sample3 field to match on.

In [21]:
merged_df7 = dp_client.merge_df(dg_df, merged_df6, "dg_has_input", "processed_sample3")
merged_df7
Out[21]:
dg_has_input dg_id lp_has_input lp_has_output lp_id extract_has_input extract_has_output extract_id pooling_has_input pooling_has_output pooling_id soil_horizon geo_loc_name biosample_id processed_sample1 processed_sample2 processed_sample3
0 nmdc:procsm-11-01k85106 nmdc:dgns-11-wxbab669 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106 nmdc:libprp-11-sqmba015 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:extrp-11-wewd5f59 nmdc:bsm-11-9v0epr64 nmdc:procsm-11-s9bpqf04 nmdc:poolp-11-4ssz6p14 O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-9v0epr64 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106
1 nmdc:procsm-11-01k85106 nmdc:dgns-11-wxbab669 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106 nmdc:libprp-11-sqmba015 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:extrp-11-wewd5f59 nmdc:bsm-11-26khva17 nmdc:procsm-11-s9bpqf04 nmdc:poolp-11-4ssz6p14 O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-26khva17 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106
2 nmdc:procsm-11-01k85106 nmdc:dgns-11-wxbab669 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106 nmdc:libprp-11-sqmba015 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:extrp-11-wewd5f59 nmdc:bsm-11-pf1j8598 nmdc:procsm-11-s9bpqf04 nmdc:poolp-11-4ssz6p14 O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-pf1j8598 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106
3 nmdc:procsm-11-0tkf2q02 nmdc:omprc-11-2mw7h339 nmdc:procsm-11-2z8s8m53 nmdc:procsm-11-0tkf2q02 nmdc:libprp-11-ypebxj92 nmdc:procsm-11-7ppgpt30 nmdc:procsm-11-2z8s8m53 nmdc:extrp-11-bspys917 nmdc:bsm-11-geecaz29 nmdc:procsm-11-7ppgpt30 nmdc:poolp-11-t7y1gd11 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-geecaz29 nmdc:procsm-11-7ppgpt30 nmdc:procsm-11-2z8s8m53 nmdc:procsm-11-0tkf2q02
4 nmdc:procsm-11-0tkf2q02 nmdc:omprc-11-2mw7h339 nmdc:procsm-11-2z8s8m53 nmdc:procsm-11-0tkf2q02 nmdc:libprp-11-ypebxj92 nmdc:procsm-11-7ppgpt30 nmdc:procsm-11-2z8s8m53 nmdc:extrp-11-bspys917 nmdc:bsm-11-bnf1p650 nmdc:procsm-11-7ppgpt30 nmdc:poolp-11-t7y1gd11 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-bnf1p650 nmdc:procsm-11-7ppgpt30 nmdc:procsm-11-2z8s8m53 nmdc:procsm-11-0tkf2q02
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
878 nmdc:procsm-11-h57z5224 nmdc:omprc-11-pv9a5n07 nmdc:procsm-11-wzg4jt58 nmdc:procsm-11-h57z5224 nmdc:libprp-11-c9v99696 nmdc:procsm-11-ez4jz447 nmdc:procsm-11-wzg4jt58 nmdc:extrp-11-3ksezw64 nmdc:bsm-11-sj3j0662 nmdc:procsm-11-ez4jz447 nmdc:poolp-11-casd3207 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-sj3j0662 nmdc:procsm-11-ez4jz447 nmdc:procsm-11-wzg4jt58 nmdc:procsm-11-h57z5224
879 nmdc:procsm-11-h57z5224 nmdc:omprc-11-pv9a5n07 nmdc:procsm-11-wzg4jt58 nmdc:procsm-11-h57z5224 nmdc:libprp-11-c9v99696 nmdc:procsm-11-ez4jz447 nmdc:procsm-11-wzg4jt58 nmdc:extrp-11-3ksezw64 nmdc:bsm-11-y2x26s57 nmdc:procsm-11-ez4jz447 nmdc:poolp-11-casd3207 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-y2x26s57 nmdc:procsm-11-ez4jz447 nmdc:procsm-11-wzg4jt58 nmdc:procsm-11-h57z5224
928 nmdc:procsm-11-w7vnsk07 nmdc:omprc-11-r9wnp831 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:libprp-11-8z0dcm53 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:extrp-11-6rjaph92 nmdc:bsm-11-td9fm715 nmdc:procsm-11-vv9mr730 nmdc:poolp-11-zgzjjj76 M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-td9fm715 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07
929 nmdc:procsm-11-w7vnsk07 nmdc:omprc-11-r9wnp831 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:libprp-11-8z0dcm53 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:extrp-11-6rjaph92 nmdc:bsm-11-tdrast31 nmdc:procsm-11-vv9mr730 nmdc:poolp-11-zgzjjj76 M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-tdrast31 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07
930 nmdc:procsm-11-w7vnsk07 nmdc:omprc-11-r9wnp831 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:libprp-11-8z0dcm53 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:extrp-11-6rjaph92 nmdc:bsm-11-zaccf569 nmdc:procsm-11-vv9mr730 nmdc:poolp-11-zgzjjj76 M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zaccf569 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07

490 rows × 17 columns

9 Get the metagenome_annotation_set using the data generation identifiers¶

We create a WorkflowExecutionSearch object to query workflow_execution_set. We create a filter using the identifiers obtained from the data generation to match with the was_informed_by field and setting the type field to nmdc:MetagenomeAnnotation. Field names are clarified, once again to specify the collection they came from.

In [22]:
from nmdc_api_utilities.workflow_execution_search import WorkflowExecutionSearch
# create a WorkflowExecutionSearch object
we_client = WorkflowExecutionSearch(env=ENV)
result_ids = get_id_list(data_generation_set, "dg_id")
chunked_list = split_list(result_ids)
meta_act_ann_set = []
for chunk in chunked_list:
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:MetagenomeAnnotation", "was_informed_by": {{"$in": {filter_list}}}}}'
    # get the results
    meta_act_ann_set += we_client.get_record_by_filter(filter=filter, fields="has_output,was_informed_by,id,version", max_page_size=100, all_pages=True)

# clarify names
for mga in meta_act_ann_set:
    mga["mga_id"] = mga.pop("id")
    mga["mga_version"] = mga.pop("version")
    mga["mga_was_informed_by"] = mga.pop("was_informed_by")
    mga["mga_has_output"] = mga.pop("has_output")

# convert to data frame
mga_df = dp_client.convert_to_df(meta_act_ann_set)
mga_df
Out[22]:
mga_id mga_version mga_was_informed_by mga_has_output
0 nmdc:wfmgan-11-05cdqw41.1 v1.1.5 [nmdc:dgns-11-ekte1238] [nmdc:dobj-11-ndsyd761, nmdc:dobj-11-ss1k0e30,...
1 nmdc:wfmgan-11-0nwd1388.1 v1.0.4 [nmdc:omprc-11-2937gz63] [nmdc:dobj-11-vpaxc956, nmdc:dobj-11-ad42v813,...
2 nmdc:wfmgan-11-0nwd1388.2 v1.0.5 [nmdc:omprc-11-2937gz63] [nmdc:dobj-11-tdjwam92, nmdc:dobj-11-d55djd72,...
3 nmdc:wfmgan-11-0r142238.1 v1.0.4 [nmdc:omprc-11-th7v6711] [nmdc:dobj-11-yfvgh831, nmdc:dobj-11-0ykk7811,...
4 nmdc:wfmgan-11-14gcar54.1 v1.0.4 [nmdc:omprc-11-px5df021] [nmdc:dobj-11-dy2jsc18, nmdc:dobj-11-3amwd664,...
... ... ... ... ...
345 nmdc:wfmgan-11-aqymgy87.1 v1.0.4 [nmdc:omprc-11-t0espr14] [nmdc:dobj-11-nr50z869, nmdc:dobj-11-6g3ka432,...
346 nmdc:wfmgan-11-aqymgy87.2 v1.1.0 [nmdc:omprc-11-t0espr14] [nmdc:dobj-11-6284dm29, nmdc:dobj-11-dhrq5t83,...
347 nmdc:wfmgan-11-jddrcn33.1 v1.0.4 [nmdc:omprc-11-t5v1jk63] [nmdc:dobj-11-jpjc0875, nmdc:dobj-11-9qphny43,...
348 nmdc:wfmgan-11-0r142238.1 v1.0.4 [nmdc:omprc-11-th7v6711] [nmdc:dobj-11-yfvgh831, nmdc:dobj-11-0ykk7811,...
349 nmdc:wfmgan-11-gyg96p67.1 v1.0.4 [nmdc:omprc-11-z00cqd70] [nmdc:dobj-11-xrb37t71, nmdc:dobj-11-wfdvxf91,...

350 rows × 4 columns

9.5 Merge metagenome activity results with the previously merged results¶

The metagenome activity results obtained above are merged with the previously combined results (from step 8.5), matching on the dg_id and mga_was_informed_by fields.

In [23]:
merged_df8 = dp_client.merge_df(merged_df7, mga_df,  "dg_id", "mga_was_informed_by")
merged_df8
Out[23]:
dg_has_input dg_id lp_has_input lp_has_output lp_id extract_has_input extract_has_output extract_id pooling_has_input pooling_has_output ... soil_horizon geo_loc_name biosample_id processed_sample1 processed_sample2 processed_sample3 mga_id mga_version mga_was_informed_by mga_has_output
0 nmdc:procsm-11-01k85106 nmdc:dgns-11-wxbab669 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106 nmdc:libprp-11-sqmba015 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:extrp-11-wewd5f59 nmdc:bsm-11-9v0epr64 nmdc:procsm-11-s9bpqf04 ... O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-9v0epr64 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106 nmdc:wfmgan-11-ne5fap84.1 v1.1.5 nmdc:dgns-11-wxbab669 nmdc:dobj-11-5p1xav61
1 nmdc:procsm-11-01k85106 nmdc:dgns-11-wxbab669 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106 nmdc:libprp-11-sqmba015 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:extrp-11-wewd5f59 nmdc:bsm-11-9v0epr64 nmdc:procsm-11-s9bpqf04 ... O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-9v0epr64 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106 nmdc:wfmgan-11-ne5fap84.1 v1.1.5 nmdc:dgns-11-wxbab669 nmdc:dobj-11-kcg9tc71
2 nmdc:procsm-11-01k85106 nmdc:dgns-11-wxbab669 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106 nmdc:libprp-11-sqmba015 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:extrp-11-wewd5f59 nmdc:bsm-11-9v0epr64 nmdc:procsm-11-s9bpqf04 ... O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-9v0epr64 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106 nmdc:wfmgan-11-ne5fap84.1 v1.1.5 nmdc:dgns-11-wxbab669 nmdc:dobj-11-xtqnaj25
3 nmdc:procsm-11-01k85106 nmdc:dgns-11-wxbab669 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106 nmdc:libprp-11-sqmba015 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:extrp-11-wewd5f59 nmdc:bsm-11-9v0epr64 nmdc:procsm-11-s9bpqf04 ... O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-9v0epr64 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106 nmdc:wfmgan-11-ne5fap84.1 v1.1.5 nmdc:dgns-11-wxbab669 nmdc:dobj-11-v8k16016
4 nmdc:procsm-11-01k85106 nmdc:dgns-11-wxbab669 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106 nmdc:libprp-11-sqmba015 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:extrp-11-wewd5f59 nmdc:bsm-11-9v0epr64 nmdc:procsm-11-s9bpqf04 ... O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-9v0epr64 nmdc:procsm-11-s9bpqf04 nmdc:procsm-11-m6gcps44 nmdc:procsm-11-01k85106 nmdc:wfmgan-11-ne5fap84.1 v1.1.5 nmdc:dgns-11-wxbab669 nmdc:dobj-11-m2v1va20
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
24153 nmdc:procsm-11-w7vnsk07 nmdc:omprc-11-r9wnp831 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:libprp-11-8z0dcm53 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:extrp-11-6rjaph92 nmdc:bsm-11-zaccf569 nmdc:procsm-11-vv9mr730 ... M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zaccf569 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:wfmgan-11-e3s88g45.1 v1.0.4 nmdc:omprc-11-r9wnp831 nmdc:dobj-11-mhp3k924
24154 nmdc:procsm-11-w7vnsk07 nmdc:omprc-11-r9wnp831 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:libprp-11-8z0dcm53 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:extrp-11-6rjaph92 nmdc:bsm-11-zaccf569 nmdc:procsm-11-vv9mr730 ... M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zaccf569 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:wfmgan-11-e3s88g45.1 v1.0.4 nmdc:omprc-11-r9wnp831 nmdc:dobj-11-retvmq75
24155 nmdc:procsm-11-w7vnsk07 nmdc:omprc-11-r9wnp831 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:libprp-11-8z0dcm53 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:extrp-11-6rjaph92 nmdc:bsm-11-zaccf569 nmdc:procsm-11-vv9mr730 ... M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zaccf569 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:wfmgan-11-e3s88g45.1 v1.0.4 nmdc:omprc-11-r9wnp831 nmdc:dobj-11-aw5x5j27
24156 nmdc:procsm-11-w7vnsk07 nmdc:omprc-11-r9wnp831 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:libprp-11-8z0dcm53 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:extrp-11-6rjaph92 nmdc:bsm-11-zaccf569 nmdc:procsm-11-vv9mr730 ... M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zaccf569 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:wfmgan-11-e3s88g45.1 v1.0.4 nmdc:omprc-11-r9wnp831 nmdc:dobj-11-1fq0xm75
24157 nmdc:procsm-11-w7vnsk07 nmdc:omprc-11-r9wnp831 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:libprp-11-8z0dcm53 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:extrp-11-6rjaph92 nmdc:bsm-11-zaccf569 nmdc:procsm-11-vv9mr730 ... M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zaccf569 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:wfmgan-11-e3s88g45.1 v1.0.4 nmdc:omprc-11-r9wnp831 nmdc:dobj-11-6fb6r674

13009 rows × 21 columns

10 Get data objects from the metagenome activity result outputs¶

We create a DataObjectSearch object to utilize get_record_by_filter. We create a filter to query the mga_has_output identifiers to match the id field in the data objects. Since this is the final query, the filter parameter is slightly different than the rest of the queries. We specify that that we need to retrieve all results where the data_object_type has a value of Scaffold Lineage tsv (since this has contig taxonomy results) given the list of identifiers. Note that the url is a new field returned that contains the tsvs we will need for the final analysis.

In [24]:
from nmdc_api_utilities.data_object_search import DataObjectSearch
# create a DataObjectSearch object
do_client = DataObjectSearch(env=ENV)
result_ids = get_id_list(meta_act_ann_set, "mga_has_output")
chunked_list = split_list(result_ids)
data_ob_set = []
for chunk in chunked_list:
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:DataObject", "data_object_type": "Scaffold Lineage tsv", "id": {{"$in": {filter_list}}}}}'
    # get the results
    data_ob_set += do_client.get_record_by_filter(filter=filter, fields="id,data_object_type,url", max_page_size=100, all_pages=True)

# clarify fields
for ob in data_ob_set:
    ob["data_ob_id"] = ob.pop("id")

# convert to data frame
do_df = dp_client.convert_to_df(data_ob_set)
do_df
Out[24]:
data_object_type url data_ob_id
0 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-1apwza69
1 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-8sttbc64
2 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-g7xsfb88
3 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-q3de6z81
4 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... nmdc:dobj-11-ykw2tv02
... ... ... ...
345 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-g87j5y46
346 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-nzmgqh66
347 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-ven5zv88
348 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-1apwza69
349 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-8mc4sk45

350 rows × 3 columns

10.5 Merge one last time¶

For the final merge, we merge the data object results obtained above with the rest of our combined results, matching the data_ob_id key with the mga_has_output key.

In [25]:
merged_df9 = dp_client.merge_df(do_df, merged_df8, "data_ob_id", "mga_has_output")
merged_df9
Out[25]:
data_object_type url data_ob_id dg_has_input dg_id lp_has_input lp_has_output lp_id extract_has_input extract_has_output ... soil_horizon geo_loc_name biosample_id processed_sample1 processed_sample2 processed_sample3 mga_id mga_version mga_was_informed_by mga_has_output
0 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-1apwza69 nmdc:procsm-11-ngwm7252 nmdc:omprc-11-th7v6711 nmdc:procsm-11-kjvc3e42 nmdc:procsm-11-ngwm7252 nmdc:libprp-11-nyvvd758 nmdc:procsm-11-g4jv1f71 nmdc:procsm-11-kjvc3e42 ... M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-k0qje589 nmdc:procsm-11-g4jv1f71 nmdc:procsm-11-kjvc3e42 nmdc:procsm-11-ngwm7252 nmdc:wfmgan-11-0r142238.1 v1.0.4 nmdc:omprc-11-th7v6711 nmdc:dobj-11-1apwza69
1 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-1apwza69 nmdc:procsm-11-ngwm7252 nmdc:omprc-11-th7v6711 nmdc:procsm-11-kjvc3e42 nmdc:procsm-11-ngwm7252 nmdc:libprp-11-nyvvd758 nmdc:procsm-11-g4jv1f71 nmdc:procsm-11-kjvc3e42 ... M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-7fkbrp42 nmdc:procsm-11-g4jv1f71 nmdc:procsm-11-kjvc3e42 nmdc:procsm-11-ngwm7252 nmdc:wfmgan-11-0r142238.1 v1.0.4 nmdc:omprc-11-th7v6711 nmdc:dobj-11-1apwza69
2 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-1apwza69 nmdc:procsm-11-ngwm7252 nmdc:omprc-11-th7v6711 nmdc:procsm-11-kjvc3e42 nmdc:procsm-11-ngwm7252 nmdc:libprp-11-nyvvd758 nmdc:procsm-11-g4jv1f71 nmdc:procsm-11-kjvc3e42 ... M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-wpzp9996 nmdc:procsm-11-g4jv1f71 nmdc:procsm-11-kjvc3e42 nmdc:procsm-11-ngwm7252 nmdc:wfmgan-11-0r142238.1 v1.0.4 nmdc:omprc-11-th7v6711 nmdc:dobj-11-1apwza69
3 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-8sttbc64 nmdc:procsm-11-3s5m9a70 nmdc:omprc-11-2937gz63 nmdc:procsm-11-yz8wab55 nmdc:procsm-11-3s5m9a70 nmdc:libprp-11-a6yw0y51 nmdc:procsm-11-w5zzjm84 nmdc:procsm-11-yz8wab55 ... M horizon USA: Colorado, North Sterling nmdc:bsm-11-m6r77j31 nmdc:procsm-11-w5zzjm84 nmdc:procsm-11-yz8wab55 nmdc:procsm-11-3s5m9a70 nmdc:wfmgan-11-0nwd1388.1 v1.0.4 nmdc:omprc-11-2937gz63 nmdc:dobj-11-8sttbc64
4 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-8sttbc64 nmdc:procsm-11-3s5m9a70 nmdc:omprc-11-2937gz63 nmdc:procsm-11-yz8wab55 nmdc:procsm-11-3s5m9a70 nmdc:libprp-11-a6yw0y51 nmdc:procsm-11-w5zzjm84 nmdc:procsm-11-yz8wab55 ... M horizon USA: Colorado, North Sterling nmdc:bsm-11-1hkmx038 nmdc:procsm-11-w5zzjm84 nmdc:procsm-11-yz8wab55 nmdc:procsm-11-3s5m9a70 nmdc:wfmgan-11-0nwd1388.1 v1.0.4 nmdc:omprc-11-2937gz63 nmdc:dobj-11-8sttbc64
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
986 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-a1v78a54 nmdc:procsm-11-h57z5224 nmdc:omprc-11-pv9a5n07 nmdc:procsm-11-wzg4jt58 nmdc:procsm-11-h57z5224 nmdc:libprp-11-c9v99696 nmdc:procsm-11-ez4jz447 nmdc:procsm-11-wzg4jt58 ... M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-sj3j0662 nmdc:procsm-11-ez4jz447 nmdc:procsm-11-wzg4jt58 nmdc:procsm-11-h57z5224 nmdc:wfmgan-11-vqpdns38.1 v1.1.5 nmdc:omprc-11-pv9a5n07 nmdc:dobj-11-a1v78a54
987 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-a1v78a54 nmdc:procsm-11-h57z5224 nmdc:omprc-11-pv9a5n07 nmdc:procsm-11-wzg4jt58 nmdc:procsm-11-h57z5224 nmdc:libprp-11-c9v99696 nmdc:procsm-11-ez4jz447 nmdc:procsm-11-wzg4jt58 ... M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-y2x26s57 nmdc:procsm-11-ez4jz447 nmdc:procsm-11-wzg4jt58 nmdc:procsm-11-h57z5224 nmdc:wfmgan-11-vqpdns38.1 v1.1.5 nmdc:omprc-11-pv9a5n07 nmdc:dobj-11-a1v78a54
991 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-vvxc2g29 nmdc:procsm-11-w7vnsk07 nmdc:omprc-11-r9wnp831 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:libprp-11-8z0dcm53 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 ... M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-td9fm715 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:wfmgan-11-e3s88g45.1 v1.0.4 nmdc:omprc-11-r9wnp831 nmdc:dobj-11-vvxc2g29
992 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-vvxc2g29 nmdc:procsm-11-w7vnsk07 nmdc:omprc-11-r9wnp831 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:libprp-11-8z0dcm53 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 ... M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-tdrast31 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:wfmgan-11-e3s88g45.1 v1.0.4 nmdc:omprc-11-r9wnp831 nmdc:dobj-11-vvxc2g29
993 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-vvxc2g29 nmdc:procsm-11-w7vnsk07 nmdc:omprc-11-r9wnp831 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:libprp-11-8z0dcm53 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 ... M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zaccf569 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:wfmgan-11-e3s88g45.1 v1.0.4 nmdc:omprc-11-r9wnp831 nmdc:dobj-11-vvxc2g29

545 rows × 24 columns

Clean up the combined results¶

Select a workflow version of these results to use in this analysis

In [26]:
versioned_df10 = merged_df9[merged_df9['mga_version'].str.contains('v1.0.4',na=False)]
versioned_df10
Out[26]:
data_object_type url data_ob_id dg_has_input dg_id lp_has_input lp_has_output lp_id extract_has_input extract_has_output ... soil_horizon geo_loc_name biosample_id processed_sample1 processed_sample2 processed_sample3 mga_id mga_version mga_was_informed_by mga_has_output
0 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-1apwza69 nmdc:procsm-11-ngwm7252 nmdc:omprc-11-th7v6711 nmdc:procsm-11-kjvc3e42 nmdc:procsm-11-ngwm7252 nmdc:libprp-11-nyvvd758 nmdc:procsm-11-g4jv1f71 nmdc:procsm-11-kjvc3e42 ... M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-k0qje589 nmdc:procsm-11-g4jv1f71 nmdc:procsm-11-kjvc3e42 nmdc:procsm-11-ngwm7252 nmdc:wfmgan-11-0r142238.1 v1.0.4 nmdc:omprc-11-th7v6711 nmdc:dobj-11-1apwza69
1 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-1apwza69 nmdc:procsm-11-ngwm7252 nmdc:omprc-11-th7v6711 nmdc:procsm-11-kjvc3e42 nmdc:procsm-11-ngwm7252 nmdc:libprp-11-nyvvd758 nmdc:procsm-11-g4jv1f71 nmdc:procsm-11-kjvc3e42 ... M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-7fkbrp42 nmdc:procsm-11-g4jv1f71 nmdc:procsm-11-kjvc3e42 nmdc:procsm-11-ngwm7252 nmdc:wfmgan-11-0r142238.1 v1.0.4 nmdc:omprc-11-th7v6711 nmdc:dobj-11-1apwza69
2 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-1apwza69 nmdc:procsm-11-ngwm7252 nmdc:omprc-11-th7v6711 nmdc:procsm-11-kjvc3e42 nmdc:procsm-11-ngwm7252 nmdc:libprp-11-nyvvd758 nmdc:procsm-11-g4jv1f71 nmdc:procsm-11-kjvc3e42 ... M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-wpzp9996 nmdc:procsm-11-g4jv1f71 nmdc:procsm-11-kjvc3e42 nmdc:procsm-11-ngwm7252 nmdc:wfmgan-11-0r142238.1 v1.0.4 nmdc:omprc-11-th7v6711 nmdc:dobj-11-1apwza69
3 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-8sttbc64 nmdc:procsm-11-3s5m9a70 nmdc:omprc-11-2937gz63 nmdc:procsm-11-yz8wab55 nmdc:procsm-11-3s5m9a70 nmdc:libprp-11-a6yw0y51 nmdc:procsm-11-w5zzjm84 nmdc:procsm-11-yz8wab55 ... M horizon USA: Colorado, North Sterling nmdc:bsm-11-m6r77j31 nmdc:procsm-11-w5zzjm84 nmdc:procsm-11-yz8wab55 nmdc:procsm-11-3s5m9a70 nmdc:wfmgan-11-0nwd1388.1 v1.0.4 nmdc:omprc-11-2937gz63 nmdc:dobj-11-8sttbc64
4 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-8sttbc64 nmdc:procsm-11-3s5m9a70 nmdc:omprc-11-2937gz63 nmdc:procsm-11-yz8wab55 nmdc:procsm-11-3s5m9a70 nmdc:libprp-11-a6yw0y51 nmdc:procsm-11-w5zzjm84 nmdc:procsm-11-yz8wab55 ... M horizon USA: Colorado, North Sterling nmdc:bsm-11-1hkmx038 nmdc:procsm-11-w5zzjm84 nmdc:procsm-11-yz8wab55 nmdc:procsm-11-3s5m9a70 nmdc:wfmgan-11-0nwd1388.1 v1.0.4 nmdc:omprc-11-2937gz63 nmdc:dobj-11-8sttbc64
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
908 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-wkznj445 nmdc:procsm-11-yqtwwk98 nmdc:omprc-11-2g0n6985 nmdc:procsm-11-ze0gdq03 nmdc:procsm-11-yqtwwk98 nmdc:libprp-11-2mjwz291 nmdc:procsm-11-86beb994 nmdc:procsm-11-ze0gdq03 ... O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-tpzf1a43 nmdc:procsm-11-86beb994 nmdc:procsm-11-ze0gdq03 nmdc:procsm-11-yqtwwk98 nmdc:wfmgan-11-wpvhfk84.1 v1.0.4 nmdc:omprc-11-2g0n6985 nmdc:dobj-11-wkznj445
909 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-wkznj445 nmdc:procsm-11-yqtwwk98 nmdc:omprc-11-2g0n6985 nmdc:procsm-11-ze0gdq03 nmdc:procsm-11-yqtwwk98 nmdc:libprp-11-2mjwz291 nmdc:procsm-11-86beb994 nmdc:procsm-11-ze0gdq03 ... O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-njy52033 nmdc:procsm-11-86beb994 nmdc:procsm-11-ze0gdq03 nmdc:procsm-11-yqtwwk98 nmdc:wfmgan-11-wpvhfk84.1 v1.0.4 nmdc:omprc-11-2g0n6985 nmdc:dobj-11-wkznj445
991 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-vvxc2g29 nmdc:procsm-11-w7vnsk07 nmdc:omprc-11-r9wnp831 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:libprp-11-8z0dcm53 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 ... M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-td9fm715 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:wfmgan-11-e3s88g45.1 v1.0.4 nmdc:omprc-11-r9wnp831 nmdc:dobj-11-vvxc2g29
992 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-vvxc2g29 nmdc:procsm-11-w7vnsk07 nmdc:omprc-11-r9wnp831 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:libprp-11-8z0dcm53 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 ... M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-tdrast31 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:wfmgan-11-e3s88g45.1 v1.0.4 nmdc:omprc-11-r9wnp831 nmdc:dobj-11-vvxc2g29
993 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:dobj-11-vvxc2g29 nmdc:procsm-11-w7vnsk07 nmdc:omprc-11-r9wnp831 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:libprp-11-8z0dcm53 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 ... M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zaccf569 nmdc:procsm-11-vv9mr730 nmdc:procsm-11-vvgh6z28 nmdc:procsm-11-w7vnsk07 nmdc:wfmgan-11-e3s88g45.1 v1.0.4 nmdc:omprc-11-r9wnp831 nmdc:dobj-11-vvxc2g29

241 rows × 24 columns

In the final step of retrieving and cleaning the data, we clean up the final merged data frame by removing all of the "joining columns" that are not needed in our final analysis. This included most of the identifier columns including biosample_id to avoid redundant downloads when multiple biosamples fed to the same processed result. The only columns we retain are the soil_horizon, geo_loc_name, data_ob_id, and the url to the tsv. The final_df is displayed.

In [27]:
column_list = versioned_df10.columns.tolist()
columns_to_keep = ["soil_horizon", "url", "geo_loc_name", "data_ob_id"]
columns_to_remove = list(set(column_list).difference(columns_to_keep))
# Drop unnecessary rows
df10_cleaned = versioned_df10.drop(columns=columns_to_remove)

# remove duplicates
df10_cleaned.drop_duplicates(keep="first", inplace=True)

# check rows when we reagrregate/implode, 
final_df = df10_cleaned.groupby(["soil_horizon", "geo_loc_name", "data_ob_id"]).agg({"url": list}).reset_index()

final_df
Out[27]:
soil_horizon geo_loc_name data_ob_id url
0 M horizon USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-1apwza69 [https://data.microbiomedata.org/data/nmdc:omp...
1 M horizon USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-1nkd9110 [https://data.microbiomedata.org/data/nmdc:omp...
2 M horizon USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-25ndys70 [https://data.microbiomedata.org/data/nmdc:omp...
3 M horizon USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-5apjp861 [https://data.microbiomedata.org/data/nmdc:omp...
4 M horizon USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-63rwtd28 [https://data.microbiomedata.org/data/nmdc:omp...
... ... ... ... ...
79 O horizon USA: Colorado, Niwot Ridge nmdc:dobj-11-wkznj445 [https://data.microbiomedata.org/data/nmdc:omp...
80 O horizon USA: Colorado, Rocky Mountains nmdc:dobj-11-jp45gr33 [https://data.microbiomedata.org/data/nmdc:omp...
81 O horizon USA: Colorado, Rocky Mountains nmdc:dobj-11-n7hhax28 [https://data.microbiomedata.org/data/nmdc:omp...
82 O horizon USA: Colorado, Rocky Mountains nmdc:dobj-11-s7dphe48 [https://data.microbiomedata.org/data/nmdc:omp...
83 O horizon USA: Colorado, Rocky Mountains nmdc:dobj-11-z73a1f14 [https://data.microbiomedata.org/data/nmdc:omp...

84 rows × 4 columns

Change the url column from a list to a string¶

In order to open the tsv urls, the structure of the url column will need to be changed from a list to a string in order to properly open the tsvs.

In [28]:
final_df["url"] = final_df["url"].apply(lambda x: ', '.join(map(str, x)))
final_df
Out[28]:
soil_horizon geo_loc_name data_ob_id url
0 M horizon USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-1apwza69 https://data.microbiomedata.org/data/nmdc:ompr...
1 M horizon USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-1nkd9110 https://data.microbiomedata.org/data/nmdc:ompr...
2 M horizon USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-25ndys70 https://data.microbiomedata.org/data/nmdc:ompr...
3 M horizon USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-5apjp861 https://data.microbiomedata.org/data/nmdc:ompr...
4 M horizon USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-63rwtd28 https://data.microbiomedata.org/data/nmdc:ompr...
... ... ... ... ...
79 O horizon USA: Colorado, Niwot Ridge nmdc:dobj-11-wkznj445 https://data.microbiomedata.org/data/nmdc:ompr...
80 O horizon USA: Colorado, Rocky Mountains nmdc:dobj-11-jp45gr33 https://data.microbiomedata.org/data/nmdc:ompr...
81 O horizon USA: Colorado, Rocky Mountains nmdc:dobj-11-n7hhax28 https://data.microbiomedata.org/data/nmdc:ompr...
82 O horizon USA: Colorado, Rocky Mountains nmdc:dobj-11-s7dphe48 https://data.microbiomedata.org/data/nmdc:ompr...
83 O horizon USA: Colorado, Rocky Mountains nmdc:dobj-11-z73a1f14 https://data.microbiomedata.org/data/nmdc:ompr...

84 rows × 4 columns

Show how many results have M horizon vs. O horizon¶

The soil_horizon column can be counted using the value_counts() functionality. There are many more M horizon samples than O horizon.

In [29]:
# Show unique soil horizons:
soil_horizons = final_df['soil_horizon'].value_counts()
print(soil_horizons)
soil_horizon
M horizon    68
O horizon    16
Name: count, dtype: int64

Randomly select a subset of these datasets for which to pull information¶

In [30]:
# randomly select 15 data sets in each horizon
n = 15

#list the different types
list_type=soil_horizons.index.tolist()

#for each type, randomly horizon n data sets and save them into list
random_subset=[]
for type in list_type:
    #each data object ID and horizon type
    sample_type=final_df[['data_ob_id','soil_horizon']].drop_duplicates()
    #filter to current horizon type
    sample_type=sample_type[sample_type['soil_horizon']==type]
    #randomly horizon n data object IDs in current horizon type
    sample_type=sample_type.sample(n=n, random_state=2)
    #save
    random_subset.append(sample_type)

#resave list as dataframe
random_subset=pd.concat(random_subset).reset_index(drop=True)

#remerge rest of the data for the sampled data sets
final_df=random_subset.merge(final_df,on=['data_ob_id','soil_horizon'],how="left")

final_df
Out[30]:
data_ob_id soil_horizon geo_loc_name url
0 nmdc:dobj-11-zk896n23 M horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
1 nmdc:dobj-11-c1dvnq21 M horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
2 nmdc:dobj-11-8mc4sk45 M horizon USA: Colorado, North Sterling https://data.microbiomedata.org/data/nmdc:ompr...
3 nmdc:dobj-11-k1vrrb83 M horizon USA: Colorado, North Sterling https://data.microbiomedata.org/data/nmdc:ompr...
4 nmdc:dobj-11-q96s7s63 M horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
5 nmdc:dobj-11-ngtdmd88 M horizon USA: Colorado, Rocky Mountains https://data.microbiomedata.org/data/nmdc:ompr...
6 nmdc:dobj-11-r371w335 M horizon USA: Colorado, North Sterling https://data.microbiomedata.org/data/nmdc:ompr...
7 nmdc:dobj-11-1apwza69 M horizon USA: Colorado, Central Plains Experimental Range https://data.microbiomedata.org/data/nmdc:ompr...
8 nmdc:dobj-11-yebxx995 M horizon USA: Colorado, North Sterling https://data.microbiomedata.org/data/nmdc:ompr...
9 nmdc:dobj-11-605rmv44 M horizon USA: Colorado, North Sterling https://data.microbiomedata.org/data/nmdc:ompr...
10 nmdc:dobj-11-f1pg9z40 M horizon USA: Colorado, Central Plains Experimental Range https://data.microbiomedata.org/data/nmdc:ompr...
11 nmdc:dobj-11-nzmgqh66 M horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
12 nmdc:dobj-11-xptdf353 M horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
13 nmdc:dobj-11-1nkd9110 M horizon USA: Colorado, Central Plains Experimental Range https://data.microbiomedata.org/data/nmdc:ompr...
14 nmdc:dobj-11-1mwrks28 M horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
15 nmdc:dobj-11-jp45gr33 O horizon USA: Colorado, Rocky Mountains https://data.microbiomedata.org/data/nmdc:ompr...
16 nmdc:dobj-11-ht5msd46 O horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
17 nmdc:dobj-11-nem7e417 O horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
18 nmdc:dobj-11-8ybd1f87 O horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
19 nmdc:dobj-11-v1d0fe44 O horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
20 nmdc:dobj-11-gargwe62 O horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
21 nmdc:dobj-11-bxdpkq28 O horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
22 nmdc:dobj-11-ven5zv88 O horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
23 nmdc:dobj-11-qzb4kt50 O horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
24 nmdc:dobj-11-s7dphe48 O horizon USA: Colorado, Rocky Mountains https://data.microbiomedata.org/data/nmdc:ompr...
25 nmdc:dobj-11-g97pjb32 O horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
26 nmdc:dobj-11-wkznj445 O horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
27 nmdc:dobj-11-q3de6z81 O horizon USA: Colorado, Niwot Ridge https://data.microbiomedata.org/data/nmdc:ompr...
28 nmdc:dobj-11-z73a1f14 O horizon USA: Colorado, Rocky Mountains https://data.microbiomedata.org/data/nmdc:ompr...
29 nmdc:dobj-11-n7hhax28 O horizon USA: Colorado, Rocky Mountains https://data.microbiomedata.org/data/nmdc:ompr...

Example of what the TSV contig taxa file looks like¶

A snippet of the TSV file we need to iterate over to get the taxa abundance for the contigs is shown below. The third column is the initial count for the taxa, where each row is 1.0. However, there are duplicate rows of taxa, meaning there are actually more than 1.0 for several taxa (though they appear as duplicate rows with 1.0 as the count`). We will take this into consideration when we calculate the relative abundance for each taxa.

In [31]:
tsv_ex_url = final_df.at[0, "url"]

response = requests.get(tsv_ex_url)
tsv_data = StringIO(response.text)

tsv_ex_df = pd.read_csv(tsv_data, delimiter="\t")
tsv_data.close()

# Give columns names
tsv_ex_df.columns = ["contig_id", "taxa", "initial_count"]

# sort by taxa
tsv_sorted = tsv_ex_df.sort_values(by="taxa")

# print first 10 rows
tsv_sorted[:10]
Out[31]:
contig_id taxa initial_count
2445 nmdc:wfmgas-11-zkxntv88.1_scf_12576_c1 Archaea;Candidatus Thermoplasmatota;Thermoplas... 1.0
10714 nmdc:wfmgas-11-zkxntv88.1_scf_21632_c1 Archaea;Euryarchaeota;Halobacteria;Halobacteri... 1.0
12319 nmdc:wfmgas-11-zkxntv88.1_scf_23473_c1 Archaea;Euryarchaeota;Halobacteria;Halobacteri... 1.0
5857 nmdc:wfmgas-11-zkxntv88.1_scf_16241_c1 Archaea;Euryarchaeota;Halobacteria;Halobacteri... 1.0
4703 nmdc:wfmgas-11-zkxntv88.1_scf_14997_c1 Archaea;Euryarchaeota;Halobacteria;Haloferacal... 1.0
4311 nmdc:wfmgas-11-zkxntv88.1_scf_14578_c1 Archaea;Euryarchaeota;Halobacteria;Haloferacal... 1.0
1698 nmdc:wfmgas-11-zkxntv88.1_scf_11797_c1 Archaea;Euryarchaeota;Halobacteria;Haloferacal... 1.0
825 nmdc:wfmgas-11-zkxntv88.1_scf_10891_c1 Archaea;Euryarchaeota;Halobacteria;Haloferacal... 1.0
12800 nmdc:wfmgas-11-zkxntv88.1_scf_24029_c1 Archaea;Euryarchaeota;Methanobacteria;Methanob... 1.0
8056 nmdc:wfmgas-11-zkxntv88.1_scf_18678_c1 Archaea;Euryarchaeota;Methanobacteria;Methanob... 1.0

Iterate throught the TSVs to get the contig taxa information¶

Using the Python requests library and the StringIO library, the TSV urls can be iterated over gathering the taxa information. The TSVs are converted into dataframes where they are manipulated to suit the data structure needed. The columns are given names and the taxa column is split into a proper list (instead of a string of items separated by a semicolon ;). The third element from the list of taxa is retrieved to get only the phylum level information of the taxa. A grouping function is performed on the taxa column and the Pandas size() functionality is used to calculate the count for how many times each taxa occurs, which is then used to calculate the relative abundance of each taxa for each biosample. After iterating through all of the TSVs, two final taxa dfs are created by concatenating the list of data frames (o_df and m_df).

Any errors in requesting the TSV urls are collected as a dictionary, so we can either try to query them again, or look into why they were not able to be collected.

Note this takes several hours to complete.

In [32]:
o_horizon = []
m_horizon = []
errors = []

iteration_counter = 0


for index, row in final_df.iterrows():
    
    iteration_counter += 1

    # print an update for every 50 iterations
    if iteration_counter % 50 == 0:
        print(f"Processed {iteration_counter} rows")

    url = row["url"]
    horizon = row["soil_horizon"]
    dataobj = row["data_ob_id"]
    geo_loc = row["geo_loc_name"]
    data_ob_id = row["data_ob_id"]

    try:
        response = requests.get(url)
        tsv_data = StringIO(response.text)
    
        tsv_df = pd.read_csv(tsv_data, delimiter="\t")
        tsv_data.close()
    
        # Give columns names
        tsv_df.columns = ["contig_id", "taxa", "initial_count"]
    
        # split taxa column into a list where a semicolon (;) is the delimeter
        tsv_df["taxa"] = tsv_df["taxa"].str.split(";")

        # Get only the third element of the list of taxa (the phylum), add "Unknown" it it does not include phylum level, and add
        # "Unkown" if the taxa value is empty.
        tsv_df["taxa"] = tsv_df["taxa"].apply(lambda x: str(x[2]) if isinstance(x, list) and len(x) >= 3 
                                              else str(" ".join(x) + " Unknown") if isinstance(x, list) else "Unknown")


        # Get relative abundance for the tsv_df
        tsv_df = tsv_df.groupby("taxa").size().reset_index(name="count")
        total_count = tsv_df["count"].sum()
        tsv_df["relative_abundance"] = (tsv_df["count"] / total_count) * 100

        # Add geo location to data frame
        tsv_df["geo_loc_name"] = geo_loc

        # Add biosample id to data frame
        tsv_df["data_ob_id"] = dataobj
        tsv_df["tsv_url"] = url

        # append tsv_df to list depending on the soil horizon type
        if horizon == "O horizon":
            o_horizon.append(tsv_df)
        else:
            m_horizon.append(tsv_df)

    except Exception as e:
        print(f"An error occurred: {e}")
        errors.append({
            "data_ob_id": dataobj,
            "url": url,
            "horizon": horizon,
            "geo_loc_name": geo_loc, 
            "data_ob_id": data_ob_id
            })
        continue

# concatenate list of dfs
o_df = pd.concat(o_horizon)
m_df = pd.concat(m_horizon)

m_df
Out[32]:
taxa count relative_abundance geo_loc_name data_ob_id tsv_url
0 Acidimicrobiia 69 0.335196 USA: Colorado, Niwot Ridge nmdc:dobj-11-zk896n23 https://data.microbiomedata.org/data/nmdc:ompr...
1 Acidithiobacillia 3 0.014574 USA: Colorado, Niwot Ridge nmdc:dobj-11-zk896n23 https://data.microbiomedata.org/data/nmdc:ompr...
2 Actinomycetes 12154 59.042992 USA: Colorado, Niwot Ridge nmdc:dobj-11-zk896n23 https://data.microbiomedata.org/data/nmdc:ompr...
3 Agaricomycetes 8 0.038863 USA: Colorado, Niwot Ridge nmdc:dobj-11-zk896n23 https://data.microbiomedata.org/data/nmdc:ompr...
4 Alphaproteobacteria 3912 19.004129 USA: Colorado, Niwot Ridge nmdc:dobj-11-zk896n23 https://data.microbiomedata.org/data/nmdc:ompr...
... ... ... ... ... ... ...
291 unclassified Zoopagomycota 20 0.000505 USA: Colorado, Niwot Ridge nmdc:dobj-11-1mwrks28 https://data.microbiomedata.org/data/nmdc:ompr...
292 unclassified candidate division NC10 1819 0.045958 USA: Colorado, Niwot Ridge nmdc:dobj-11-1mwrks28 https://data.microbiomedata.org/data/nmdc:ompr...
293 unclassified candidate division Zixibacteria 240 0.006064 USA: Colorado, Niwot Ridge nmdc:dobj-11-1mwrks28 https://data.microbiomedata.org/data/nmdc:ompr...
294 unclassified dsDNA viruses, no RNA stage 10 0.000253 USA: Colorado, Niwot Ridge nmdc:dobj-11-1mwrks28 https://data.microbiomedata.org/data/nmdc:ompr...
295 unclassified viruses 2 0.000051 USA: Colorado, Niwot Ridge nmdc:dobj-11-1mwrks28 https://data.microbiomedata.org/data/nmdc:ompr...

2582 rows × 6 columns

Look into any errors that occurred from the TSV requests¶

Any TSVs that could not be requested were added to an errors dictionary.

In [33]:
print(errors)
[]

Define a function to calculate abundance¶

A function is defined that takes an input of a dataframe and calculates the average relative abundance of each taxa.

In [34]:
def taxa_abundance(df):

    df = df.drop_duplicates(subset=['data_ob_id', 'taxa'])

    # pivot the table to find all combos of biosample and taxa - set NAs to 0 for relative abundance
    wide_df = df.pivot(index = "data_ob_id", columns = "taxa", values = "relative_abundance")
    wide_df = wide_df.fillna(0)
    wide_df.reset_index(inplace=True)
    
    # convert wide_df back with relative_abundances set to 0 for samples that were missing taxa
    melted_df = pd.melt(wide_df, id_vars = "data_ob_id", var_name = "taxa", value_name = "relative_abundance")

    # calculate abundance and add column to data frame
    final_df = melted_df.groupby("taxa")["relative_abundance"].mean().reset_index(name="avg_relative_abundance")

    return final_df

Calculate the abundance of the O and M horizon data frames¶

Using the function defined above, the counts_m and counts_o data frames returned from iterating over the TSV files are used as input into the function, where the average relative abundance calculations are returned as data frames. We then concatenate the two data frames together, creating a new column for soil_horizon, where the value is either O or M, depending on which data frame it originally came from.

In [35]:
# caculate abundance for each soil horizon type and get top 25 taxa, grouping the rest
m_final = taxa_abundance(m_df)
o_final = taxa_abundance(o_df)

# combine data frames
o_final["soil_horizon"] = "O"
m_final["soil_horizon"] = "M"
abundance_df = pd.concat([o_final, m_final])

abundance_df
Out[35]:
taxa avg_relative_abundance soil_horizon
0 Acidimicrobiia 0.324138 O
1 Acidithiobacillia 0.031115 O
2 Aconoidasida 0.002462 O
3 Actinomycetes 24.049704 O
4 Actinopteri 0.001235 O
... ... ... ...
300 unclassified Zoopagomycota 0.000070 M
301 unclassified candidate division NC10 0.106987 M
302 unclassified candidate division Zixibacteria 0.012937 M
303 unclassified dsDNA viruses, no RNA stage 0.000034 M
304 unclassified viruses 0.000952 M

604 rows × 3 columns

Plot the taxa abundance of M vs. O horizon soil samples¶

Using the plotly library, the percent abundance of the taxa is plotted as a bar chart - each bar representing the soil horizon and the colors representing the taxa.

In [36]:
# Plot the taxa abundance of each soil type
fig = px.bar(abundance_df, x="soil_horizon", y="avg_relative_abundance", color="taxa", 
             title = "% Abundance of phylum-level taxa in M and O horizon soil samples in Colorado", 
             labels = {"soil_horizon": "Soil Horizon", "avg_relative_abundance": "% Abundance"})
    
fig.update_layout(height=600)
fig.show()          
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/plotly/express/_core.py:2065: FutureWarning:

When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.

Write a function to calculate the abundance per location¶

This is a function to use with the m_df and o_df outputs from the TSV iteration to calculate the % abundance for each geo_loc_name. It also groups the taxa after the top 5 for each loaction into "other".

In [37]:
def loc_abund(df):

    df = df.drop_duplicates(subset=['data_ob_id', 'taxa'])

    # pivot the table to find all combos of biosample and taxa - set NAs to 0 for relative abundance
    wide_df = df.pivot(index = "data_ob_id", columns = "taxa", values = "relative_abundance")
    wide_df = wide_df.fillna(0)
    wide_df.reset_index(inplace=True)

    # Add geo_loc_name column to wide_df
    wide_df = pd.merge(wide_df, df[['data_ob_id', 'geo_loc_name']], on='data_ob_id', how='left')
    
    # convert wide_df back with relative_abundances set to 0 for samples that were missing taxa
    melted_df = pd.melt(wide_df, id_vars=["data_ob_id", "geo_loc_name"], var_name="taxa", value_name="relative_abundance")

    final_df = melted_df.groupby(["geo_loc_name", "taxa"])["relative_abundance"].mean().reset_index(name="avg_relative_abundance")

    return final_df

Calculate the abundance of the location data frames¶

Using the function defined above, the m_df and the o_df data frames returned from iterating over the TSV files are used as input into the function, where the final abundance calculations and top 5 taxa are returned as data frames. We do the calculation by grouping by geo_loc_name in order to calculate abundances per location. We then concatenate the two data frames together, creating a new column for soil_horizon, where the value is either O or M, depending on which data frame it originally came from.

In [38]:
# caculate abundance for each soil horizon type and get top 5 taxa, grouping the rest
m_loc = loc_abund(m_df)
o_loc = loc_abund(o_df)

# combine data frames
o_loc["soil_horizon"] = "O"
m_loc["soil_horizon"] = "M"
loc_abund_df = pd.concat([o_loc, m_loc])

# Extract only region names from geo_loc_name
loc_abund_df["location"] = loc_abund_df["geo_loc_name"].str.extract(r'Colorado, (.*)')

loc_abund_df
Out[38]:
geo_loc_name taxa avg_relative_abundance soil_horizon location
0 USA: Colorado, Niwot Ridge Acidimicrobiia 0.295080 O Niwot Ridge
1 USA: Colorado, Niwot Ridge Acidithiobacillia 0.035143 O Niwot Ridge
2 USA: Colorado, Niwot Ridge Aconoidasida 0.003006 O Niwot Ridge
3 USA: Colorado, Niwot Ridge Actinomycetes 22.616634 O Niwot Ridge
4 USA: Colorado, Niwot Ridge Actinopteri 0.001069 O Niwot Ridge
... ... ... ... ... ...
1215 USA: Colorado, Rocky Mountains unclassified Zoopagomycota 0.000000 M Rocky Mountains
1216 USA: Colorado, Rocky Mountains unclassified candidate division NC10 0.065346 M Rocky Mountains
1217 USA: Colorado, Rocky Mountains unclassified candidate division Zixibacteria 0.008262 M Rocky Mountains
1218 USA: Colorado, Rocky Mountains unclassified dsDNA viruses, no RNA stage 0.000000 M Rocky Mountains
1219 USA: Colorado, Rocky Mountains unclassified viruses 0.000751 M Rocky Mountains

1818 rows × 5 columns

Plot the taxa abundance of M and O horizon soil samples for each location¶

Using the plotly library, the percent abundance of the taxa is plotted as a bar chart for each geo location and faceted by soil horizon.

In [39]:
geo_fig = px.bar(loc_abund_df, x = "soil_horizon", y="avg_relative_abundance", color = "taxa", 
                 facet_col = "location",
                 facet_col_spacing = 0.1,
                 title = "% Abundance of phylum-level taxa in M and O horizon samples for each Colorado location", 
                 labels = {"geo_loc_name": "Location", "avg_relative_abundance": "% Abundance"},
                 height = 600)
# update figure to remove "location=" from facet column labels
geo_fig.for_each_annotation(lambda a: a.update(text=a.text.replace("location=", "")))

# show figure
geo_fig.show()