How does the taxonomic distribution of contigs differ by soil layer (mineral vs organic) in Colorado?
This notebook uses the nmdc_api_utilities package (as of March 2025) to explore how the taxonomic distribution of contigs differs between the mineral and organic soil layers in Colorado. It uses nmdc_api_utilities objects to make NMDC API requests that reach the scaffold lineage TSV data objects, which are then used to analyze the taxonomic distribution. Iterating through the TSV files involves 350+ API calls to gather the necessary taxonomic counts and is time consuming.
import requests
import pandas as pd
from io import StringIO
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import nmdc_api_utilities
1. Get all biosamples where soil_horizon exists and the geo_loc_name has "Colorado" in the name
The first step in answering how the taxonomic distribution of contigs differs by soil layer is to get a list of all the biosamples that have metadata for soil_horizon and a geo_loc_name containing the string "Colorado".
Using the Python package nmdc_api_utilities, we can do this with the get_record_by_filter function. We first need to create a BiosampleSearch object to search the "biosample_set" collection. More information regarding the nmdc_api_utilities package can be found here. We then pass a Mongo-like filter of {"soil_horizon":{"$exists": true}, "geo_loc_name.has_raw_value": {"$regex": "Colorado"}}, a maximum page size of 100, and the three fields we want returned: id, soil_horizon, and geo_loc_name. Note that id is returned no matter what. Since we will be joining the results of multiple API requests, each of which has an id field for a different collection, we rename the id key to be more explicit, calling it biosample_id instead. Finally, we convert the biosample results to a data frame called biosample_df. Note that 518 biosamples are returned.
from nmdc_api_utilities.biosample_search import BiosampleSearch
from nmdc_api_utilities.data_processing import DataProcessing
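# ENV is not defined in the cells shown here; "prod" (the production API) is an assumed value
ENV = "prod"
# Create a BiosampleSearch object
bs_client = BiosampleSearch(env=ENV)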
# create a DataProcessing object
dp_client = DataProcessing()
# define the filter
filter = '{"soil_horizon":{"$exists": true}, "geo_loc_name.has_raw_value": {"$regex": "Colorado"}}'
# get the results
bs_results = bs_client.get_record_by_filter(filter=filter, fields="id,soil_horizon,geo_loc_name", max_page_size=100, all_pages=True)
# clarify names
for biosample in bs_results:
    biosample["biosample_id"] = biosample.pop("id")
# convert to df
biosample_df = dp_client.convert_to_df(bs_results)
# Adjust geo_loc_name to not be a dictionary
biosample_df["geo_loc_name"] = biosample_df["geo_loc_name"].apply(lambda x: x.get("has_raw_value"))
biosample_df
| | soil_horizon | geo_loc_name | biosample_id |
|---|---|---|---|
| 0 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-00m15h97 |
| 1 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-06ta8e31 |
| 2 | O horizon | USA: Colorado, Rocky Mountains | nmdc:bsm-11-06tgpb52 |
| 3 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-0asn5d63 |
| 4 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-0djp2e45 |
| ... | ... | ... | ... |
| 513 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-zhrzwh12 |
| 514 | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-zhzner35 |
| 515 | O horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-zjsrkd21 |
| 516 | O horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-zk6h3328 |
| 517 | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-znvc3c66 |
518 rows × 3 columns
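To make the filter semantics concrete, here is a small local sketch of what the two conditions select. It does not call the NMDC API: the records and their ids are made up for illustration, and `matches` is a hypothetical helper that only mimics `$exists` and `$regex` in plain Python.

```python
import re

# Hypothetical records shaped like biosample_set documents (all values are made up)
records = [
    {"id": "bsm-a", "soil_horizon": "M horizon",
     "geo_loc_name": {"has_raw_value": "USA: Colorado, Niwot Ridge"}},
    {"id": "bsm-b", "soil_horizon": "O horizon",
     "geo_loc_name": {"has_raw_value": "USA: Alaska, Healy"}},
    {"id": "bsm-c",
     "geo_loc_name": {"has_raw_value": "USA: Colorado, Rocky Mountains"}},
]


def matches(record):
    # {"soil_horizon": {"$exists": true}} -> the key must be present
    has_horizon = "soil_horizon" in record
    # {"geo_loc_name.has_raw_value": {"$regex": "Colorado"}} -> pattern found anywhere in the value
    raw_value = record.get("geo_loc_name", {}).get("has_raw_value", "")
    return has_horizon and re.search("Colorado", raw_value) is not None


selected = [r["id"] for r in records if matches(r)]
print(selected)
```

Only the first record passes both conditions: the second is not in Colorado, and the third has no soil_horizon.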
Define a function to split a list into chunks
Since we will need to use a list of ids in a filter to query a new collection in the API, we need to limit the number of ids we put in each filter. This function splits a list into chunks. Note that chunk_size has a default of 100, but can be adjusted.
# Define a function to split ids into chunks
def split_list(input_list, chunk_size=100):
    result = []
    for i in range(0, len(input_list), chunk_size):
        result.append(input_list[i:i + chunk_size])
    return result
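As a quick check of the chunking behaviour, the sketch below applies an equivalent split_list to 518 placeholder ids (the same count as the biosample table above). The ids are invented for illustration, and the function is repeated here only so the example runs on its own.

```python
def split_list(input_list, chunk_size=100):
    # same behaviour as the notebook's function, written as a comprehension
    return [input_list[i:i + chunk_size] for i in range(0, len(input_list), chunk_size)]


# Placeholder ids, not real NMDC identifiers
example_ids = [f"example-id-{n}" for n in range(518)]

chunks = split_list(example_ids)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # [100, 100, 100, 100, 100, 18]
```

With the default chunk size, 518 ids translate into six filter queries rather than one very long filter.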
Define a function to get a list of ids from initial results
This function creates a list of identifiers from a list of responses returned by the nmdc_api_utilities functions. It uses the id_name key of each result to build a list of all the ids. The inputs are the initial result list and the name of the id field.
def get_id_list(result_list: list, id_name: str):
    id_list = []
    for item in result_list:
        # a field may hold a single id (str) or several ids (list)
        if isinstance(item[id_name], str):
            id_list.append(item[id_name])
        elif isinstance(item[id_name], list):
            id_list.extend(item[id_name])
    return id_list
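The two branches matter because some fields hold one id while others (such as has_input and has_output) hold lists. A minimal illustration, using made-up records and a copy of the function so the example is self-contained:

```python
def get_id_list(result_list, id_name):
    id_list = []
    for item in result_list:
        if isinstance(item[id_name], str):
            id_list.append(item[id_name])
        elif isinstance(item[id_name], list):
            id_list.extend(item[id_name])
    return id_list


# Made-up records: one string-valued field and one list-valued field
example_results = [
    {"pooling_id": "pool-1", "pooling_has_output": ["proc-1"]},
    {"pooling_id": "pool-2", "pooling_has_output": ["proc-2", "proc-3"]},
]

single_ids = get_id_list(example_results, "pooling_id")
flattened_ids = get_id_list(example_results, "pooling_has_output")
print(single_ids)
print(flattened_ids)
```

List-valued fields are flattened into one list, so the result can be passed straight to split_list.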
2. Get all Pooling results where the Pooling has_input are the biosample ids
We want to query the material processing collection, so we create a MaterialProcessingSearch object. We use the get_record_by_filter function from this object to get a list of all pooling results whose has_input field contains the biosample_ids we retrieved in step 1. As touched on earlier, we also want to avoid passing too many ids into a filter at once, so we use the get_id_list and split_list functions to create chunks to iterate over. We also return has_output and rename the keys so it is clear which collection the results come from. The filter restricts the query to records where type is nmdc:Pooling. Finally, the pooling results are converted to a data frame.
from nmdc_api_utilities.material_processing_search import MaterialProcessingSearch
# create a MaterialProcessingSearch object
mp_client = MaterialProcessingSearch(env=ENV)
# create a DataProcessing object
dp_client = DataProcessing()
# process the biosamples in chunks
result_ids = get_id_list(bs_results, "biosample_id")
chunked_list = split_list(result_ids)
pooling = []
for chunk in chunked_list:
    # create the filter - query the material_processing_set collection looking for data objects that have the biosample_id in the has_input field and are of type nmdc:Pooling
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:Pooling", "has_input": {{"$in": {filter_list}}}}}'
    # get the results
    pooling += mp_client.get_record_by_filter(filter=filter, fields="id,has_input,has_output", max_page_size=100, all_pages=True)
# clarify names/keys/identifiers
for pool in pooling:
    pool["pooling_has_input"] = pool.pop("has_input")
    pool["pooling_has_output"] = pool.pop("has_output")
    pool["pooling_id"] = pool.pop("id")
pooling_df = dp_client.convert_to_df(pooling)
pooling_df
| | pooling_has_input | pooling_has_output | pooling_id |
|---|---|---|---|
| 0 | [nmdc:bsm-11-5228zz06, nmdc:bsm-11-1frj0t76, n... | [nmdc:procsm-11-49bwy122] | nmdc:poolp-11-a1nnyd94 |
| 1 | [nmdc:bsm-11-e0qcsb54, nmdc:bsm-11-3admsx52, n... | [nmdc:procsm-11-cnz65b78] | nmdc:poolp-11-gc19j338 |
| 2 | [nmdc:bsm-11-ex491068, nmdc:bsm-11-1byjjh32, n... | [nmdc:procsm-11-kngzyt90] | nmdc:poolp-11-sj9jpg87 |
| 3 | [nmdc:bsm-11-ehyv5z41, nmdc:bsm-11-48nzey88, n... | [nmdc:procsm-11-9th0yt69] | nmdc:poolp-11-rx280a54 |
| 4 | [nmdc:bsm-11-2744k638, nmdc:bsm-11-85vfjq03, n... | [nmdc:procsm-11-mdcbpc97] | nmdc:poolp-11-w8b7cv95 |
| ... | ... | ... | ... |
| 398 | [nmdc:bsm-11-znvc3c66, nmdc:bsm-11-wsr4vx16, n... | [nmdc:procsm-11-dvq1cx16] | nmdc:poolp-11-b13j8g68 |
| 399 | [nmdc:bsm-11-4k0jmb52, nmdc:bsm-11-ydtfff55, n... | [nmdc:procsm-11-mpcvhx03] | nmdc:poolp-11-1rp6ns28 |
| 400 | [nmdc:bsm-11-sgtk2z38, nmdc:bsm-11-xqtg8327, n... | [nmdc:procsm-11-f6kc8b10] | nmdc:poolp-11-ykrp9878 |
| 401 | [nmdc:bsm-11-yzpe6s26, nmdc:bsm-11-zfvcsy45, n... | [nmdc:procsm-11-wm0mqq15] | nmdc:poolp-11-bsnbr836 |
| 402 | [nmdc:bsm-11-zk6h3328, nmdc:bsm-11-kft4w435, n... | [nmdc:procsm-11-e015da88] | nmdc:poolp-11-658v9v07 |
403 rows × 3 columns
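The filter string in the loop above is assembled with an f-string and a private helper of the DataProcessing object. As an alternative sketch, the same kind of filter can be built with the standard library's json.dumps, which always yields valid JSON with double-quoted strings. Here build_in_filter is a hypothetical helper, the ids are placeholders, and it has not been checked that the output is character-for-character identical to what the notebook's helper produces.

```python
import json


def build_in_filter(record_type, field, ids):
    # Hypothetical helper: serialize a {"type": ..., field: {"$in": [...]}} filter as JSON
    return json.dumps({"type": record_type, field: {"$in": list(ids)}})


# Placeholder ids, not real NMDC identifiers
chunk = ["example-bsm-1", "example-bsm-2"]

filter_str = build_in_filter("nmdc:Pooling", "has_input", chunk)
print(filter_str)

# Round-trip to confirm the string is well-formed JSON
parsed = json.loads(filter_str)
```

Building the dictionary first avoids the doubled braces that f-strings require and makes quoting mistakes less likely.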
2.5 Merge biosample and pooling results
We utilize the DataProcessing object's merge_df function to merge the newly acquired pooling results with the original biosample results obtained from the package in step 1. We use the pooling_has_input and biosample_id from the two data frames as key names to merge on.
merged_df1 = dp_client.merge_df(pooling_df, biosample_df, "pooling_has_input", "biosample_id")
merged_df1
| | pooling_has_input | pooling_has_output | pooling_id | soil_horizon | geo_loc_name | biosample_id |
|---|---|---|---|---|---|---|
| 0 | nmdc:bsm-11-5228zz06 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-5228zz06 |
| 1 | nmdc:bsm-11-1frj0t76 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-1frj0t76 |
| 2 | nmdc:bsm-11-nyxsx333 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-nyxsx333 |
| 3 | nmdc:bsm-11-e0qcsb54 | nmdc:procsm-11-cnz65b78 | nmdc:poolp-11-gc19j338 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-e0qcsb54 |
| 4 | nmdc:bsm-11-3admsx52 | nmdc:procsm-11-cnz65b78 | nmdc:poolp-11-gc19j338 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-3admsx52 |
| ... | ... | ... | ... | ... | ... | ... |
| 1087 | nmdc:bsm-11-xqtg8327 | nmdc:procsm-11-f6kc8b10 | nmdc:poolp-11-ykrp9878 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-xqtg8327 |
| 1088 | nmdc:bsm-11-z5cmyh06 | nmdc:procsm-11-f6kc8b10 | nmdc:poolp-11-ykrp9878 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-z5cmyh06 |
| 1107 | nmdc:bsm-11-stjpwh75 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-stjpwh75 |
| 1108 | nmdc:bsm-11-xngp2r34 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-xngp2r34 |
| 1109 | nmdc:bsm-11-yehx2807 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-yehx2807 |
505 rows × 6 columns
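Judging from the output above, the merge turns each list in pooling_has_input into one row per biosample before joining. The sketch below reproduces that shape with plain pandas on made-up data. It is an approximation based on the displayed result (including the assumption that duplicates are dropped), not the actual implementation of merge_df.

```python
import pandas as pd

# Made-up pooling records with list-valued inputs and outputs
pooling_example = pd.DataFrame({
    "pooling_id": ["pool-1", "pool-2"],
    "pooling_has_input": [["bsm-1", "bsm-2"], ["bsm-3"]],
    "pooling_has_output": [["proc-1"], ["proc-2"]],
})
biosample_example = pd.DataFrame({
    "biosample_id": ["bsm-1", "bsm-2", "bsm-3"],
    "soil_horizon": ["M horizon", "M horizon", "O horizon"],
})

# One row per (pooling, input biosample) pair, then unwrap the output lists
exploded = pooling_example.explode("pooling_has_input").explode("pooling_has_output")

# Inner join on the biosample identifier; dropping duplicates is an assumption
merged_example = exploded.merge(
    biosample_example,
    left_on="pooling_has_input",
    right_on="biosample_id",
    how="inner",
).drop_duplicates()
print(merged_example)
```

After the join, every biosample row carries the id of the pooled sample it contributed to, which is the key used in the next step.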
3. Get processed sample results where the processed sample ids are the pooling_has_output field
Since we want to query the processed sample collection, we create a ProcessedSampleSearch object and use its get_record_by_filter function. This returns the processed sample records whose id is one of the pooling_has_output identifiers. We return only the processed sample id field and rename it to processed_sample1 so it is clear that the identifiers come from the processed_sample_set. Finally, the results are converted to a data frame.
from nmdc_api_utilities.processed_sample_search import ProcessedSampleSearch
# create a ProcessedSampleSearch object
ps_client = ProcessedSampleSearch(env=ENV)
# process the pooling in chunks
result_ids = get_id_list(pooling, "pooling_has_output")
chunked_list = split_list(result_ids)
process_set1 = []
for chunk in chunked_list:
    # create the filter - query the processed_sample_set collection for records whose id is one of the pooling outputs
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:ProcessedSample", "id": {{"$in": {filter_list}}}}}'
    # get the results
    process_set1 += ps_client.get_record_by_filter(filter=filter, fields="id", max_page_size=100, all_pages=True)
# clarify names
for processed_sample in process_set1:
    processed_sample["processed_sample1"] = processed_sample.pop("id")
ps1_df = dp_client.convert_to_df(process_set1)
ps1_df
| | processed_sample1 |
|---|---|
| 0 | nmdc:procsm-11-1sr06083 |
| 1 | nmdc:procsm-11-258vbz70 |
| 2 | nmdc:procsm-11-2fxf0e98 |
| 3 | nmdc:procsm-11-2xvsb693 |
| 4 | nmdc:procsm-11-33n4p085 |
| ... | ... |
| 364 | nmdc:procsm-11-ztam2998 |
| 365 | nmdc:procsm-11-zw2k5d74 |
| 366 | nmdc:procsm-11-e015da88 |
| 367 | nmdc:procsm-11-f6kc8b10 |
| 368 | nmdc:procsm-11-wm0mqq15 |
369 rows × 1 columns
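The pooling table has 403 rows while only 369 processed samples come back. One possible explanation, which the notebook does not confirm, is that a record whose has_input ids fall into different chunks is returned once per matching chunk. A minimal sketch for removing such repeats by id, using made-up records and a hypothetical dedupe_records helper:

```python
def dedupe_records(records, key):
    # Keep the first record seen for each value of `key`, preserving order
    seen = {}
    for record in records:
        seen.setdefault(record[key], record)
    return list(seen.values())


# Made-up results in which "pool-1" was returned by two different chunk queries
example_records = [
    {"pooling_id": "pool-1", "pooling_has_output": ["proc-1"]},
    {"pooling_id": "pool-2", "pooling_has_output": ["proc-2"]},
    {"pooling_id": "pool-1", "pooling_has_output": ["proc-1"]},
]

unique_records = dedupe_records(example_records, "pooling_id")
unique_ids = [record["pooling_id"] for record in unique_records]
print(unique_ids)
```

Deduplicating before building the next list of ids would also reduce the number of ids sent in later filters.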
3.5 Merge processed sample results with the previously merged results
The merge_df function is used, once again, to merge the pooling and processed sample results on the pooling_has_output and processed_sample1 keys for the two data frames.
merged_df2 = dp_client.merge_df(merged_df1, ps1_df, "pooling_has_output", "processed_sample1")
merged_df2
| | pooling_has_input | pooling_has_output | pooling_id | soil_horizon | geo_loc_name | biosample_id | processed_sample1 |
|---|---|---|---|---|---|---|---|
| 0 | nmdc:bsm-11-5228zz06 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-5228zz06 | nmdc:procsm-11-49bwy122 |
| 2 | nmdc:bsm-11-1frj0t76 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-1frj0t76 | nmdc:procsm-11-49bwy122 |
| 4 | nmdc:bsm-11-nyxsx333 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-nyxsx333 | nmdc:procsm-11-49bwy122 |
| 6 | nmdc:bsm-11-e0qcsb54 | nmdc:procsm-11-cnz65b78 | nmdc:poolp-11-gc19j338 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-e0qcsb54 | nmdc:procsm-11-cnz65b78 |
| 9 | nmdc:bsm-11-3admsx52 | nmdc:procsm-11-cnz65b78 | nmdc:poolp-11-gc19j338 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-3admsx52 | nmdc:procsm-11-cnz65b78 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1069 | nmdc:bsm-11-xqtg8327 | nmdc:procsm-11-f6kc8b10 | nmdc:poolp-11-ykrp9878 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-xqtg8327 | nmdc:procsm-11-f6kc8b10 |
| 1071 | nmdc:bsm-11-z5cmyh06 | nmdc:procsm-11-f6kc8b10 | nmdc:poolp-11-ykrp9878 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-z5cmyh06 | nmdc:procsm-11-f6kc8b10 |
| 1073 | nmdc:bsm-11-stjpwh75 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-stjpwh75 | nmdc:procsm-11-dkr9k079 |
| 1074 | nmdc:bsm-11-xngp2r34 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-xngp2r34 | nmdc:procsm-11-dkr9k079 |
| 1075 | nmdc:bsm-11-yehx2807 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-yehx2807 | nmdc:procsm-11-dkr9k079 |
505 rows × 7 columns
4. Get extraction results where the processed_sample1 identifier is the has_input to the material_processing_set for Extractions
We want to query the material_processing_set, so we reuse the MaterialProcessingSearch object created earlier, along with the get_record_by_filter function (you can see the pattern). This time we filter where type is nmdc:Extraction, using the processed_sample1 identifiers as the has_input values. The names of the fields in the results are adjusted to make it clear which set the inputs, outputs, and ids are from.
# process the processed samples in chunks
result_ids = get_id_list(process_set1, "processed_sample1")
chunked_list = split_list(result_ids)
extraction_set = []
for chunk in chunked_list:
    # create the filter - query the material_processing_set collection for records of type nmdc:Extraction whose has_input contains one of the processed_sample1 ids
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:Extraction", "has_input": {{"$in": {filter_list}}}}}'
    # get the results
    extraction_set += mp_client.get_record_by_filter(filter=filter, fields="id,has_input,has_output", max_page_size=100, all_pages=True)
# clarify names
for extraction in extraction_set:
    extraction["extract_has_input"] = extraction.pop("has_input")
    extraction["extract_has_output"] = extraction.pop("has_output")
    extraction["extract_id"] = extraction.pop("id")
# convert to data frame
extract_df = dp_client.convert_to_df(extraction_set)
extract_df
| | extract_has_input | extract_has_output | extract_id |
|---|---|---|---|
| 0 | [nmdc:procsm-11-49bwy122] | [nmdc:procsm-11-kwaaah42] | nmdc:extrp-11-fsv8td81 |
| 1 | [nmdc:procsm-11-s61wwe09] | [nmdc:procsm-11-hnd2nm64] | nmdc:extrp-11-8q3xp262 |
| 2 | [nmdc:procsm-11-cnz65b78] | [nmdc:procsm-11-sxnqtz74] | nmdc:extrp-11-3334yj37 |
| 3 | [nmdc:procsm-11-kngzyt90] | [nmdc:procsm-11-h9s7h174] | nmdc:extrp-11-v25scb12 |
| 4 | [nmdc:procsm-11-fyx7js23] | [nmdc:procsm-11-4yevrf17] | nmdc:extrp-11-4frcnb65 |
| ... | ... | ... | ... |
| 349 | [nmdc:procsm-11-w13fqp71] | [nmdc:procsm-11-xbhs5x61] | nmdc:extrp-11-7km6zh80 |
| 350 | [nmdc:procsm-11-eecpt338] | [nmdc:procsm-11-gxvm5r54] | nmdc:extrp-11-73jns979 |
| 351 | [nmdc:procsm-11-kxs8m249] | [nmdc:procsm-11-kee8xv47] | nmdc:extrp-11-g1cazp42 |
| 352 | [nmdc:procsm-11-rbfspv43] | [nmdc:procsm-11-6fat7f34] | nmdc:extrp-11-b7kcx022 |
| 353 | [nmdc:procsm-11-e015da88] | [nmdc:procsm-11-878yka43] | nmdc:extrp-11-ganvz782 |
354 rows × 3 columns
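Since every step repeats the same chunk, filter, and query loop, the pattern could be wrapped in one function. The sketch below is one way to do that. fetch_by_ids and FakeClient are hypothetical names introduced here: the fake client only mimics the keyword arguments that get_record_by_filter is called with in this notebook, so the example runs without network access.

```python
import json


def split_list(input_list, chunk_size=100):
    return [input_list[i:i + chunk_size] for i in range(0, len(input_list), chunk_size)]


def fetch_by_ids(client, ids, record_type, id_field, fields):
    # Hypothetical helper: query `client` once per chunk of ids and collect all records
    records = []
    for chunk in split_list(ids):
        filter_str = json.dumps({"type": record_type, id_field: {"$in": chunk}})
        records += client.get_record_by_filter(
            filter=filter_str, fields=fields, max_page_size=100, all_pages=True
        )
    return records


class FakeClient:
    """Hypothetical stand-in for a search object; it records every filter it receives."""

    def __init__(self):
        self.filters = []

    def get_record_by_filter(self, filter, fields, max_page_size, all_pages):
        parsed = json.loads(filter)
        self.filters.append(parsed)
        # Return one made-up record per requested id
        return [{"id": f"result-for-{i}"} for i in parsed["has_input"]["$in"]]


fake = FakeClient()
ids = [f"example-{n}" for n in range(250)]
results = fetch_by_ids(fake, ids, "nmdc:Extraction", "has_input", "id,has_input,has_output")
print(len(fake.filters), len(results))
```

With a real search object in place of FakeClient, each step of the notebook would reduce to one call plus the key renaming.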
4.5 Merge extraction results with the previously merged results
The extraction results obtained above are merged with the previously merged results (from step 3.5) using the processed_sample1 field in the previously merged data frame with the extract_has_input from the new extraction results.
merged_df3 = dp_client.merge_df(extract_df, merged_df2, "extract_has_input", "processed_sample1")
merged_df3
| | extract_has_input | extract_has_output | extract_id | pooling_has_input | pooling_has_output | pooling_id | soil_horizon | geo_loc_name | biosample_id | processed_sample1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 | nmdc:extrp-11-fsv8td81 | nmdc:bsm-11-5228zz06 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-5228zz06 | nmdc:procsm-11-49bwy122 |
| 1 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 | nmdc:extrp-11-fsv8td81 | nmdc:bsm-11-1frj0t76 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-1frj0t76 | nmdc:procsm-11-49bwy122 |
| 2 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 | nmdc:extrp-11-fsv8td81 | nmdc:bsm-11-nyxsx333 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-nyxsx333 | nmdc:procsm-11-49bwy122 |
| 3 | nmdc:procsm-11-s61wwe09 | nmdc:procsm-11-hnd2nm64 | nmdc:extrp-11-8q3xp262 | nmdc:bsm-11-pd429a61 | nmdc:procsm-11-s61wwe09 | nmdc:poolp-11-t5n1et05 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-pd429a61 | nmdc:procsm-11-s61wwe09 |
| 4 | nmdc:procsm-11-s61wwe09 | nmdc:procsm-11-hnd2nm64 | nmdc:extrp-11-8q3xp262 | nmdc:bsm-11-9yn2fq77 | nmdc:procsm-11-s61wwe09 | nmdc:poolp-11-t5n1et05 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-9yn2fq77 | nmdc:procsm-11-s61wwe09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 981 | nmdc:procsm-11-f6kc8b10 | nmdc:procsm-11-64ksxw87 | nmdc:extrp-11-vg3vzm96 | nmdc:bsm-11-xqtg8327 | nmdc:procsm-11-f6kc8b10 | nmdc:poolp-11-ykrp9878 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-xqtg8327 | nmdc:procsm-11-f6kc8b10 |
| 982 | nmdc:procsm-11-f6kc8b10 | nmdc:procsm-11-64ksxw87 | nmdc:extrp-11-vg3vzm96 | nmdc:bsm-11-z5cmyh06 | nmdc:procsm-11-f6kc8b10 | nmdc:poolp-11-ykrp9878 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-z5cmyh06 | nmdc:procsm-11-f6kc8b10 |
| 1001 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:extrp-11-k86nz804 | nmdc:bsm-11-stjpwh75 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-stjpwh75 | nmdc:procsm-11-dkr9k079 |
| 1002 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:extrp-11-k86nz804 | nmdc:bsm-11-xngp2r34 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-xngp2r34 | nmdc:procsm-11-dkr9k079 |
| 1003 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:extrp-11-k86nz804 | nmdc:bsm-11-yehx2807 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-yehx2807 | nmdc:procsm-11-dkr9k079 |
505 rows × 10 columns
5. Get processed sample results from the output of the extraction results
We utilize the ProcessedSampleSearch object again, but this time using the extract_has_output ids to query the set. We only need to return the processed_sample_set identifiers.
# process the processed samples in chunks
result_ids = get_id_list(extraction_set, "extract_has_output")
chunked_list = split_list(result_ids)
process_set2 = []
for chunk in chunked_list:
    # create the filter - query the processed_sample_set collection for records whose id is one of the extraction outputs
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:ProcessedSample", "id": {{"$in": {filter_list}}}}}'
    # get the results
    process_set2 += ps_client.get_record_by_filter(filter=filter, fields="id", max_page_size=100, all_pages=True)
# clarify names
for samp in process_set2:
    samp["processed_sample2"] = samp.pop("id")
# convert to data frame
ps2_df = dp_client.convert_to_df(process_set2)
ps2_df
| | processed_sample2 |
|---|---|
| 0 | nmdc:procsm-11-0qx90z87 |
| 1 | nmdc:procsm-11-0wxpzf07 |
| 2 | nmdc:procsm-11-1bzpzq15 |
| 3 | nmdc:procsm-11-1qfgdd16 |
| 4 | nmdc:procsm-11-1qgqxz62 |
| ... | ... |
| 343 | nmdc:procsm-11-xq1t3650 |
| 344 | nmdc:procsm-11-yav08109 |
| 345 | nmdc:procsm-11-ydtgc517 |
| 346 | nmdc:procsm-11-ze0gdq03 |
| 347 | nmdc:procsm-11-zr4x7712 |
348 rows × 1 columns
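The extraction table has 354 rows but 348 processed samples are returned here. The difference could come from repeated ids in the request or from ids with no matching record; the notebook does not say which. A small sketch, on made-up ids, of how the two cases could be told apart with set arithmetic:

```python
# Made-up ids: "proc-2" is requested twice and never returned
requested_ids = ["proc-1", "proc-2", "proc-3", "proc-2"]
returned_records = [
    {"processed_sample2": "proc-1"},
    {"processed_sample2": "proc-3"},
]

returned_ids = {record["processed_sample2"] for record in returned_records}
unique_requested = set(requested_ids)

# Ids that were asked for but did not come back
missing_ids = sorted(unique_requested - returned_ids)
# Requests that were redundant because the id appeared more than once
duplicate_count = len(requested_ids) - len(unique_requested)

print(missing_ids, duplicate_count)
```

In the notebook, requested_ids would be result_ids and returned_records would be process_set2.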
5.5 Merge the second processed_set results with the previous merged results
Using the merge_df function again, the processed_sample2 results are merged with the previously merged set (output of step 4.5) using the processed_sample2 identifiers and the extract_has_output identifiers from the merged set.
merged_df4 = dp_client.merge_df(merged_df3, ps2_df, "extract_has_output", "processed_sample2")
merged_df4
| | extract_has_input | extract_has_output | extract_id | pooling_has_input | pooling_has_output | pooling_id | soil_horizon | geo_loc_name | biosample_id | processed_sample1 | processed_sample2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 | nmdc:extrp-11-fsv8td81 | nmdc:bsm-11-5228zz06 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-5228zz06 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 |
| 1 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 | nmdc:extrp-11-fsv8td81 | nmdc:bsm-11-1frj0t76 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-1frj0t76 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 |
| 2 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 | nmdc:extrp-11-fsv8td81 | nmdc:bsm-11-nyxsx333 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-nyxsx333 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 |
| 3 | nmdc:procsm-11-s61wwe09 | nmdc:procsm-11-hnd2nm64 | nmdc:extrp-11-8q3xp262 | nmdc:bsm-11-pd429a61 | nmdc:procsm-11-s61wwe09 | nmdc:poolp-11-t5n1et05 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-pd429a61 | nmdc:procsm-11-s61wwe09 | nmdc:procsm-11-hnd2nm64 |
| 5 | nmdc:procsm-11-s61wwe09 | nmdc:procsm-11-hnd2nm64 | nmdc:extrp-11-8q3xp262 | nmdc:bsm-11-9yn2fq77 | nmdc:procsm-11-s61wwe09 | nmdc:poolp-11-t5n1et05 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-9yn2fq77 | nmdc:procsm-11-s61wwe09 | nmdc:procsm-11-hnd2nm64 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1008 | nmdc:procsm-11-f6kc8b10 | nmdc:procsm-11-64ksxw87 | nmdc:extrp-11-vg3vzm96 | nmdc:bsm-11-xqtg8327 | nmdc:procsm-11-f6kc8b10 | nmdc:poolp-11-ykrp9878 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-xqtg8327 | nmdc:procsm-11-f6kc8b10 | nmdc:procsm-11-64ksxw87 |
| 1009 | nmdc:procsm-11-f6kc8b10 | nmdc:procsm-11-64ksxw87 | nmdc:extrp-11-vg3vzm96 | nmdc:bsm-11-z5cmyh06 | nmdc:procsm-11-f6kc8b10 | nmdc:poolp-11-ykrp9878 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-z5cmyh06 | nmdc:procsm-11-f6kc8b10 | nmdc:procsm-11-64ksxw87 |
| 1010 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:extrp-11-k86nz804 | nmdc:bsm-11-stjpwh75 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-stjpwh75 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 |
| 1011 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:extrp-11-k86nz804 | nmdc:bsm-11-xngp2r34 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-xngp2r34 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 |
| 1012 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:extrp-11-k86nz804 | nmdc:bsm-11-yehx2807 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-yehx2807 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 |
505 rows × 11 columns
6. Use MaterialProcessingSearch to get records where type is nmdc:LibraryPreparation
Using the processed_sample2 identifiers from the last query as the has_input to filter where type is nmdc:LibraryPreparation, we get a new batch of results, returning the library preparation identifiers, inputs and outputs. The field names are clarified to demonstrate they are from the MaterialProcessingSearch object where the type is nmdc:LibraryPreparation.
# process the material_processing in chunks
result_ids = get_id_list(process_set2, "processed_sample2")
chunked_list = split_list(result_ids)
library_prep_set = []
for chunk in chunked_list:
    # create the filter - query the material_processing_set collection for records of type nmdc:LibraryPreparation whose has_input contains one of the processed_sample2 ids
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:LibraryPreparation", "has_input": {{"$in": {filter_list}}}}}'
    # get the results
    library_prep_set += mp_client.get_record_by_filter(filter=filter, fields="id,has_input,has_output", max_page_size=100, all_pages=True)
# clarify names
for prep in library_prep_set:
    prep["lp_has_input"] = prep.pop("has_input")
    prep["lp_has_output"] = prep.pop("has_output")
    prep["lp_id"] = prep.pop("id")
# convert to data frame
lp_df = dp_client.convert_to_df(library_prep_set)
lp_df
| | lp_has_input | lp_has_output | lp_id |
|---|---|---|---|
| 0 | [nmdc:procsm-11-nay11727] | [nmdc:procsm-11-9pdkj890] | nmdc:libprp-11-2tnjjj55 |
| 1 | [nmdc:procsm-11-kwaaah42] | [nmdc:procsm-11-kfkbxp22] | nmdc:libprp-11-k5j44e20 |
| 2 | [nmdc:procsm-11-hnd2nm64] | [nmdc:procsm-11-as6w8f18] | nmdc:libprp-11-h2hy8z17 |
| 3 | [nmdc:procsm-11-7qy2y664] | [nmdc:procsm-11-wd4s5f38] | nmdc:libprp-11-wv6p0032 |
| 4 | [nmdc:procsm-11-sxnqtz74] | [nmdc:procsm-11-f06scg15] | nmdc:libprp-11-ctwynj07 |
| ... | ... | ... | ... |
| 340 | [nmdc:procsm-11-kpw8j244] | [nmdc:procsm-11-1v407908] | nmdc:libprp-11-8ra09y76 |
| 341 | [nmdc:procsm-11-zr4x7712] | [nmdc:procsm-11-vhfb5c18] | nmdc:libprp-11-12ph5n93 |
| 342 | [nmdc:procsm-11-kee8xv47] | [nmdc:procsm-11-1eg4r286] | nmdc:libprp-11-x8nqhq06 |
| 343 | [nmdc:procsm-11-6fat7f34] | [nmdc:procsm-11-gm915e24] | nmdc:libprp-11-874cdm88 |
| 344 | [nmdc:procsm-11-878yka43] | [nmdc:procsm-11-t66cxk50] | nmdc:libprp-11-4g7pfm95 |
345 rows × 3 columns
6.5 Merge library preparation results with previously merged results
The library preparation results are merged with the previous results (from step 5.5) using the lp_has_input and the processed_sample2 fields.
merged_df5 = dp_client.merge_df(lp_df, merged_df4, "lp_has_input", "processed_sample2")
merged_df5
| | lp_has_input | lp_has_output | lp_id | extract_has_input | extract_has_output | extract_id | pooling_has_input | pooling_has_output | pooling_id | soil_horizon | geo_loc_name | biosample_id | processed_sample1 | processed_sample2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:procsm-11-nay11727 | nmdc:procsm-11-9pdkj890 | nmdc:libprp-11-2tnjjj55 | nmdc:procsm-11-8ec7zx31 | nmdc:procsm-11-nay11727 | nmdc:extrp-11-574dws05 | nmdc:bsm-11-w43vsm21 | nmdc:procsm-11-8ec7zx31 | nmdc:poolp-11-0ak13p40 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-w43vsm21 | nmdc:procsm-11-8ec7zx31 | nmdc:procsm-11-nay11727 |
| 1 | nmdc:procsm-11-nay11727 | nmdc:procsm-11-9pdkj890 | nmdc:libprp-11-2tnjjj55 | nmdc:procsm-11-8ec7zx31 | nmdc:procsm-11-nay11727 | nmdc:extrp-11-574dws05 | nmdc:bsm-11-dbavm335 | nmdc:procsm-11-8ec7zx31 | nmdc:poolp-11-0ak13p40 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-dbavm335 | nmdc:procsm-11-8ec7zx31 | nmdc:procsm-11-nay11727 |
| 2 | nmdc:procsm-11-nay11727 | nmdc:procsm-11-9pdkj890 | nmdc:libprp-11-2tnjjj55 | nmdc:procsm-11-8ec7zx31 | nmdc:procsm-11-nay11727 | nmdc:extrp-11-574dws05 | nmdc:bsm-11-4c6er508 | nmdc:procsm-11-8ec7zx31 | nmdc:poolp-11-0ak13p40 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-4c6er508 | nmdc:procsm-11-8ec7zx31 | nmdc:procsm-11-nay11727 |
| 3 | nmdc:procsm-11-kwaaah42 | nmdc:procsm-11-kfkbxp22 | nmdc:libprp-11-k5j44e20 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 | nmdc:extrp-11-fsv8td81 | nmdc:bsm-11-5228zz06 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-5228zz06 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 |
| 4 | nmdc:procsm-11-kwaaah42 | nmdc:procsm-11-kfkbxp22 | nmdc:libprp-11-k5j44e20 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 | nmdc:extrp-11-fsv8td81 | nmdc:bsm-11-1frj0t76 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-1frj0t76 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 961 | nmdc:procsm-11-64ksxw87 | nmdc:procsm-11-bqe26091 | nmdc:libprp-11-4ebzbm49 | nmdc:procsm-11-f6kc8b10 | nmdc:procsm-11-64ksxw87 | nmdc:extrp-11-vg3vzm96 | nmdc:bsm-11-xqtg8327 | nmdc:procsm-11-f6kc8b10 | nmdc:poolp-11-ykrp9878 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-xqtg8327 | nmdc:procsm-11-f6kc8b10 | nmdc:procsm-11-64ksxw87 |
| 962 | nmdc:procsm-11-64ksxw87 | nmdc:procsm-11-bqe26091 | nmdc:libprp-11-4ebzbm49 | nmdc:procsm-11-f6kc8b10 | nmdc:procsm-11-64ksxw87 | nmdc:extrp-11-vg3vzm96 | nmdc:bsm-11-z5cmyh06 | nmdc:procsm-11-f6kc8b10 | nmdc:poolp-11-ykrp9878 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-z5cmyh06 | nmdc:procsm-11-f6kc8b10 | nmdc:procsm-11-64ksxw87 |
| 972 | nmdc:procsm-11-s8m02r47 | nmdc:procsm-11-68j9y310 | nmdc:libprp-11-rz4mr176 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:extrp-11-k86nz804 | nmdc:bsm-11-stjpwh75 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-stjpwh75 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 |
| 973 | nmdc:procsm-11-s8m02r47 | nmdc:procsm-11-68j9y310 | nmdc:libprp-11-rz4mr176 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:extrp-11-k86nz804 | nmdc:bsm-11-xngp2r34 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-xngp2r34 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 |
| 974 | nmdc:procsm-11-s8m02r47 | nmdc:procsm-11-68j9y310 | nmdc:libprp-11-rz4mr176 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:extrp-11-k86nz804 | nmdc:bsm-11-yehx2807 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-yehx2807 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 |
505 rows × 14 columns
7 Get third set of processed samples from the library preparation output¶
For the third and final time, we use the ProcessedSampleSearch object, this time building the filter from the lp_has_output identifiers. Only the id field is returned, which we rename to processed_sample3.
# process the processed_sample_set in chunks
result_ids = get_id_list(library_prep_set, "lp_has_output")
chunked_list = split_list(result_ids)
process_set3 = []
for chunk in chunked_list:
    # create the filter - query the processed_sample_set collection for records of type nmdc:ProcessedSample whose id is in the current chunk of lp_has_output identifiers
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:ProcessedSample", "id": {{"$in": {filter_list}}}}}'
    # get the results
    process_set3 += ps_client.get_record_by_filter(filter=filter, fields="id", max_page_size=100, all_pages=True)
# clarify keys
for samp in process_set3:
    samp["processed_sample3"] = samp.pop("id")
# convert to data frame
ps3_df = dp_client.convert_to_df(process_set3)
ps3_df
| | processed_sample3 |
|---|---|
| 0 | nmdc:procsm-11-01k85106 |
| 1 | nmdc:procsm-11-0tkf2q02 |
| 2 | nmdc:procsm-11-12hw2r66 |
| 3 | nmdc:procsm-11-1kf9fn36 |
| 4 | nmdc:procsm-11-1v407908 |
| ... | ... |
| 336 | nmdc:procsm-11-x27qy119 |
| 337 | nmdc:procsm-11-xbva4x23 |
| 338 | nmdc:procsm-11-yqtwwk98 |
| 339 | nmdc:procsm-11-za57ra10 |
| 340 | nmdc:procsm-11-zqw3wv67 |
341 rows × 1 columns
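The chunked query pattern used above can be sketched in isolation. This is a minimal illustration, not the notebook's own split_list or the library's _string_mongo_list: the helper names chunk_ids and build_in_filter are hypothetical, the identifiers are made up, and we assume the API accepts the filter as a JSON string, as the f-string in the cell above suggests.

```python
import json
from typing import Iterator, List


def chunk_ids(ids: List[str], chunk_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size identifiers (hypothetical helper)."""
    for start in range(0, len(ids), chunk_size):
        yield ids[start:start + chunk_size]


def build_in_filter(record_type: str, field: str, ids: List[str]) -> str:
    """Serialize a Mongo-like filter matching `field` against any of `ids` (hypothetical helper)."""
    return json.dumps({"type": record_type, field: {"$in": ids}})


# Made-up identifiers, for illustration only
example_ids = [f"nmdc:procsm-00-example{i:03d}" for i in range(250)]

# One filter string per chunk; each would be sent as a separate request
filters = [
    build_in_filter("nmdc:ProcessedSample", "id", chunk)
    for chunk in chunk_ids(example_ids)
]
print(len(filters))
```

Building the filter with json.dumps rather than string formatting avoids hand-escaping braces and quotes, at the cost of an extra import.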
7.5 Merge the third batch of processed samples with the merged data frame¶
The last batch of processed samples is merged with the previously merged data frame (output of step 6.5), matching on the lp_has_output and processed_sample3 fields.
merged_df6 = dp_client.merge_df(merged_df5, ps3_df, "lp_has_output", "processed_sample3")
merged_df6
| | lp_has_input | lp_has_output | lp_id | extract_has_input | extract_has_output | extract_id | pooling_has_input | pooling_has_output | pooling_id | soil_horizon | geo_loc_name | biosample_id | processed_sample1 | processed_sample2 | processed_sample3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:procsm-11-nay11727 | nmdc:procsm-11-9pdkj890 | nmdc:libprp-11-2tnjjj55 | nmdc:procsm-11-8ec7zx31 | nmdc:procsm-11-nay11727 | nmdc:extrp-11-574dws05 | nmdc:bsm-11-w43vsm21 | nmdc:procsm-11-8ec7zx31 | nmdc:poolp-11-0ak13p40 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-w43vsm21 | nmdc:procsm-11-8ec7zx31 | nmdc:procsm-11-nay11727 | nmdc:procsm-11-9pdkj890 |
| 2 | nmdc:procsm-11-nay11727 | nmdc:procsm-11-9pdkj890 | nmdc:libprp-11-2tnjjj55 | nmdc:procsm-11-8ec7zx31 | nmdc:procsm-11-nay11727 | nmdc:extrp-11-574dws05 | nmdc:bsm-11-dbavm335 | nmdc:procsm-11-8ec7zx31 | nmdc:poolp-11-0ak13p40 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-dbavm335 | nmdc:procsm-11-8ec7zx31 | nmdc:procsm-11-nay11727 | nmdc:procsm-11-9pdkj890 |
| 4 | nmdc:procsm-11-nay11727 | nmdc:procsm-11-9pdkj890 | nmdc:libprp-11-2tnjjj55 | nmdc:procsm-11-8ec7zx31 | nmdc:procsm-11-nay11727 | nmdc:extrp-11-574dws05 | nmdc:bsm-11-4c6er508 | nmdc:procsm-11-8ec7zx31 | nmdc:poolp-11-0ak13p40 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-4c6er508 | nmdc:procsm-11-8ec7zx31 | nmdc:procsm-11-nay11727 | nmdc:procsm-11-9pdkj890 |
| 6 | nmdc:procsm-11-kwaaah42 | nmdc:procsm-11-kfkbxp22 | nmdc:libprp-11-k5j44e20 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 | nmdc:extrp-11-fsv8td81 | nmdc:bsm-11-5228zz06 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-5228zz06 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 | nmdc:procsm-11-kfkbxp22 |
| 7 | nmdc:procsm-11-kwaaah42 | nmdc:procsm-11-kfkbxp22 | nmdc:libprp-11-k5j44e20 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 | nmdc:extrp-11-fsv8td81 | nmdc:bsm-11-1frj0t76 | nmdc:procsm-11-49bwy122 | nmdc:poolp-11-a1nnyd94 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-1frj0t76 | nmdc:procsm-11-49bwy122 | nmdc:procsm-11-kwaaah42 | nmdc:procsm-11-kfkbxp22 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 988 | nmdc:procsm-11-64ksxw87 | nmdc:procsm-11-bqe26091 | nmdc:libprp-11-4ebzbm49 | nmdc:procsm-11-f6kc8b10 | nmdc:procsm-11-64ksxw87 | nmdc:extrp-11-vg3vzm96 | nmdc:bsm-11-xqtg8327 | nmdc:procsm-11-f6kc8b10 | nmdc:poolp-11-ykrp9878 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-xqtg8327 | nmdc:procsm-11-f6kc8b10 | nmdc:procsm-11-64ksxw87 | nmdc:procsm-11-bqe26091 |
| 989 | nmdc:procsm-11-64ksxw87 | nmdc:procsm-11-bqe26091 | nmdc:libprp-11-4ebzbm49 | nmdc:procsm-11-f6kc8b10 | nmdc:procsm-11-64ksxw87 | nmdc:extrp-11-vg3vzm96 | nmdc:bsm-11-z5cmyh06 | nmdc:procsm-11-f6kc8b10 | nmdc:poolp-11-ykrp9878 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-z5cmyh06 | nmdc:procsm-11-f6kc8b10 | nmdc:procsm-11-64ksxw87 | nmdc:procsm-11-bqe26091 |
| 990 | nmdc:procsm-11-s8m02r47 | nmdc:procsm-11-68j9y310 | nmdc:libprp-11-rz4mr176 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:extrp-11-k86nz804 | nmdc:bsm-11-stjpwh75 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-stjpwh75 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:procsm-11-68j9y310 |
| 991 | nmdc:procsm-11-s8m02r47 | nmdc:procsm-11-68j9y310 | nmdc:libprp-11-rz4mr176 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:extrp-11-k86nz804 | nmdc:bsm-11-xngp2r34 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-xngp2r34 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:procsm-11-68j9y310 |
| 992 | nmdc:procsm-11-s8m02r47 | nmdc:procsm-11-68j9y310 | nmdc:libprp-11-rz4mr176 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:extrp-11-k86nz804 | nmdc:bsm-11-yehx2807 | nmdc:procsm-11-dkr9k079 | nmdc:poolp-11-57e94274 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-yehx2807 | nmdc:procsm-11-dkr9k079 | nmdc:procsm-11-s8m02r47 | nmdc:procsm-11-68j9y310 |
505 rows × 15 columns
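The merged frame has more rows (505) than there are processed samples (341) because several pooled biosamples share one library preparation output. A minimal pandas sketch of that many-to-one join follows. It assumes merge_df behaves roughly like an inner join on two differently named key columns, which is an assumption on our part and not the library's documented behavior; the identifiers are made up.

```python
import pandas as pd

# Made-up identifiers: two pooled biosamples share one library-prep output
left = pd.DataFrame({
    "biosample_id": ["bsm-a", "bsm-b", "bsm-c"],
    "lp_has_output": ["procsm-1", "procsm-1", "procsm-2"],
})
right = pd.DataFrame({"processed_sample3": ["procsm-1", "procsm-2"]})

merged = left.merge(
    right,
    left_on="lp_has_output",
    right_on="processed_sample3",
    how="inner",
    validate="many_to_one",  # raises MergeError if the right-hand keys are not unique
)
print(merged.shape)
```

The validate argument is optional, but it turns a silent row explosion into an explicit error when the key relationship is not what we expect.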
8 Get data_generation results from the processed sample identifiers¶
Using the third batch of processed sample identifiers, we create a DataGenerationSearch object to utilize the get_record_by_filter function. The filter matches records of type nmdc:NucleotideSequencing whose has_input field contains one of these identifiers. The id and has_input field names are changed to specify that they came from the DataGenerationSearch object.
from nmdc_api_utilities.data_generation_search import DataGenerationSearch
# create a DataGenerationSearch object
dg_client = DataGenerationSearch(env=ENV)
result_ids = get_id_list(process_set3, "processed_sample3")
chunked_list = split_list(result_ids)
data_generation_set = []
for chunk in chunked_list:
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:NucleotideSequencing", "has_input": {{"$in": {filter_list}}}}}'
    # get the results
    data_generation_set += dg_client.get_record_by_filter(filter=filter, fields="has_input,id", max_page_size=100, all_pages=True)
# clarify keys
for dg in data_generation_set:
    dg["dg_has_input"] = dg.pop("has_input")
    dg["dg_id"] = dg.pop("id")
# convert to data frame
dg_df = dp_client.convert_to_df(data_generation_set)
dg_df
| | dg_has_input | dg_id |
|---|---|---|
| 0 | [nmdc:procsm-11-01k85106] | nmdc:dgns-11-wxbab669 |
| 1 | [nmdc:procsm-11-0tkf2q02] | nmdc:omprc-11-2mw7h339 |
| 2 | [nmdc:procsm-11-12hw2r66] | nmdc:dgns-11-8amfa663 |
| 3 | [nmdc:procsm-11-1kf9fn36] | nmdc:dgns-11-see99855 |
| 4 | [nmdc:procsm-11-1v407908] | nmdc:dgns-11-syr2vn62 |
| ... | ... | ... |
| 321 | [nmdc:procsm-11-x27qy119] | nmdc:omprc-11-hv686d67 |
| 322 | [nmdc:procsm-11-xbva4x23] | nmdc:omprc-11-sz2d4412 |
| 323 | [nmdc:procsm-11-yqtwwk98] | nmdc:omprc-11-2g0n6985 |
| 324 | [nmdc:procsm-11-za57ra10] | nmdc:omprc-11-z00cqd70 |
| 325 | [nmdc:procsm-11-zqw3wv67] | nmdc:dgns-11-48ydpp30 |
326 rows × 2 columns
8.5 Merge the data_generation_set with the rest of the results¶
The results from querying data generation above are merged with the previously merged results (from step 7.5) using the dg_has_input field and the processed_sample3 field to match on.
merged_df7 = dp_client.merge_df(dg_df, merged_df6, "dg_has_input", "processed_sample3")
merged_df7
| | dg_has_input | dg_id | lp_has_input | lp_has_output | lp_id | extract_has_input | extract_has_output | extract_id | pooling_has_input | pooling_has_output | pooling_id | soil_horizon | geo_loc_name | biosample_id | processed_sample1 | processed_sample2 | processed_sample3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:procsm-11-01k85106 | nmdc:dgns-11-wxbab669 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 | nmdc:libprp-11-sqmba015 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:extrp-11-wewd5f59 | nmdc:bsm-11-9v0epr64 | nmdc:procsm-11-s9bpqf04 | nmdc:poolp-11-4ssz6p14 | O horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-9v0epr64 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 |
| 1 | nmdc:procsm-11-01k85106 | nmdc:dgns-11-wxbab669 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 | nmdc:libprp-11-sqmba015 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:extrp-11-wewd5f59 | nmdc:bsm-11-26khva17 | nmdc:procsm-11-s9bpqf04 | nmdc:poolp-11-4ssz6p14 | O horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-26khva17 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 |
| 2 | nmdc:procsm-11-01k85106 | nmdc:dgns-11-wxbab669 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 | nmdc:libprp-11-sqmba015 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:extrp-11-wewd5f59 | nmdc:bsm-11-pf1j8598 | nmdc:procsm-11-s9bpqf04 | nmdc:poolp-11-4ssz6p14 | O horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-pf1j8598 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 |
| 3 | nmdc:procsm-11-0tkf2q02 | nmdc:omprc-11-2mw7h339 | nmdc:procsm-11-2z8s8m53 | nmdc:procsm-11-0tkf2q02 | nmdc:libprp-11-ypebxj92 | nmdc:procsm-11-7ppgpt30 | nmdc:procsm-11-2z8s8m53 | nmdc:extrp-11-bspys917 | nmdc:bsm-11-geecaz29 | nmdc:procsm-11-7ppgpt30 | nmdc:poolp-11-t7y1gd11 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-geecaz29 | nmdc:procsm-11-7ppgpt30 | nmdc:procsm-11-2z8s8m53 | nmdc:procsm-11-0tkf2q02 |
| 4 | nmdc:procsm-11-0tkf2q02 | nmdc:omprc-11-2mw7h339 | nmdc:procsm-11-2z8s8m53 | nmdc:procsm-11-0tkf2q02 | nmdc:libprp-11-ypebxj92 | nmdc:procsm-11-7ppgpt30 | nmdc:procsm-11-2z8s8m53 | nmdc:extrp-11-bspys917 | nmdc:bsm-11-bnf1p650 | nmdc:procsm-11-7ppgpt30 | nmdc:poolp-11-t7y1gd11 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-bnf1p650 | nmdc:procsm-11-7ppgpt30 | nmdc:procsm-11-2z8s8m53 | nmdc:procsm-11-0tkf2q02 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 878 | nmdc:procsm-11-h57z5224 | nmdc:omprc-11-pv9a5n07 | nmdc:procsm-11-wzg4jt58 | nmdc:procsm-11-h57z5224 | nmdc:libprp-11-c9v99696 | nmdc:procsm-11-ez4jz447 | nmdc:procsm-11-wzg4jt58 | nmdc:extrp-11-3ksezw64 | nmdc:bsm-11-sj3j0662 | nmdc:procsm-11-ez4jz447 | nmdc:poolp-11-casd3207 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-sj3j0662 | nmdc:procsm-11-ez4jz447 | nmdc:procsm-11-wzg4jt58 | nmdc:procsm-11-h57z5224 |
| 879 | nmdc:procsm-11-h57z5224 | nmdc:omprc-11-pv9a5n07 | nmdc:procsm-11-wzg4jt58 | nmdc:procsm-11-h57z5224 | nmdc:libprp-11-c9v99696 | nmdc:procsm-11-ez4jz447 | nmdc:procsm-11-wzg4jt58 | nmdc:extrp-11-3ksezw64 | nmdc:bsm-11-y2x26s57 | nmdc:procsm-11-ez4jz447 | nmdc:poolp-11-casd3207 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-y2x26s57 | nmdc:procsm-11-ez4jz447 | nmdc:procsm-11-wzg4jt58 | nmdc:procsm-11-h57z5224 |
| 928 | nmdc:procsm-11-w7vnsk07 | nmdc:omprc-11-r9wnp831 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:libprp-11-8z0dcm53 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:extrp-11-6rjaph92 | nmdc:bsm-11-td9fm715 | nmdc:procsm-11-vv9mr730 | nmdc:poolp-11-zgzjjj76 | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-td9fm715 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 |
| 929 | nmdc:procsm-11-w7vnsk07 | nmdc:omprc-11-r9wnp831 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:libprp-11-8z0dcm53 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:extrp-11-6rjaph92 | nmdc:bsm-11-tdrast31 | nmdc:procsm-11-vv9mr730 | nmdc:poolp-11-zgzjjj76 | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-tdrast31 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 |
| 930 | nmdc:procsm-11-w7vnsk07 | nmdc:omprc-11-r9wnp831 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:libprp-11-8z0dcm53 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:extrp-11-6rjaph92 | nmdc:bsm-11-zaccf569 | nmdc:procsm-11-vv9mr730 | nmdc:poolp-11-zgzjjj76 | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-zaccf569 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 |
490 rows × 17 columns
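In dg_df the dg_has_input column holds lists, while in the merged output it holds single identifiers. One way to get from one to the other in plain pandas is to explode the list column before joining. The sketch below uses made-up identifiers and is only an illustration of that idea; we do not know whether merge_df does this internally.

```python
import pandas as pd

# Made-up records: the second data generation record has two inputs
dg = pd.DataFrame({
    "dg_has_input": [["procsm-1"], ["procsm-2", "procsm-3"]],
    "dg_id": ["dgns-x", "dgns-y"],
})
samples = pd.DataFrame({
    "processed_sample3": ["procsm-1", "procsm-2", "procsm-3"],
    "soil_horizon": ["O horizon", "M horizon", "M horizon"],
})

# One row per (record, input) pair, so the key column holds plain strings
dg_flat = dg.explode("dg_has_input", ignore_index=True)

merged = dg_flat.merge(
    samples, left_on="dg_has_input", right_on="processed_sample3", how="inner"
)
print(merged[["dg_id", "dg_has_input", "soil_horizon"]])
```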
9 Get the metagenome_annotation_set using the data generation identifiers¶
We create a WorkflowExecutionSearch object to query the workflow_execution_set collection. The filter matches the identifiers obtained from the data generation step against the was_informed_by field and restricts the type field to nmdc:MetagenomeAnnotation. Field names are once again renamed to specify the collection they came from.
from nmdc_api_utilities.workflow_execution_search import WorkflowExecutionSearch
# create a WorkflowExecutionSearch object
we_client = WorkflowExecutionSearch(env=ENV)
result_ids = get_id_list(data_generation_set, "dg_id")
chunked_list = split_list(result_ids)
meta_act_ann_set = []
for chunk in chunked_list:
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:MetagenomeAnnotation", "was_informed_by": {{"$in": {filter_list}}}}}'
    # get the results
    meta_act_ann_set += we_client.get_record_by_filter(filter=filter, fields="has_output,was_informed_by,id,version", max_page_size=100, all_pages=True)
# clarify names
for mga in meta_act_ann_set:
    mga["mga_id"] = mga.pop("id")
    mga["mga_version"] = mga.pop("version")
    mga["mga_was_informed_by"] = mga.pop("was_informed_by")
    mga["mga_has_output"] = mga.pop("has_output")
# convert to data frame
mga_df = dp_client.convert_to_df(meta_act_ann_set)
mga_df
| | mga_id | mga_version | mga_was_informed_by | mga_has_output |
|---|---|---|---|---|
| 0 | nmdc:wfmgan-11-05cdqw41.1 | v1.1.5 | [nmdc:dgns-11-ekte1238] | [nmdc:dobj-11-ndsyd761, nmdc:dobj-11-ss1k0e30,... |
| 1 | nmdc:wfmgan-11-0nwd1388.1 | v1.0.4 | [nmdc:omprc-11-2937gz63] | [nmdc:dobj-11-vpaxc956, nmdc:dobj-11-ad42v813,... |
| 2 | nmdc:wfmgan-11-0nwd1388.2 | v1.0.5 | [nmdc:omprc-11-2937gz63] | [nmdc:dobj-11-tdjwam92, nmdc:dobj-11-d55djd72,... |
| 3 | nmdc:wfmgan-11-0r142238.1 | v1.0.4 | [nmdc:omprc-11-th7v6711] | [nmdc:dobj-11-yfvgh831, nmdc:dobj-11-0ykk7811,... |
| 4 | nmdc:wfmgan-11-14gcar54.1 | v1.0.4 | [nmdc:omprc-11-px5df021] | [nmdc:dobj-11-dy2jsc18, nmdc:dobj-11-3amwd664,... |
| ... | ... | ... | ... | ... |
| 345 | nmdc:wfmgan-11-aqymgy87.1 | v1.0.4 | [nmdc:omprc-11-t0espr14] | [nmdc:dobj-11-nr50z869, nmdc:dobj-11-6g3ka432,... |
| 346 | nmdc:wfmgan-11-aqymgy87.2 | v1.1.0 | [nmdc:omprc-11-t0espr14] | [nmdc:dobj-11-6284dm29, nmdc:dobj-11-dhrq5t83,... |
| 347 | nmdc:wfmgan-11-jddrcn33.1 | v1.0.4 | [nmdc:omprc-11-t5v1jk63] | [nmdc:dobj-11-jpjc0875, nmdc:dobj-11-9qphny43,... |
| 348 | nmdc:wfmgan-11-0r142238.1 | v1.0.4 | [nmdc:omprc-11-th7v6711] | [nmdc:dobj-11-yfvgh831, nmdc:dobj-11-0ykk7811,... |
| 349 | nmdc:wfmgan-11-gyg96p67.1 | v1.0.4 | [nmdc:omprc-11-z00cqd70] | [nmdc:dobj-11-xrb37t71, nmdc:dobj-11-wfdvxf91,... |
350 rows × 4 columns
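The output above shows that one data generation record can have more than one annotation run (for example, the .1 and .2 runs informed by nmdc:omprc-11-2937gz63, with different versions). A quick way to see how common this is might look like the following sketch. It uses made-up rows that mirror the shape of mga_df; the variable names flat, runs_per_dg and reannotated are ours.

```python
import pandas as pd

# Made-up rows shaped like mga_df: was_informed_by is a list-valued column
mga = pd.DataFrame({
    "mga_id": ["wfmgan-a.1", "wfmgan-a.2", "wfmgan-b.1"],
    "mga_version": ["v1.0.4", "v1.0.5", "v1.0.4"],
    "mga_was_informed_by": [["omprc-a"], ["omprc-a"], ["omprc-b"]],
})

# Flatten the list column so it can be used as a grouping key
flat = mga.explode("mga_was_informed_by", ignore_index=True)

# Number of distinct annotation runs per data generation record
runs_per_dg = flat.groupby("mga_was_informed_by")["mga_id"].nunique()
reannotated = runs_per_dg[runs_per_dg > 1].index.tolist()

# How many annotation runs exist for each workflow version
version_counts = flat["mga_version"].value_counts()
print(reannotated)
print(version_counts)
```

Counts like these help when deciding which workflow version to keep later in the analysis.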
9.5 Merge metagenome activity results with the previously merged results¶
The metagenome activity results obtained above are merged with the previously combined results (from step 8.5), matching on the dg_id and mga_was_informed_by fields.
merged_df8 = dp_client.merge_df(merged_df7, mga_df, "dg_id", "mga_was_informed_by")
merged_df8
| | dg_has_input | dg_id | lp_has_input | lp_has_output | lp_id | extract_has_input | extract_has_output | extract_id | pooling_has_input | pooling_has_output | ... | soil_horizon | geo_loc_name | biosample_id | processed_sample1 | processed_sample2 | processed_sample3 | mga_id | mga_version | mga_was_informed_by | mga_has_output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:procsm-11-01k85106 | nmdc:dgns-11-wxbab669 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 | nmdc:libprp-11-sqmba015 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:extrp-11-wewd5f59 | nmdc:bsm-11-9v0epr64 | nmdc:procsm-11-s9bpqf04 | ... | O horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-9v0epr64 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 | nmdc:wfmgan-11-ne5fap84.1 | v1.1.5 | nmdc:dgns-11-wxbab669 | nmdc:dobj-11-5p1xav61 |
| 1 | nmdc:procsm-11-01k85106 | nmdc:dgns-11-wxbab669 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 | nmdc:libprp-11-sqmba015 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:extrp-11-wewd5f59 | nmdc:bsm-11-9v0epr64 | nmdc:procsm-11-s9bpqf04 | ... | O horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-9v0epr64 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 | nmdc:wfmgan-11-ne5fap84.1 | v1.1.5 | nmdc:dgns-11-wxbab669 | nmdc:dobj-11-kcg9tc71 |
| 2 | nmdc:procsm-11-01k85106 | nmdc:dgns-11-wxbab669 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 | nmdc:libprp-11-sqmba015 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:extrp-11-wewd5f59 | nmdc:bsm-11-9v0epr64 | nmdc:procsm-11-s9bpqf04 | ... | O horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-9v0epr64 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 | nmdc:wfmgan-11-ne5fap84.1 | v1.1.5 | nmdc:dgns-11-wxbab669 | nmdc:dobj-11-xtqnaj25 |
| 3 | nmdc:procsm-11-01k85106 | nmdc:dgns-11-wxbab669 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 | nmdc:libprp-11-sqmba015 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:extrp-11-wewd5f59 | nmdc:bsm-11-9v0epr64 | nmdc:procsm-11-s9bpqf04 | ... | O horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-9v0epr64 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 | nmdc:wfmgan-11-ne5fap84.1 | v1.1.5 | nmdc:dgns-11-wxbab669 | nmdc:dobj-11-v8k16016 |
| 4 | nmdc:procsm-11-01k85106 | nmdc:dgns-11-wxbab669 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 | nmdc:libprp-11-sqmba015 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:extrp-11-wewd5f59 | nmdc:bsm-11-9v0epr64 | nmdc:procsm-11-s9bpqf04 | ... | O horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-9v0epr64 | nmdc:procsm-11-s9bpqf04 | nmdc:procsm-11-m6gcps44 | nmdc:procsm-11-01k85106 | nmdc:wfmgan-11-ne5fap84.1 | v1.1.5 | nmdc:dgns-11-wxbab669 | nmdc:dobj-11-m2v1va20 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24153 | nmdc:procsm-11-w7vnsk07 | nmdc:omprc-11-r9wnp831 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:libprp-11-8z0dcm53 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:extrp-11-6rjaph92 | nmdc:bsm-11-zaccf569 | nmdc:procsm-11-vv9mr730 | ... | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-zaccf569 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:wfmgan-11-e3s88g45.1 | v1.0.4 | nmdc:omprc-11-r9wnp831 | nmdc:dobj-11-mhp3k924 |
| 24154 | nmdc:procsm-11-w7vnsk07 | nmdc:omprc-11-r9wnp831 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:libprp-11-8z0dcm53 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:extrp-11-6rjaph92 | nmdc:bsm-11-zaccf569 | nmdc:procsm-11-vv9mr730 | ... | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-zaccf569 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:wfmgan-11-e3s88g45.1 | v1.0.4 | nmdc:omprc-11-r9wnp831 | nmdc:dobj-11-retvmq75 |
| 24155 | nmdc:procsm-11-w7vnsk07 | nmdc:omprc-11-r9wnp831 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:libprp-11-8z0dcm53 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:extrp-11-6rjaph92 | nmdc:bsm-11-zaccf569 | nmdc:procsm-11-vv9mr730 | ... | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-zaccf569 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:wfmgan-11-e3s88g45.1 | v1.0.4 | nmdc:omprc-11-r9wnp831 | nmdc:dobj-11-aw5x5j27 |
| 24156 | nmdc:procsm-11-w7vnsk07 | nmdc:omprc-11-r9wnp831 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:libprp-11-8z0dcm53 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:extrp-11-6rjaph92 | nmdc:bsm-11-zaccf569 | nmdc:procsm-11-vv9mr730 | ... | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-zaccf569 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:wfmgan-11-e3s88g45.1 | v1.0.4 | nmdc:omprc-11-r9wnp831 | nmdc:dobj-11-1fq0xm75 |
| 24157 | nmdc:procsm-11-w7vnsk07 | nmdc:omprc-11-r9wnp831 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:libprp-11-8z0dcm53 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:extrp-11-6rjaph92 | nmdc:bsm-11-zaccf569 | nmdc:procsm-11-vv9mr730 | ... | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-zaccf569 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:wfmgan-11-e3s88g45.1 | v1.0.4 | nmdc:omprc-11-r9wnp831 | nmdc:dobj-11-6fb6r674 |
13009 rows × 21 columns
10 Get data objects from the metagenome activity result outputs¶
We create a DataObjectSearch object to utilize get_record_by_filter. The filter matches the mga_has_output identifiers against the id field of the data objects. Since this is the final query, the filter is slightly different from those in the earlier queries: in addition to the list of identifiers, we require that data_object_type has a value of Scaffold Lineage tsv, since these files contain the contig taxonomy results. Note that url is a new field in the results; it points to the TSV files we will need for the final analysis.
from nmdc_api_utilities.data_object_search import DataObjectSearch
# create a DataObjectSearch object
do_client = DataObjectSearch(env=ENV)
result_ids = get_id_list(meta_act_ann_set, "mga_has_output")
chunked_list = split_list(result_ids)
data_ob_set = []
for chunk in chunked_list:
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"type": "nmdc:DataObject", "data_object_type": "Scaffold Lineage tsv", "id": {{"$in": {filter_list}}}}}'
    # get the results
    data_ob_set += do_client.get_record_by_filter(filter=filter, fields="id,data_object_type,url", max_page_size=100, all_pages=True)
# clarify fields
for ob in data_ob_set:
    ob["data_ob_id"] = ob.pop("id")
# convert to data frame
do_df = dp_client.convert_to_df(data_ob_set)
do_df
| | data_object_type | url | data_ob_id |
|---|---|---|---|
| 0 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-1apwza69 |
| 1 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-8sttbc64 |
| 2 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-g7xsfb88 |
| 3 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-q3de6z81 |
| 4 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | nmdc:dobj-11-ykw2tv02 |
| ... | ... | ... | ... |
| 345 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-g87j5y46 |
| 346 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-nzmgqh66 |
| 347 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-ven5zv88 |
| 348 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-1apwza69 |
| 349 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-8mc4sk45 |
350 rows × 3 columns
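In the output above, rows 0 and 348 show the same data_ob_id (nmdc:dobj-11-1apwza69), so the result set appears to contain repeated records, possibly because the identifier list passed to the query contained repeats. A minimal sketch for counting and removing such repeats with pandas, using made-up rows and hypothetical URLs:

```python
import pandas as pd

# Made-up rows shaped like do_df, with one repeated data object
do = pd.DataFrame({
    "data_ob_id": ["dobj-1", "dobj-2", "dobj-1"],
    "url": [
        "https://example.org/1.tsv",
        "https://example.org/2.tsv",
        "https://example.org/1.tsv",
    ],
})

# Count rows whose data_ob_id has already appeared earlier in the frame
n_dupes = int(do.duplicated(subset="data_ob_id").sum())

# Keep the first occurrence of each data object
unique_do = do.drop_duplicates(subset="data_ob_id", ignore_index=True)
print(n_dupes, len(unique_do))
```

Removing repeats before downloading would also avoid fetching the same TSV file more than once.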
10.5 Merge one last time¶
For the final merge, we merge the data object results obtained above with the rest of our combined results, matching the data_ob_id key with the mga_has_output key.
merged_df9 = dp_client.merge_df(do_df, merged_df8, "data_ob_id", "mga_has_output")
merged_df9
| | data_object_type | url | data_ob_id | dg_has_input | dg_id | lp_has_input | lp_has_output | lp_id | extract_has_input | extract_has_output | ... | soil_horizon | geo_loc_name | biosample_id | processed_sample1 | processed_sample2 | processed_sample3 | mga_id | mga_version | mga_was_informed_by | mga_has_output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-1apwza69 | nmdc:procsm-11-ngwm7252 | nmdc:omprc-11-th7v6711 | nmdc:procsm-11-kjvc3e42 | nmdc:procsm-11-ngwm7252 | nmdc:libprp-11-nyvvd758 | nmdc:procsm-11-g4jv1f71 | nmdc:procsm-11-kjvc3e42 | ... | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-k0qje589 | nmdc:procsm-11-g4jv1f71 | nmdc:procsm-11-kjvc3e42 | nmdc:procsm-11-ngwm7252 | nmdc:wfmgan-11-0r142238.1 | v1.0.4 | nmdc:omprc-11-th7v6711 | nmdc:dobj-11-1apwza69 |
| 1 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-1apwza69 | nmdc:procsm-11-ngwm7252 | nmdc:omprc-11-th7v6711 | nmdc:procsm-11-kjvc3e42 | nmdc:procsm-11-ngwm7252 | nmdc:libprp-11-nyvvd758 | nmdc:procsm-11-g4jv1f71 | nmdc:procsm-11-kjvc3e42 | ... | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-7fkbrp42 | nmdc:procsm-11-g4jv1f71 | nmdc:procsm-11-kjvc3e42 | nmdc:procsm-11-ngwm7252 | nmdc:wfmgan-11-0r142238.1 | v1.0.4 | nmdc:omprc-11-th7v6711 | nmdc:dobj-11-1apwza69 |
| 2 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-1apwza69 | nmdc:procsm-11-ngwm7252 | nmdc:omprc-11-th7v6711 | nmdc:procsm-11-kjvc3e42 | nmdc:procsm-11-ngwm7252 | nmdc:libprp-11-nyvvd758 | nmdc:procsm-11-g4jv1f71 | nmdc:procsm-11-kjvc3e42 | ... | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-wpzp9996 | nmdc:procsm-11-g4jv1f71 | nmdc:procsm-11-kjvc3e42 | nmdc:procsm-11-ngwm7252 | nmdc:wfmgan-11-0r142238.1 | v1.0.4 | nmdc:omprc-11-th7v6711 | nmdc:dobj-11-1apwza69 |
| 3 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-8sttbc64 | nmdc:procsm-11-3s5m9a70 | nmdc:omprc-11-2937gz63 | nmdc:procsm-11-yz8wab55 | nmdc:procsm-11-3s5m9a70 | nmdc:libprp-11-a6yw0y51 | nmdc:procsm-11-w5zzjm84 | nmdc:procsm-11-yz8wab55 | ... | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-m6r77j31 | nmdc:procsm-11-w5zzjm84 | nmdc:procsm-11-yz8wab55 | nmdc:procsm-11-3s5m9a70 | nmdc:wfmgan-11-0nwd1388.1 | v1.0.4 | nmdc:omprc-11-2937gz63 | nmdc:dobj-11-8sttbc64 |
| 4 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-8sttbc64 | nmdc:procsm-11-3s5m9a70 | nmdc:omprc-11-2937gz63 | nmdc:procsm-11-yz8wab55 | nmdc:procsm-11-3s5m9a70 | nmdc:libprp-11-a6yw0y51 | nmdc:procsm-11-w5zzjm84 | nmdc:procsm-11-yz8wab55 | ... | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-1hkmx038 | nmdc:procsm-11-w5zzjm84 | nmdc:procsm-11-yz8wab55 | nmdc:procsm-11-3s5m9a70 | nmdc:wfmgan-11-0nwd1388.1 | v1.0.4 | nmdc:omprc-11-2937gz63 | nmdc:dobj-11-8sttbc64 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 986 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-a1v78a54 | nmdc:procsm-11-h57z5224 | nmdc:omprc-11-pv9a5n07 | nmdc:procsm-11-wzg4jt58 | nmdc:procsm-11-h57z5224 | nmdc:libprp-11-c9v99696 | nmdc:procsm-11-ez4jz447 | nmdc:procsm-11-wzg4jt58 | ... | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-sj3j0662 | nmdc:procsm-11-ez4jz447 | nmdc:procsm-11-wzg4jt58 | nmdc:procsm-11-h57z5224 | nmdc:wfmgan-11-vqpdns38.1 | v1.1.5 | nmdc:omprc-11-pv9a5n07 | nmdc:dobj-11-a1v78a54 |
| 987 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-a1v78a54 | nmdc:procsm-11-h57z5224 | nmdc:omprc-11-pv9a5n07 | nmdc:procsm-11-wzg4jt58 | nmdc:procsm-11-h57z5224 | nmdc:libprp-11-c9v99696 | nmdc:procsm-11-ez4jz447 | nmdc:procsm-11-wzg4jt58 | ... | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-y2x26s57 | nmdc:procsm-11-ez4jz447 | nmdc:procsm-11-wzg4jt58 | nmdc:procsm-11-h57z5224 | nmdc:wfmgan-11-vqpdns38.1 | v1.1.5 | nmdc:omprc-11-pv9a5n07 | nmdc:dobj-11-a1v78a54 |
| 991 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-vvxc2g29 | nmdc:procsm-11-w7vnsk07 | nmdc:omprc-11-r9wnp831 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:libprp-11-8z0dcm53 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | ... | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-td9fm715 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:wfmgan-11-e3s88g45.1 | v1.0.4 | nmdc:omprc-11-r9wnp831 | nmdc:dobj-11-vvxc2g29 |
| 992 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-vvxc2g29 | nmdc:procsm-11-w7vnsk07 | nmdc:omprc-11-r9wnp831 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:libprp-11-8z0dcm53 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | ... | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-tdrast31 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:wfmgan-11-e3s88g45.1 | v1.0.4 | nmdc:omprc-11-r9wnp831 | nmdc:dobj-11-vvxc2g29 |
| 993 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-vvxc2g29 | nmdc:procsm-11-w7vnsk07 | nmdc:omprc-11-r9wnp831 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:libprp-11-8z0dcm53 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | ... | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-zaccf569 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:wfmgan-11-e3s88g45.1 | v1.0.4 | nmdc:omprc-11-r9wnp831 | nmdc:dobj-11-vvxc2g29 |
545 rows × 24 columns
Clean up the combined results¶
Select a single workflow version of these results to use in this analysis.
versioned_df10 = merged_df9[merged_df9['mga_version'].str.contains('v1.0.4', na=False, regex=False)]
versioned_df10
| | data_object_type | url | data_ob_id | dg_has_input | dg_id | lp_has_input | lp_has_output | lp_id | extract_has_input | extract_has_output | ... | soil_horizon | geo_loc_name | biosample_id | processed_sample1 | processed_sample2 | processed_sample3 | mga_id | mga_version | mga_was_informed_by | mga_has_output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-1apwza69 | nmdc:procsm-11-ngwm7252 | nmdc:omprc-11-th7v6711 | nmdc:procsm-11-kjvc3e42 | nmdc:procsm-11-ngwm7252 | nmdc:libprp-11-nyvvd758 | nmdc:procsm-11-g4jv1f71 | nmdc:procsm-11-kjvc3e42 | ... | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-k0qje589 | nmdc:procsm-11-g4jv1f71 | nmdc:procsm-11-kjvc3e42 | nmdc:procsm-11-ngwm7252 | nmdc:wfmgan-11-0r142238.1 | v1.0.4 | nmdc:omprc-11-th7v6711 | nmdc:dobj-11-1apwza69 |
| 1 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-1apwza69 | nmdc:procsm-11-ngwm7252 | nmdc:omprc-11-th7v6711 | nmdc:procsm-11-kjvc3e42 | nmdc:procsm-11-ngwm7252 | nmdc:libprp-11-nyvvd758 | nmdc:procsm-11-g4jv1f71 | nmdc:procsm-11-kjvc3e42 | ... | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-7fkbrp42 | nmdc:procsm-11-g4jv1f71 | nmdc:procsm-11-kjvc3e42 | nmdc:procsm-11-ngwm7252 | nmdc:wfmgan-11-0r142238.1 | v1.0.4 | nmdc:omprc-11-th7v6711 | nmdc:dobj-11-1apwza69 |
| 2 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-1apwza69 | nmdc:procsm-11-ngwm7252 | nmdc:omprc-11-th7v6711 | nmdc:procsm-11-kjvc3e42 | nmdc:procsm-11-ngwm7252 | nmdc:libprp-11-nyvvd758 | nmdc:procsm-11-g4jv1f71 | nmdc:procsm-11-kjvc3e42 | ... | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-wpzp9996 | nmdc:procsm-11-g4jv1f71 | nmdc:procsm-11-kjvc3e42 | nmdc:procsm-11-ngwm7252 | nmdc:wfmgan-11-0r142238.1 | v1.0.4 | nmdc:omprc-11-th7v6711 | nmdc:dobj-11-1apwza69 |
| 3 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-8sttbc64 | nmdc:procsm-11-3s5m9a70 | nmdc:omprc-11-2937gz63 | nmdc:procsm-11-yz8wab55 | nmdc:procsm-11-3s5m9a70 | nmdc:libprp-11-a6yw0y51 | nmdc:procsm-11-w5zzjm84 | nmdc:procsm-11-yz8wab55 | ... | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-m6r77j31 | nmdc:procsm-11-w5zzjm84 | nmdc:procsm-11-yz8wab55 | nmdc:procsm-11-3s5m9a70 | nmdc:wfmgan-11-0nwd1388.1 | v1.0.4 | nmdc:omprc-11-2937gz63 | nmdc:dobj-11-8sttbc64 |
| 4 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-8sttbc64 | nmdc:procsm-11-3s5m9a70 | nmdc:omprc-11-2937gz63 | nmdc:procsm-11-yz8wab55 | nmdc:procsm-11-3s5m9a70 | nmdc:libprp-11-a6yw0y51 | nmdc:procsm-11-w5zzjm84 | nmdc:procsm-11-yz8wab55 | ... | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-1hkmx038 | nmdc:procsm-11-w5zzjm84 | nmdc:procsm-11-yz8wab55 | nmdc:procsm-11-3s5m9a70 | nmdc:wfmgan-11-0nwd1388.1 | v1.0.4 | nmdc:omprc-11-2937gz63 | nmdc:dobj-11-8sttbc64 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 908 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-wkznj445 | nmdc:procsm-11-yqtwwk98 | nmdc:omprc-11-2g0n6985 | nmdc:procsm-11-ze0gdq03 | nmdc:procsm-11-yqtwwk98 | nmdc:libprp-11-2mjwz291 | nmdc:procsm-11-86beb994 | nmdc:procsm-11-ze0gdq03 | ... | O horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-tpzf1a43 | nmdc:procsm-11-86beb994 | nmdc:procsm-11-ze0gdq03 | nmdc:procsm-11-yqtwwk98 | nmdc:wfmgan-11-wpvhfk84.1 | v1.0.4 | nmdc:omprc-11-2g0n6985 | nmdc:dobj-11-wkznj445 |
| 909 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-wkznj445 | nmdc:procsm-11-yqtwwk98 | nmdc:omprc-11-2g0n6985 | nmdc:procsm-11-ze0gdq03 | nmdc:procsm-11-yqtwwk98 | nmdc:libprp-11-2mjwz291 | nmdc:procsm-11-86beb994 | nmdc:procsm-11-ze0gdq03 | ... | O horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-njy52033 | nmdc:procsm-11-86beb994 | nmdc:procsm-11-ze0gdq03 | nmdc:procsm-11-yqtwwk98 | nmdc:wfmgan-11-wpvhfk84.1 | v1.0.4 | nmdc:omprc-11-2g0n6985 | nmdc:dobj-11-wkznj445 |
| 991 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-vvxc2g29 | nmdc:procsm-11-w7vnsk07 | nmdc:omprc-11-r9wnp831 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:libprp-11-8z0dcm53 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | ... | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-td9fm715 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:wfmgan-11-e3s88g45.1 | v1.0.4 | nmdc:omprc-11-r9wnp831 | nmdc:dobj-11-vvxc2g29 |
| 992 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-vvxc2g29 | nmdc:procsm-11-w7vnsk07 | nmdc:omprc-11-r9wnp831 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:libprp-11-8z0dcm53 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | ... | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-tdrast31 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:wfmgan-11-e3s88g45.1 | v1.0.4 | nmdc:omprc-11-r9wnp831 | nmdc:dobj-11-vvxc2g29 |
| 993 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:dobj-11-vvxc2g29 | nmdc:procsm-11-w7vnsk07 | nmdc:omprc-11-r9wnp831 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:libprp-11-8z0dcm53 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | ... | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-zaccf569 | nmdc:procsm-11-vv9mr730 | nmdc:procsm-11-vvgh6z28 | nmdc:procsm-11-w7vnsk07 | nmdc:wfmgan-11-e3s88g45.1 | v1.0.4 | nmdc:omprc-11-r9wnp831 | nmdc:dobj-11-vvxc2g29 |
241 rows × 24 columns
In the final step of retrieving and cleaning the data, we remove the "joining columns" that are not needed in the final analysis. This includes most of the identifier columns, including biosample_id: several biosamples can feed into the same processed result, so dropping biosample_id and removing duplicates avoids downloading the same TSV more than once. The only columns we retain are soil_horizon, geo_loc_name, data_ob_id, and the url of the TSV. The resulting final_df is displayed below.
column_list = versioned_df10.columns.tolist()
columns_to_keep = ["soil_horizon", "url", "geo_loc_name", "data_ob_id"]
columns_to_remove = list(set(column_list).difference(columns_to_keep))
# Drop unnecessary columns
df10_cleaned = versioned_df10.drop(columns=columns_to_remove)
# remove duplicates
df10_cleaned.drop_duplicates(keep="first", inplace=True)
# re-aggregate (implode) so each data object has one row, with its urls collected in a list
final_df = df10_cleaned.groupby(["soil_horizon", "geo_loc_name", "data_ob_id"]).agg({"url": list}).reset_index()
final_df
| | soil_horizon | geo_loc_name | data_ob_id | url |
|---|---|---|---|---|
| 0 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-1apwza69 | [https://data.microbiomedata.org/data/nmdc:omp... |
| 1 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-1nkd9110 | [https://data.microbiomedata.org/data/nmdc:omp... |
| 2 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-25ndys70 | [https://data.microbiomedata.org/data/nmdc:omp... |
| 3 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-5apjp861 | [https://data.microbiomedata.org/data/nmdc:omp... |
| 4 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-63rwtd28 | [https://data.microbiomedata.org/data/nmdc:omp... |
| ... | ... | ... | ... | ... |
| 79 | O horizon | USA: Colorado, Niwot Ridge | nmdc:dobj-11-wkznj445 | [https://data.microbiomedata.org/data/nmdc:omp... |
| 80 | O horizon | USA: Colorado, Rocky Mountains | nmdc:dobj-11-jp45gr33 | [https://data.microbiomedata.org/data/nmdc:omp... |
| 81 | O horizon | USA: Colorado, Rocky Mountains | nmdc:dobj-11-n7hhax28 | [https://data.microbiomedata.org/data/nmdc:omp... |
| 82 | O horizon | USA: Colorado, Rocky Mountains | nmdc:dobj-11-s7dphe48 | [https://data.microbiomedata.org/data/nmdc:omp... |
| 83 | O horizon | USA: Colorado, Rocky Mountains | nmdc:dobj-11-z73a1f14 | [https://data.microbiomedata.org/data/nmdc:omp... |
84 rows × 4 columns
Change the url column from a list to a string¶
The groupby above stores each url as a list. To request the TSVs, the url column needs to hold plain strings, so each list is joined into a single string.
final_df["url"] = final_df["url"].apply(lambda x: ', '.join(map(str, x)))
final_df
| | soil_horizon | geo_loc_name | data_ob_id | url |
|---|---|---|---|---|
| 0 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-1apwza69 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 1 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-1nkd9110 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 2 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-25ndys70 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 3 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-5apjp861 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 4 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-63rwtd28 | https://data.microbiomedata.org/data/nmdc:ompr... |
| ... | ... | ... | ... | ... |
| 79 | O horizon | USA: Colorado, Niwot Ridge | nmdc:dobj-11-wkznj445 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 80 | O horizon | USA: Colorado, Rocky Mountains | nmdc:dobj-11-jp45gr33 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 81 | O horizon | USA: Colorado, Rocky Mountains | nmdc:dobj-11-n7hhax28 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 82 | O horizon | USA: Colorado, Rocky Mountains | nmdc:dobj-11-s7dphe48 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 83 | O horizon | USA: Colorado, Rocky Mountains | nmdc:dobj-11-z73a1f14 | https://data.microbiomedata.org/data/nmdc:ompr... |
84 rows × 4 columns
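Joining with ', ' only yields a usable URL when each list holds exactly one entry. The following is a minimal sketch of a check for that, run on toy data. It assumes the same column names as final_df; the example URLs are placeholders, not real NMDC links.

```python
import pandas as pd

# Toy stand-in for final_df after the groupby; URLs are placeholders.
toy_df = pd.DataFrame({
    "data_ob_id": ["obj-1", "obj-2"],
    "url": [["https://example.org/a.tsv"], ["https://example.org/b.tsv"]],
})

# Number of URLs collected for each data object.
urls_per_object = toy_df["url"].str.len()

# If every list has a single URL, taking the first element is equivalent
# to the join and cannot produce a comma-separated string by accident.
if (urls_per_object == 1).all():
    toy_df["url"] = toy_df["url"].str[0]
else:
    # Otherwise give each URL its own row instead of concatenating them.
    toy_df = toy_df.explode("url", ignore_index=True)

print(toy_df)
```

Either branch leaves one plain string per row, which is the shape requests.get expects.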
Show how many results have M horizon vs. O horizon¶
The soil_horizon column can be counted using the value_counts() functionality. There are many more M horizon samples than O horizon.
# Show unique soil horizons:
soil_horizons = final_df['soil_horizon'].value_counts()
print(soil_horizons)
soil_horizon
M horizon    68
O horizon    16
Name: count, dtype: int64
Randomly select a subset of these datasets for which to pull information¶
# randomly select 15 data sets in each horizon
n = 15
# list the different horizon types
list_type = soil_horizons.index.tolist()
# for each horizon type, randomly sample n data sets and save them into a list
random_subset = []
for horizon_type in list_type:
    # each data object ID and horizon type
    sample_type = final_df[['data_ob_id', 'soil_horizon']].drop_duplicates()
    # filter to the current horizon type
    sample_type = sample_type[sample_type['soil_horizon'] == horizon_type]
    # randomly sample n data object IDs in the current horizon type
    sample_type = sample_type.sample(n=n, random_state=2)
    # save
    random_subset.append(sample_type)
# resave list as dataframe
random_subset = pd.concat(random_subset).reset_index(drop=True)
# remerge rest of the data for the sampled data sets
final_df = random_subset.merge(final_df, on=['data_ob_id', 'soil_horizon'], how="left")
final_df
| | data_ob_id | soil_horizon | geo_loc_name | url |
|---|---|---|---|---|
| 0 | nmdc:dobj-11-zk896n23 | M horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 1 | nmdc:dobj-11-c1dvnq21 | M horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 2 | nmdc:dobj-11-8mc4sk45 | M horizon | USA: Colorado, North Sterling | https://data.microbiomedata.org/data/nmdc:ompr... |
| 3 | nmdc:dobj-11-k1vrrb83 | M horizon | USA: Colorado, North Sterling | https://data.microbiomedata.org/data/nmdc:ompr... |
| 4 | nmdc:dobj-11-q96s7s63 | M horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 5 | nmdc:dobj-11-ngtdmd88 | M horizon | USA: Colorado, Rocky Mountains | https://data.microbiomedata.org/data/nmdc:ompr... |
| 6 | nmdc:dobj-11-r371w335 | M horizon | USA: Colorado, North Sterling | https://data.microbiomedata.org/data/nmdc:ompr... |
| 7 | nmdc:dobj-11-1apwza69 | M horizon | USA: Colorado, Central Plains Experimental Range | https://data.microbiomedata.org/data/nmdc:ompr... |
| 8 | nmdc:dobj-11-yebxx995 | M horizon | USA: Colorado, North Sterling | https://data.microbiomedata.org/data/nmdc:ompr... |
| 9 | nmdc:dobj-11-605rmv44 | M horizon | USA: Colorado, North Sterling | https://data.microbiomedata.org/data/nmdc:ompr... |
| 10 | nmdc:dobj-11-f1pg9z40 | M horizon | USA: Colorado, Central Plains Experimental Range | https://data.microbiomedata.org/data/nmdc:ompr... |
| 11 | nmdc:dobj-11-nzmgqh66 | M horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 12 | nmdc:dobj-11-xptdf353 | M horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 13 | nmdc:dobj-11-1nkd9110 | M horizon | USA: Colorado, Central Plains Experimental Range | https://data.microbiomedata.org/data/nmdc:ompr... |
| 14 | nmdc:dobj-11-1mwrks28 | M horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 15 | nmdc:dobj-11-jp45gr33 | O horizon | USA: Colorado, Rocky Mountains | https://data.microbiomedata.org/data/nmdc:ompr... |
| 16 | nmdc:dobj-11-ht5msd46 | O horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 17 | nmdc:dobj-11-nem7e417 | O horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 18 | nmdc:dobj-11-8ybd1f87 | O horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 19 | nmdc:dobj-11-v1d0fe44 | O horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 20 | nmdc:dobj-11-gargwe62 | O horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 21 | nmdc:dobj-11-bxdpkq28 | O horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 22 | nmdc:dobj-11-ven5zv88 | O horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 23 | nmdc:dobj-11-qzb4kt50 | O horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 24 | nmdc:dobj-11-s7dphe48 | O horizon | USA: Colorado, Rocky Mountains | https://data.microbiomedata.org/data/nmdc:ompr... |
| 25 | nmdc:dobj-11-g97pjb32 | O horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 26 | nmdc:dobj-11-wkznj445 | O horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 27 | nmdc:dobj-11-q3de6z81 | O horizon | USA: Colorado, Niwot Ridge | https://data.microbiomedata.org/data/nmdc:ompr... |
| 28 | nmdc:dobj-11-z73a1f14 | O horizon | USA: Colorado, Rocky Mountains | https://data.microbiomedata.org/data/nmdc:ompr... |
| 29 | nmdc:dobj-11-n7hhax28 | O horizon | USA: Colorado, Rocky Mountains | https://data.microbiomedata.org/data/nmdc:ompr... |
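The per-horizon sampling can also be written more compactly with pandas' grouped sample method. This is a sketch on toy data rather than a drop-in replacement: with the same random_state, the rows drawn are not guaranteed to match those chosen by the loop, and sample raises an error if a group has fewer than n rows.

```python
import pandas as pd

# Toy stand-in for final_df: 4 M horizon and 3 O horizon data objects.
toy_df = pd.DataFrame({
    "data_ob_id": [f"obj-{i}" for i in range(7)],
    "soil_horizon": ["M horizon"] * 4 + ["O horizon"] * 3,
})

n = 2  # data objects to draw per horizon

# Draw n rows from each horizon group; random_state makes the draw repeatable.
subset = (
    toy_df.groupby("soil_horizon")
    .sample(n=n, random_state=2)
    .reset_index(drop=True)
)

print(subset["soil_horizon"].value_counts())
```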
Example of what the TSV contig taxa file looks like¶
A snippet of the TSV file we need to iterate over to get the taxa abundance for the contigs is shown below. The third column is the initial count for each row's taxon, and every row has a value of 1.0. However, many taxa appear on more than one row, so the actual count for those taxa is greater than 1.0 (they simply appear as repeated rows, each with a count of 1.0). We take this into consideration when we calculate the relative abundance of each taxon.
tsv_ex_url = final_df.at[0, "url"]
response = requests.get(tsv_ex_url)
tsv_data = StringIO(response.text)
tsv_ex_df = pd.read_csv(tsv_data, delimiter="\t")
tsv_data.close()
# Give columns names
tsv_ex_df.columns = ["contig_id", "taxa", "initial_count"]
# sort by taxa
tsv_sorted = tsv_ex_df.sort_values(by="taxa")
# print first 10 rows
tsv_sorted[:10]
| | contig_id | taxa | initial_count |
|---|---|---|---|
| 2445 | nmdc:wfmgas-11-zkxntv88.1_scf_12576_c1 | Archaea;Candidatus Thermoplasmatota;Thermoplas... | 1.0 |
| 10714 | nmdc:wfmgas-11-zkxntv88.1_scf_21632_c1 | Archaea;Euryarchaeota;Halobacteria;Halobacteri... | 1.0 |
| 12319 | nmdc:wfmgas-11-zkxntv88.1_scf_23473_c1 | Archaea;Euryarchaeota;Halobacteria;Halobacteri... | 1.0 |
| 5857 | nmdc:wfmgas-11-zkxntv88.1_scf_16241_c1 | Archaea;Euryarchaeota;Halobacteria;Halobacteri... | 1.0 |
| 4703 | nmdc:wfmgas-11-zkxntv88.1_scf_14997_c1 | Archaea;Euryarchaeota;Halobacteria;Haloferacal... | 1.0 |
| 4311 | nmdc:wfmgas-11-zkxntv88.1_scf_14578_c1 | Archaea;Euryarchaeota;Halobacteria;Haloferacal... | 1.0 |
| 1698 | nmdc:wfmgas-11-zkxntv88.1_scf_11797_c1 | Archaea;Euryarchaeota;Halobacteria;Haloferacal... | 1.0 |
| 825 | nmdc:wfmgas-11-zkxntv88.1_scf_10891_c1 | Archaea;Euryarchaeota;Halobacteria;Haloferacal... | 1.0 |
| 12800 | nmdc:wfmgas-11-zkxntv88.1_scf_24029_c1 | Archaea;Euryarchaeota;Methanobacteria;Methanob... | 1.0 |
| 8056 | nmdc:wfmgas-11-zkxntv88.1_scf_18678_c1 | Archaea;Euryarchaeota;Methanobacteria;Methanob... | 1.0 |
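By default, pd.read_csv treats the first line of a file as the header. If the lineage TSVs have no header row (an assumption worth verifying against a downloaded file), the first contig would be consumed as column names before the rename. The sketch below shows the difference on inline toy text with made-up contig IDs.

```python
from io import StringIO

import pandas as pd

# Toy TSV text without a header row; contig IDs and lineages are made up.
toy_tsv = (
    "scf_1\tBacteria;Actinomycetota;Actinomycetes\t1.0\n"
    "scf_2\tArchaea;Euryarchaeota;Halobacteria\t1.0\n"
    "scf_3\tBacteria;Actinomycetota;Actinomycetes\t1.0\n"
)

columns = ["contig_id", "taxa", "initial_count"]

# Default behaviour: the first line becomes the header and is lost on rename.
default_df = pd.read_csv(StringIO(toy_tsv), delimiter="\t")
default_df.columns = columns

# header=None keeps every line as data; names= assigns the columns directly.
full_df = pd.read_csv(StringIO(toy_tsv), delimiter="\t", header=None, names=columns)

print(len(default_df), len(full_df))
```

Losing one contig out of thousands barely changes the relative abundances, but passing header=None and names= makes the intent explicit.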
Iterate through the TSVs to get the contig taxa information¶
Using the Python requests library and the StringIO class, the TSV urls can be iterated over to gather the taxa information. Each TSV is converted into a data frame and reshaped into the structure needed. The columns are given names and the taxa column is split into a list (instead of a string of items separated by a semicolon ;). The third element of the list is retrieved as the taxon label; in these lineage strings (for example, Archaea;Euryarchaeota;Halobacteria;...) the third element is the class, which is why labels such as Actinomycetes and Alphaproteobacteria appear in the results. The rows are grouped by the taxa column and the pandas size() method counts how many times each taxon occurs, which is then used to calculate the relative abundance of each taxon within each data object. After iterating through all of the TSVs, two final data frames (o_df and m_df) are created by concatenating the per-TSV data frames for each horizon.
Any errors in requesting the TSV urls are collected in a list of dictionaries, so we can either query them again or look into why they could not be collected.
Note this takes several hours to complete.
o_horizon = []
m_horizon = []
errors = []
iteration_counter = 0
for index, row in final_df.iterrows():
    iteration_counter += 1
    # print an update for every 50 iterations
    if iteration_counter % 50 == 0:
        print(f"Processed {iteration_counter} rows")
    url = row["url"]
    horizon = row["soil_horizon"]
    dataobj = row["data_ob_id"]
    geo_loc = row["geo_loc_name"]
    try:
        response = requests.get(url)
        # raise an exception on HTTP errors so failed downloads are recorded in errors
        response.raise_for_status()
        tsv_data = StringIO(response.text)
        tsv_df = pd.read_csv(tsv_data, delimiter="\t")
        tsv_data.close()
        # Give columns names
        tsv_df.columns = ["contig_id", "taxa", "initial_count"]
        # split taxa column into a list where a semicolon (;) is the delimiter
        tsv_df["taxa"] = tsv_df["taxa"].str.split(";")
        # Get only the third element of the list of taxa (the third rank of the lineage), add "Unknown"
        # if the lineage does not reach that rank, and use "Unknown" if the taxa value is empty.
        tsv_df["taxa"] = tsv_df["taxa"].apply(lambda x: str(x[2]) if isinstance(x, list) and len(x) >= 3
                                              else str(" ".join(x) + " Unknown") if isinstance(x, list) else "Unknown")
        # Get relative abundance for the tsv_df
        tsv_df = tsv_df.groupby("taxa").size().reset_index(name="count")
        total_count = tsv_df["count"].sum()
        tsv_df["relative_abundance"] = (tsv_df["count"] / total_count) * 100
        # Add geo location to data frame
        tsv_df["geo_loc_name"] = geo_loc
        # Add data object id and TSV url to data frame
        tsv_df["data_ob_id"] = dataobj
        tsv_df["tsv_url"] = url
        # append tsv_df to list depending on the soil horizon type
        if horizon == "O horizon":
            o_horizon.append(tsv_df)
        else:
            m_horizon.append(tsv_df)
    except Exception as e:
        print(f"An error occurred: {e}")
        errors.append({
            "data_ob_id": dataobj,
            "url": url,
            "horizon": horizon,
            "geo_loc_name": geo_loc
        })
        continue
# concatenate list of dfs
o_df = pd.concat(o_horizon)
m_df = pd.concat(m_horizon)
m_df
| | taxa | count | relative_abundance | geo_loc_name | data_ob_id | tsv_url |
|---|---|---|---|---|---|---|
| 0 | Acidimicrobiia | 69 | 0.335196 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-zk896n23 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 1 | Acidithiobacillia | 3 | 0.014574 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-zk896n23 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 2 | Actinomycetes | 12154 | 59.042992 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-zk896n23 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 3 | Agaricomycetes | 8 | 0.038863 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-zk896n23 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 4 | Alphaproteobacteria | 3912 | 19.004129 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-zk896n23 | https://data.microbiomedata.org/data/nmdc:ompr... |
| ... | ... | ... | ... | ... | ... | ... |
| 291 | unclassified Zoopagomycota | 20 | 0.000505 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-1mwrks28 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 292 | unclassified candidate division NC10 | 1819 | 0.045958 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-1mwrks28 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 293 | unclassified candidate division Zixibacteria | 240 | 0.006064 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-1mwrks28 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 294 | unclassified dsDNA viruses, no RNA stage | 10 | 0.000253 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-1mwrks28 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 295 | unclassified viruses | 2 | 0.000051 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-1mwrks28 | https://data.microbiomedata.org/data/nmdc:ompr... |
2582 rows × 6 columns
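The label extraction and relative abundance steps can be seen in isolation on a handful of lineages. This is a sketch, not the notebook's own code path: rank_label is a hypothetical helper name, the lineages are made up for illustration, and value_counts(normalize=True) stands in for the groupby and size() combination.

```python
import pandas as pd


def rank_label(lineage, position=2):
    """Return the lineage element at `position`, or a fallback label."""
    if not isinstance(lineage, str) or not lineage:
        return "Unknown"
    parts = lineage.split(";")
    if len(parts) > position:
        return parts[position]
    # Lineage stops early: keep what is known and flag the rest.
    return " ".join(parts) + " Unknown"


# Toy lineages (made up for illustration), one per contig.
lineages = pd.Series([
    "Bacteria;Actinomycetota;Actinomycetes;Mycobacteriales",
    "Bacteria;Actinomycetota;Actinomycetes",
    "Bacteria;Pseudomonadota;Alphaproteobacteria",
    "Bacteria",
    None,
])

labels = lineages.apply(rank_label)

# value_counts(normalize=True) gives fractions; multiply by 100 for percent.
relative_abundance = labels.value_counts(normalize=True) * 100
print(relative_abundance)
```

Two of the five toy contigs share the label Actinomycetes, so it accounts for 40% of this toy data object.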
Look into any errors that occurred from the TSV requests¶
Any TSVs that could not be requested were added to the errors list. An empty list means every request succeeded.
print(errors)
[]
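If errors is not empty, the failed entries can be retried without repeating the whole loop. The sketch below assumes each entry keeps the url key used above. retry_failed and its fetch parameter are hypothetical names introduced here, and a stand-in fetch function replaces requests.get so the example runs offline.

```python
def retry_failed(errors, fetch, max_attempts=3):
    """Retry each failed entry; return (recovered, still_failing)."""
    recovered, still_failing = [], []
    for entry in errors:
        for _ in range(max_attempts):
            try:
                text = fetch(entry["url"])
            except Exception:
                continue  # try this entry again
            recovered.append({**entry, "text": text})
            break
        else:
            # No attempt succeeded for this entry.
            still_failing.append(entry)
    return recovered, still_failing


# Offline stand-in for a download: the first URL fails once, the second always fails.
calls = {}


def fake_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if url.endswith("flaky.tsv") and calls[url] > 1:
        return "contig\tlineage\t1.0\n"
    raise ConnectionError(f"could not fetch {url}")


toy_errors = [
    {"data_ob_id": "obj-1", "url": "https://example.org/flaky.tsv"},
    {"data_ob_id": "obj-2", "url": "https://example.org/missing.tsv"},
]
recovered, still_failing = retry_failed(toy_errors, fake_fetch)
print(len(recovered), len(still_failing))
```

In the notebook, fetch could be a small function that calls requests.get, checks the response with raise_for_status(), and returns response.text.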
Define a function to calculate abundance¶
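A function is defined that takes a data frame as input and calculates the average relative abundance of each taxon across data objects. A taxon that is missing from a data object is counted as 0 for that data object.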
def taxa_abundance(df):
    df = df.drop_duplicates(subset=['data_ob_id', 'taxa'])
    # pivot the table to find all combos of data object and taxa - set NAs to 0 for relative abundance
    wide_df = df.pivot(index="data_ob_id", columns="taxa", values="relative_abundance")
    wide_df = wide_df.fillna(0)
    wide_df.reset_index(inplace=True)
    # convert wide_df back with relative_abundances set to 0 for samples that were missing taxa
    melted_df = pd.melt(wide_df, id_vars="data_ob_id", var_name="taxa", value_name="relative_abundance")
    # calculate abundance and add column to data frame
    final_df = melted_df.groupby("taxa")["relative_abundance"].mean().reset_index(name="avg_relative_abundance")
    return final_df
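A small worked example shows why the zero-filling step matters. It uses made-up values for two toy data objects and a compact pivot-then-mean equivalent of the function, not the function itself.

```python
import pandas as pd

# Two toy data objects: taxon B is absent from obj-2.
toy = pd.DataFrame({
    "data_ob_id": ["obj-1", "obj-1", "obj-2"],
    "taxa": ["A", "B", "A"],
    "relative_abundance": [60.0, 40.0, 100.0],
})

# Same idea as taxa_abundance: pivot, fill missing taxa with 0, then average.
wide = toy.pivot(index="data_ob_id", columns="taxa", values="relative_abundance").fillna(0)
avg_with_zeros = wide.mean()

# Averaging only the rows that exist ignores the data object where B is missing.
avg_without_zeros = toy.groupby("taxa")["relative_abundance"].mean()

print(avg_with_zeros)     # A: 80.0, B: 20.0
print(avg_without_zeros)  # A: 80.0, B: 40.0
```

With zero-filling the averages still sum to 100; without it, taxa found in only a few data objects would be overstated.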
Calculate the abundance of the O and M horizon data frames¶
Using the function defined above, the m_df and o_df data frames returned from iterating over the TSV files are used as input into the function, where the average relative abundance calculations are returned as data frames. We then concatenate the two data frames together, creating a new column for soil_horizon, where the value is either O or M, depending on which data frame it originally came from.
# calculate the average relative abundance for each soil horizon type
m_final = taxa_abundance(m_df)
o_final = taxa_abundance(o_df)
# combine data frames
o_final["soil_horizon"] = "O"
m_final["soil_horizon"] = "M"
abundance_df = pd.concat([o_final, m_final])
abundance_df
| | taxa | avg_relative_abundance | soil_horizon |
|---|---|---|---|
| 0 | Acidimicrobiia | 0.324138 | O |
| 1 | Acidithiobacillia | 0.031115 | O |
| 2 | Aconoidasida | 0.002462 | O |
| 3 | Actinomycetes | 24.049704 | O |
| 4 | Actinopteri | 0.001235 | O |
| ... | ... | ... | ... |
| 300 | unclassified Zoopagomycota | 0.000070 | M |
| 301 | unclassified candidate division NC10 | 0.106987 | M |
| 302 | unclassified candidate division Zixibacteria | 0.012937 | M |
| 303 | unclassified dsDNA viruses, no RNA stage | 0.000034 | M |
| 304 | unclassified viruses | 0.000952 | M |
604 rows × 3 columns
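With several hundred taxa per horizon, a stacked bar chart legend becomes hard to read. One option is to keep the n most abundant taxa in each horizon and sum the remainder into a single "Other" row. The sketch below does this on toy data; collapse_to_top_n is a hypothetical helper that is not part of the notebook, and the values are made up.

```python
import pandas as pd


def collapse_to_top_n(df, n, value_col="avg_relative_abundance"):
    """Keep the n largest taxa per soil horizon and sum the rest as 'Other'."""
    collapsed = []
    for horizon, group in df.groupby("soil_horizon"):
        top = group.nlargest(n, value_col)
        rest_total = group[value_col].sum() - top[value_col].sum()
        other = pd.DataFrame({
            "taxa": ["Other"],
            value_col: [rest_total],
            "soil_horizon": [horizon],
        })
        collapsed.append(pd.concat([top, other], ignore_index=True))
    return pd.concat(collapsed, ignore_index=True)


# Toy stand-in for abundance_df (values are made up).
toy = pd.DataFrame({
    "taxa": ["A", "B", "C", "A", "B", "C"],
    "avg_relative_abundance": [50.0, 30.0, 20.0, 10.0, 70.0, 20.0],
    "soil_horizon": ["O", "O", "O", "M", "M", "M"],
})

collapsed_toy = collapse_to_top_n(toy, n=2)
print(collapsed_toy)
```

Because the remainder is summed rather than dropped, each horizon's bar keeps the same total height.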
Plot the taxa abundance of M vs. O horizon soil samples¶
Using the plotly library, the percent abundance of the taxa is plotted as a bar chart - each bar representing the soil horizon and the colors representing the taxa.
# Plot the taxa abundance of each soil type
fig = px.bar(abundance_df, x="soil_horizon", y="avg_relative_abundance", color="taxa",
             title="% Abundance of class-level taxa in M and O horizon soil samples in Colorado",
             labels={"soil_horizon": "Soil Horizon", "avg_relative_abundance": "% Abundance"})
fig.update_layout(height=600)
fig.show()
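The same comparison can be read as a table by placing the two horizons side by side and ranking taxa by how much their average abundance differs. The sketch below uses made-up values and assumes the abundance_df column names; O_minus_M is a column name introduced here.

```python
import pandas as pd

# Toy stand-in for abundance_df (values are made up).
toy = pd.DataFrame({
    "taxa": ["A", "B", "A", "C"],
    "avg_relative_abundance": [60.0, 40.0, 30.0, 70.0],
    "soil_horizon": ["O", "O", "M", "M"],
})

# One row per taxon, one column per horizon; absent taxa count as 0.
wide = toy.pivot(index="taxa", columns="soil_horizon", values="avg_relative_abundance").fillna(0)

# Positive values: more abundant in the O horizon; negative: more in M.
wide["O_minus_M"] = wide["O"] - wide["M"]

# Sort by the size of the difference, largest first.
ranked = wide.reindex(wide["O_minus_M"].abs().sort_values(ascending=False).index)
print(ranked)
```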
Write a function to calculate the abundance per location¶
This function takes the m_df and o_df outputs from the TSV iteration and calculates the average % abundance of each taxon for each geo_loc_name.
def loc_abund(df):
    df = df.drop_duplicates(subset=['data_ob_id', 'taxa'])
    # pivot the table to find all combos of data object and taxa - set NAs to 0 for relative abundance
    wide_df = df.pivot(index="data_ob_id", columns="taxa", values="relative_abundance")
    wide_df = wide_df.fillna(0)
    wide_df.reset_index(inplace=True)
    # Add geo_loc_name column to wide_df, using one row per data object so that
    # each data object is weighted equally in the location averages
    locations = df[['data_ob_id', 'geo_loc_name']].drop_duplicates()
    wide_df = pd.merge(wide_df, locations, on='data_ob_id', how='left')
    # convert wide_df back with relative_abundances set to 0 for samples that were missing taxa
    melted_df = pd.melt(wide_df, id_vars=["data_ob_id", "geo_loc_name"], var_name="taxa", value_name="relative_abundance")
    final_df = melted_df.groupby(["geo_loc_name", "taxa"])["relative_abundance"].mean().reset_index(name="avg_relative_abundance")
    return final_df
Calculate the abundance of the location data frames¶
Using the function defined above, the m_df and o_df data frames returned from iterating over the TSV files are used as input, and the average abundance of each taxon is returned for each geo_loc_name. We then concatenate the two data frames together, creating a new column for soil_horizon, where the value is either O or M, depending on which data frame it originally came from. Finally, the region name is extracted from geo_loc_name into a location column.
# calculate abundance per location for each soil horizon type
m_loc = loc_abund(m_df)
o_loc = loc_abund(o_df)
# combine data frames
o_loc["soil_horizon"] = "O"
m_loc["soil_horizon"] = "M"
loc_abund_df = pd.concat([o_loc, m_loc])
# Extract only region names from geo_loc_name
loc_abund_df["location"] = loc_abund_df["geo_loc_name"].str.extract(r'Colorado, (.*)')
loc_abund_df
| | geo_loc_name | taxa | avg_relative_abundance | soil_horizon | location |
|---|---|---|---|---|---|
| 0 | USA: Colorado, Niwot Ridge | Acidimicrobiia | 0.295080 | O | Niwot Ridge |
| 1 | USA: Colorado, Niwot Ridge | Acidithiobacillia | 0.035143 | O | Niwot Ridge |
| 2 | USA: Colorado, Niwot Ridge | Aconoidasida | 0.003006 | O | Niwot Ridge |
| 3 | USA: Colorado, Niwot Ridge | Actinomycetes | 22.616634 | O | Niwot Ridge |
| 4 | USA: Colorado, Niwot Ridge | Actinopteri | 0.001069 | O | Niwot Ridge |
| ... | ... | ... | ... | ... | ... |
| 1215 | USA: Colorado, Rocky Mountains | unclassified Zoopagomycota | 0.000000 | M | Rocky Mountains |
| 1216 | USA: Colorado, Rocky Mountains | unclassified candidate division NC10 | 0.065346 | M | Rocky Mountains |
| 1217 | USA: Colorado, Rocky Mountains | unclassified candidate division Zixibacteria | 0.008262 | M | Rocky Mountains |
| 1218 | USA: Colorado, Rocky Mountains | unclassified dsDNA viruses, no RNA stage | 0.000000 | M | Rocky Mountains |
| 1219 | USA: Colorado, Rocky Mountains | unclassified viruses | 0.000751 | M | Rocky Mountains |
1818 rows × 5 columns
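Each data object's relative abundances sum to 100, and the zero-filled average preserves that. The averages for each location and horizon should therefore also sum to 100, which gives a quick sanity check before plotting. The sketch below uses made-up values with the same column names as loc_abund_df.

```python
import pandas as pd

# Toy stand-in for loc_abund_df (values are made up).
toy = pd.DataFrame({
    "location": ["Site 1", "Site 1", "Site 1", "Site 1"],
    "soil_horizon": ["O", "O", "M", "M"],
    "taxa": ["A", "B", "A", "B"],
    "avg_relative_abundance": [75.0, 25.0, 40.0, 60.0],
})

# Total height of each stacked bar.
bar_totals = toy.groupby(["location", "soil_horizon"])["avg_relative_abundance"].sum()

# Flag bars whose total is not 100 within a small tolerance.
off_bars = bar_totals[(bar_totals - 100).abs() > 1e-6]
print(bar_totals)
print("bars not summing to 100:", len(off_bars))
```

A bar that falls short of or exceeds 100 would suggest that data objects were weighted unevenly or that rows were lost during reshaping.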
Plot the taxa abundance of M and O horizon soil samples for each location¶
Using the plotly library, the percent abundance of the taxa is plotted as a bar chart for each geo location and faceted by soil horizon.
geo_fig = px.bar(loc_abund_df, x="soil_horizon", y="avg_relative_abundance", color="taxa",
                 facet_col="location",
                 facet_col_spacing=0.1,
                 title="% Abundance of class-level taxa in M and O horizon samples for each Colorado location",
                 labels={"soil_horizon": "Soil Horizon", "avg_relative_abundance": "% Abundance"},
                 height=600)
# update figure to remove "location=" from facet column labels
geo_fig.for_each_annotation(lambda a: a.update(text=a.text.replace("location=", "")))
# show figure
geo_fig.show()