How does the taxonomic distribution of contigs differ by soil layer (mineral vs organic) in Colorado?¶
This notebook uses the nmdc_api_utilities package (as of March 2025) to explore how the taxononomic distribution of contigs differ by the mineral and organic soil layers in Colorado. It involves using nmdc_api_utilites objects to make NMDC API requests to reach the scaffold lineage TSV data objects in order to analyze the taxanomic distribution. Iterating through the TSV files includes 350+ API calls to get the necessary taxonomic counts and is time consuming.
import requests
import pandas as pd
from io import StringIO
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import nmdc_api_utilities
1. Get all biosamples where soil_horizon exists and the geo_loc_name has "Colorado" in the name¶
The first step in answering how the taxonomic distribution of contigs differ by soil layer is to get a list of all the biosamples that have metadata for soil_horizon and a string matching "Colorado, Rocky Moutains" for the geo_loc_name.
Using the Python package 'nmdc_api_utilities', we can use the get_record_by_filter function to do this. We first need create a BiosampleSearch object to search across the "biosample_set" collections. More information regarding the nmdc_api_utilities package can be found here. We then create a mongo-like filter of {"soil_horizon":{"$exists": true}, "geo_loc_name.has_raw_value": {"$regex": "Colorado"}}, a maximum page size of 100, and specifying that we want three fields returned id, soil_horizon, and geo_loc_name. Note that id is returned no matter what. Since we will be joining the results of multiple API requests with a field of id for different collections, we can change the name of the id key to be more explicit - calling it biosample_id instead. Finally, we convert the biosample results to a dataframe called biosample_df. Note that about 517 biosamples are returned.
from nmdc_api_utilities.biosample_search import BiosampleSearch
from nmdc_api_utilities.data_processing import DataProcessing
# Create a BiosampleSearch object
bs_client = BiosampleSearch(env=ENV)
# create a DataProcessing object
dp_client = DataProcessing()
# define the filter
filter = '{"soil_horizon":{"$exists": true}, "geo_loc_name.has_raw_value": {"$regex": "Colorado"}}'
# get the results
bs_results = bs_client.get_record_by_filter(filter=filter, fields="id,soil_horizon,geo_loc_name", max_page_size=100, all_pages=True)
# clarify names
for biosample in bs_results:
biosample["biosample_id"] = biosample.pop("id")
# convert to df
biosample_df = dp_client.convert_to_df(bs_results)
# Adjust geo_loc_name to not be a dictionary
biosample_df["geo_loc_name"] = biosample_df["geo_loc_name"].apply(lambda x: x.get("has_raw_value"))
biosample_df
| soil_horizon | geo_loc_name | biosample_id | |
|---|---|---|---|
| 0 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-00m15h97 |
| 1 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-06ta8e31 |
| 2 | O horizon | USA: Colorado, Rocky Mountains | nmdc:bsm-11-06tgpb52 |
| 3 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-0asn5d63 |
| 4 | M horizon | USA: Colorado, North Sterling | nmdc:bsm-11-0djp2e45 |
| ... | ... | ... | ... |
| 513 | M horizon | USA: Colorado, Central Plains Experimental Range | nmdc:bsm-11-zhrzwh12 |
| 514 | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-zhzner35 |
| 515 | O horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-zjsrkd21 |
| 516 | O horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-zk6h3328 |
| 517 | M horizon | USA: Colorado, Niwot Ridge | nmdc:bsm-11-znvc3c66 |
518 rows × 3 columns
2. Get all DataObjects that are related to the biosamples found in step 1.¶
We'll do this using the get_linked_instances_and_associate_ids function from the nmdc_api_utilities package.
# Now get all DataObjects linked to these biosamples
biosample_dataobject_dictionary = bs_client.get_linked_instances_and_associate_ids(
ids=biosample_df["biosample_id"].tolist(), # list of biosample ids
types =["nmdc:DataObject"] # specify we want DataObjects only
)
# The result is a dictionary with biosample ids as keys and list of linked DataObjects as values
print(f"Retrieved DataObjects for {len(biosample_dataobject_dictionary)} biosamples.")
Retrieved DataObjects for 506 biosamples.
3. Get records for all DataObjects found in step 2.¶
To do this, we'll collect all the DataObject ids found in step 2 and use the get_batch_records function from the nmdc_api_utilities package. This function allows us to get records in batches, which is useful when dealing with a large number of ids. We'll specify that we want to retrieve the id, data_object_type, and url fields for each DataObject. The results will be stored in a dataframe called dataobject_df.
# Gather all DataObject ids into a single list (from the dictionary of lists), then get their records
dojs = [item for sublist in biosample_dataobject_dictionary.values() for item in sublist]
# Now get records for all DataObjects records and convert to dataframe
from nmdc_api_utilities.data_object_search import DataObjectSearch
do_client = DataObjectSearch()
do_results = do_client.get_batch_records(
id_list=dojs,
search_field="id",
fields="id,data_object_type,url"
)
data_object_df = dp_client.convert_to_df(do_results)
# Rename id column to data_object_id for clarity
data_object_df = data_object_df.rename(columns={"id": "data_object_id"})
# Finally, filter these DataObjects to only include those with `data_object_type` of "Scaffold Lineage tsv"
filtered_data_object_df = data_object_df[data_object_df["data_object_type"] == "Scaffold Lineage tsv"]
filtered_data_object_df.head(10)
| data_object_id | data_object_type | url | |
|---|---|---|---|
| 5 | nmdc:dobj-11-1akcce43 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... |
| 54 | nmdc:dobj-11-f2bmdw54 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... |
| 94 | nmdc:dobj-11-xt9amn82 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... |
| 101 | nmdc:dobj-11-02mx2423 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... |
| 148 | nmdc:dobj-11-f1pg9z40 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... |
| 165 | nmdc:dobj-11-k1vrrb83 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... |
| 200 | nmdc:dobj-11-05e7k522 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... |
| 312 | nmdc:dobj-11-605rmv44 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... |
| 355 | nmdc:dobj-11-k6xn1112 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... |
| 376 | nmdc:dobj-11-r9jyj090 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... |
4. Associate DataObjects back to Biosamples¶
We'll use the id field from the biosample_df dataframe (which corresponds to ids of Biosample records) and the id field from the dataobject_df dataframe to associate DataObjects back to their respective Biosamples. This will allow us to link the taxonomic data in the DataObjects to the soil horizon information in the Biosamples. The biosample_dataobject_dictionary created in step 2 will be used to map the relationships.
# First define a function to get biosample_id from data_object_id
def get_biosample_id(data_object_id, biosample_dataobject_dict):
for biosample_id, data_object_list in biosample_dataobject_dict.items():
if data_object_id in data_object_list:
return biosample_id
return None
# Apply the function to add biosample_id column
filtered_data_object_df["biosample_id"] = filtered_data_object_df["data_object_id"].apply(lambda x: get_biosample_id(x, biosample_dataobject_dictionary))
# Merge with biosample_df to get soil_horizon and geo_loc_name
merged_df = pd.merge(filtered_data_object_df, biosample_df, on="biosample_id", how="left")
merged_df.head(10)
/tmp/ipykernel_2260/1847364105.py:9: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| data_object_id | data_object_type | url | biosample_id | soil_horizon | geo_loc_name | |
|---|---|---|---|---|---|---|
| 0 | nmdc:dobj-11-1akcce43 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-11-acqtab53 | M horizon | USA: Colorado, Niwot Ridge |
| 1 | nmdc:dobj-11-f2bmdw54 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-11-4nb6s749 | M horizon | USA: Colorado, Central Plains Experimental Range |
| 2 | nmdc:dobj-11-xt9amn82 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-11-0gmd9f09 | M horizon | USA: Colorado, Niwot Ridge |
| 3 | nmdc:dobj-11-02mx2423 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-11-13rvbb54 | M horizon | USA: Colorado, Central Plains Experimental Range |
| 4 | nmdc:dobj-11-f1pg9z40 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-11-6vhkk902 | M horizon | USA: Colorado, Central Plains Experimental Range |
| 5 | nmdc:dobj-11-k1vrrb83 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-11-1fsf9z28 | M horizon | USA: Colorado, North Sterling |
| 6 | nmdc:dobj-11-05e7k522 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | nmdc:bsm-11-6fy8p792 | M horizon | USA: Colorado, Central Plains Experimental Range |
| 7 | nmdc:dobj-11-605rmv44 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-11-dtngbp38 | M horizon | USA: Colorado, North Sterling |
| 8 | nmdc:dobj-11-k6xn1112 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | nmdc:bsm-11-4wtjyp10 | O horizon | USA: Colorado, Rocky Mountains |
| 9 | nmdc:dobj-11-r9jyj090 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | nmdc:bsm-11-bqwhr666 | O horizon | USA: Colorado, Niwot Ridge |
5. Dereplicate the DataFrame based on data_object_id to ensure uniqueness¶
Prepare a final DataFrame that contains unique entries for each DataObject, ensuring that there are no duplicate records. This step is crucial for accurate analysis and reporting, as it prevents any potential bias or errors that could arise from counting the same DataObject multiple times. This scenario (multiple DataObjects associated with a sinlge Biosample) can result if several biosamples are pooled together for a single metagenome assembly.
# Prepare a final DataFrame that contains unique entries for each DataObject, ensuring that there are no duplicate records. This step is crucial for accurate analysis and reporting, as it prevents any potential bias or errors that could arise from counting the same DataObject multiple times which can result if several biosamples are pooled together in a single metagenome assembly and share the same DataObject.
final_df = merged_df[["data_object_id", "data_object_type", "url","soil_horizon", "geo_loc_name"]].drop_duplicates()
final_df.head(10)
| data_object_id | data_object_type | url | soil_horizon | geo_loc_name | |
|---|---|---|---|---|---|
| 0 | nmdc:dobj-11-1akcce43 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | M horizon | USA: Colorado, Niwot Ridge |
| 1 | nmdc:dobj-11-f2bmdw54 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | M horizon | USA: Colorado, Central Plains Experimental Range |
| 2 | nmdc:dobj-11-xt9amn82 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | M horizon | USA: Colorado, Niwot Ridge |
| 3 | nmdc:dobj-11-02mx2423 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | M horizon | USA: Colorado, Central Plains Experimental Range |
| 4 | nmdc:dobj-11-f1pg9z40 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | M horizon | USA: Colorado, Central Plains Experimental Range |
| 5 | nmdc:dobj-11-k1vrrb83 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | M horizon | USA: Colorado, North Sterling |
| 6 | nmdc:dobj-11-05e7k522 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | M horizon | USA: Colorado, Central Plains Experimental Range |
| 7 | nmdc:dobj-11-605rmv44 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | M horizon | USA: Colorado, North Sterling |
| 8 | nmdc:dobj-11-k6xn1112 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | O horizon | USA: Colorado, Rocky Mountains |
| 9 | nmdc:dobj-11-r9jyj090 | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | O horizon | USA: Colorado, Niwot Ridge |
Show how many results have M horizon vs. O horizon¶
The soil_horizon column can be counted using the value_counts() functionality. There are many more M horizon samples than O horizon.
# Show unique soil horizons:
soil_horizons = final_df['soil_horizon'].value_counts()
print(soil_horizons)
soil_horizon M horizon 177 O horizon 47 Name: count, dtype: int64
Randomly select a subset of these datasets for which to pull information¶
# randomly select 15 data sets in each horizon
n = 15
#list the different types
list_type=soil_horizons.index.tolist()
#for each type, randomly horizon n data sets and save them into list
random_subset=[]
for type in list_type:
#each data object ID and horizon type
sample_type=final_df[['data_object_id','soil_horizon']].drop_duplicates()
#filter to current horizon type
sample_type=sample_type[sample_type['soil_horizon']==type]
#randomly horizon n data object IDs in current horizon type
sample_type=sample_type.sample(n=n, random_state=2)
#save
random_subset.append(sample_type)
#resave list as dataframe
random_subset=pd.concat(random_subset).reset_index(drop=True)
#remerge rest of the data for the sampled data sets
final_df=random_subset.merge(final_df,on=['data_object_id','soil_horizon'],how="left")
final_df
| data_object_id | soil_horizon | data_object_type | url | geo_loc_name | |
|---|---|---|---|---|---|
| 0 | nmdc:dobj-11-mrd6vh74 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | USA: Colorado, Niwot Ridge |
| 1 | nmdc:dobj-11-yxf1ea97 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Central Plains Experimental Range |
| 2 | nmdc:dobj-11-nzmgqh66 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Niwot Ridge |
| 3 | nmdc:dobj-11-q23jn377 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Central Plains Experimental Range |
| 4 | nmdc:dobj-11-d9b81x02 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Central Plains Experimental Range |
| 5 | nmdc:dobj-11-vebnze02 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Niwot Ridge |
| 6 | nmdc:dobj-11-yf8a2568 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Niwot Ridge |
| 7 | nmdc:dobj-11-vr52v577 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, North Sterling |
| 8 | nmdc:dobj-11-1mwrks28 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Niwot Ridge |
| 9 | nmdc:dobj-11-t93rat81 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | USA: Colorado, Niwot Ridge |
| 10 | nmdc:dobj-11-r371w335 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, North Sterling |
| 11 | nmdc:dobj-11-02mx2423 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Central Plains Experimental Range |
| 12 | nmdc:dobj-11-05e7k522 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | USA: Colorado, Central Plains Experimental Range |
| 13 | nmdc:dobj-11-bzzn7310 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | USA: Colorado, North Sterling |
| 14 | nmdc:dobj-11-f0shbt75 | M horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Central Plains Experimental Range |
| 15 | nmdc:dobj-11-bxdpkq28 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Niwot Ridge |
| 16 | nmdc:dobj-11-s7dphe48 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Rocky Mountains |
| 17 | nmdc:dobj-11-wkznj445 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Niwot Ridge |
| 18 | nmdc:dobj-11-cwcas077 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | USA: Colorado, Niwot Ridge |
| 19 | nmdc:dobj-11-f42v2x33 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Niwot Ridge |
| 20 | nmdc:dobj-11-a2g14p83 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | USA: Colorado, Niwot Ridge |
| 21 | nmdc:dobj-11-jbv50k17 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | USA: Colorado, Rocky Mountains |
| 22 | nmdc:dobj-11-g97pjb32 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Niwot Ridge |
| 23 | nmdc:dobj-11-parcbj38 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Niwot Ridge |
| 24 | nmdc:dobj-11-k6xn1112 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | USA: Colorado, Rocky Mountains |
| 25 | nmdc:dobj-11-mmv19z03 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Rocky Mountains |
| 26 | nmdc:dobj-11-t3ckzq60 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:dgns... | USA: Colorado, Rocky Mountains |
| 27 | nmdc:dobj-11-qzb4kt50 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Niwot Ridge |
| 28 | nmdc:dobj-11-pe7v1f12 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Niwot Ridge |
| 29 | nmdc:dobj-11-gargwe62 | O horizon | Scaffold Lineage tsv | https://data.microbiomedata.org/data/nmdc:ompr... | USA: Colorado, Niwot Ridge |
Example of what the TSV contig taxa file looks like¶
A snippet of the TSV file we need to iterate over to get the taxa abundance for the contigs is shown below. The third column is the initial count for the taxa, where each row is 1.0. However, there are duplicate rows of taxa, meaning there are actually more than 1.0 for several taxa (though they appear as duplicate rows with 1.0 as the count`). We will take this into consideration when we calculate the relative abundance for each taxa.
tsv_ex_url = final_df.iloc[0]["url"]
response = requests.get(tsv_ex_url)
tsv_data = StringIO(response.text)
tsv_ex_df = pd.read_csv(tsv_data, delimiter="\t")
tsv_data.close()
# Give columns names
tsv_ex_df.columns = ["contig_id", "taxa", "initial_count"]
# sort by taxa
tsv_sorted = tsv_ex_df.sort_values(by="taxa")
# print first 10 rows
tsv_sorted[:10]
| contig_id | taxa | initial_count | |
|---|---|---|---|
| 198544 | nmdc:wfmgan-11-v2j2sd82.1_226793 | Archaea;Candidatus Bathyarchaeota;unclassified... | 1.0 |
| 178664 | nmdc:wfmgan-11-v2j2sd82.1_202043 | Archaea;Candidatus Bathyarchaeota;unclassified... | 1.0 |
| 188766 | nmdc:wfmgan-11-v2j2sd82.1_214509 | Archaea;Candidatus Bathyarchaeota;unclassified... | 1.0 |
| 82572 | nmdc:wfmgan-11-v2j2sd82.1_088319 | Archaea;Candidatus Bathyarchaeota;unclassified... | 1.0 |
| 167330 | nmdc:wfmgan-11-v2j2sd82.1_188140 | Archaea;Candidatus Diapherotrites;unclassified... | 1.0 |
| 90275 | nmdc:wfmgan-11-v2j2sd82.1_097035 | Archaea;Candidatus Korarchaeota;unclassified C... | 1.0 |
| 247594 | nmdc:wfmgan-11-v2j2sd82.1_289263 | Archaea;Candidatus Korarchaeota;unclassified C... | 1.0 |
| 144321 | nmdc:wfmgan-11-v2j2sd82.1_160177 | Archaea;Candidatus Thermoplasmatota;Candidatus... | 1.0 |
| 219385 | nmdc:wfmgan-11-v2j2sd82.1_253052 | Archaea;Candidatus Thermoplasmatota;Candidatus... | 1.0 |
| 38159 | nmdc:wfmgan-11-v2j2sd82.1_039573 | Archaea;Candidatus Thermoplasmatota;Thermoplas... | 1.0 |
Iterate throught the TSVs to get the contig taxa information¶
Using the Python requests library and the StringIO library, the TSV urls can be iterated over gathering the taxa information. The TSVs are converted into dataframes where they are manipulated to suit the data structure needed. The columns are given names and the taxa column is split into a proper list (instead of a string of items separated by a semicolon ;). The third element from the list of taxa is retrieved to get only the phylum level information of the taxa. A grouping function is performed on the taxa column and the Pandas size() functionality is used to calculate the count for how many times each taxa occurs, which is then used to calculate the relative abundance of each taxa for each biosample. After iterating through all of the TSVs, two final taxa dfs are created by concatenating the list of data frames (o_df and m_df).
Any errors in requesting the TSV urls are collected as a dictionary, so we can either try to query them again, or look into why they were not able to be collected.
o_horizon = []
m_horizon = []
errors = []
iteration_counter = 0
for index, row in final_df.iterrows():
iteration_counter += 1
# print an update for every 50 iterations
if iteration_counter % 50 == 0:
print(f"Processed {iteration_counter} rows")
url = row["url"]
horizon = row["soil_horizon"]
dataobj = row["data_object_id"]
geo_loc = row["geo_loc_name"]
data_object_id = row["data_object_id"]
try:
response = requests.get(url)
tsv_data = StringIO(response.text)
tsv_df = pd.read_csv(tsv_data, delimiter="\t")
tsv_data.close()
# Give columns names
tsv_df.columns = ["contig_id", "taxa", "initial_count"]
# split taxa column into a list where a semicolon (;) is the delimeter
tsv_df["taxa"] = tsv_df["taxa"].str.split(";")
# Get only the third element of the list of taxa (the phylum), add "Unknown" it it does not include phylum level, and add
# "Unkown" if the taxa value is empty.
tsv_df["taxa"] = tsv_df["taxa"].apply(lambda x: str(x[2]) if isinstance(x, list) and len(x) >= 3
else str(" ".join(x) + " Unknown") if isinstance(x, list) else "Unknown")
# Get relative abundance for the tsv_df
tsv_df = tsv_df.groupby("taxa").size().reset_index(name="count")
total_count = tsv_df["count"].sum()
tsv_df["relative_abundance"] = (tsv_df["count"] / total_count) * 100
# Add geo location to data frame
tsv_df["geo_loc_name"] = geo_loc
# Add biosample id to data frame
tsv_df["data_object_id"] = dataobj
tsv_df["tsv_url"] = url
# append tsv_df to list depending on the soil horizon type
if horizon == "O horizon":
o_horizon.append(tsv_df)
else:
m_horizon.append(tsv_df)
except Exception as e:
print(f"An error occurred: {e}")
errors.append({
"data_ob_id": dataobj,
"url": url,
"horizon": horizon,
"geo_loc_name": geo_loc
})
continue
# concatenate list of dfs
o_df = pd.concat(o_horizon)
m_df = pd.concat(m_horizon)
m_df
| taxa | count | relative_abundance | geo_loc_name | data_object_id | tsv_url | |
|---|---|---|---|---|---|---|
| 0 | Acidimicrobiia | 958 | 0.375239 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-mrd6vh74 | https://data.microbiomedata.org/data/nmdc:dgns... |
| 1 | Acidithiobacillia | 49 | 0.019193 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-mrd6vh74 | https://data.microbiomedata.org/data/nmdc:dgns... |
| 2 | Aconoidasida | 1 | 0.000392 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-mrd6vh74 | https://data.microbiomedata.org/data/nmdc:dgns... |
| 3 | Actinomycetes | 102364 | 40.094946 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-mrd6vh74 | https://data.microbiomedata.org/data/nmdc:dgns... |
| 4 | Actinopteri | 2 | 0.000783 | USA: Colorado, Niwot Ridge | nmdc:dobj-11-mrd6vh74 | https://data.microbiomedata.org/data/nmdc:dgns... |
| ... | ... | ... | ... | ... | ... | ... |
| 274 | unclassified Verrucomicrobiota | 950 | 0.039821 | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-f0shbt75 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 275 | unclassified Zoopagomycota | 5 | 0.000210 | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-f0shbt75 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 276 | unclassified candidate division NC10 | 2859 | 0.119839 | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-f0shbt75 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 277 | unclassified candidate division Zixibacteria | 373 | 0.015635 | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-f0shbt75 | https://data.microbiomedata.org/data/nmdc:ompr... |
| 278 | unclassified dsDNA viruses, no RNA stage | 1 | 0.000042 | USA: Colorado, Central Plains Experimental Range | nmdc:dobj-11-f0shbt75 | https://data.microbiomedata.org/data/nmdc:ompr... |
2969 rows × 6 columns
Look into any errors that occurred from the TSV requests¶
Any TSVs that could not be requested were added to an errors dictionary.
print(errors)
[]
Define a function to calculate abundance¶
A function is defined that takes an input of a dataframe and calculates the average relative abundance of each taxa.
def taxa_abundance(df):
df = df.drop_duplicates(subset=['data_object_id', 'taxa'])
# pivot the table to find all combos of biosample and taxa - set NAs to 0 for relative abundance
wide_df = df.pivot(index = "data_object_id", columns = "taxa", values = "relative_abundance")
wide_df = wide_df.fillna(0)
wide_df.reset_index(inplace=True)
# convert wide_df back with relative_abundances set to 0 for samples that were missing taxa
melted_df = pd.melt(wide_df, id_vars = "data_object_id", var_name = "taxa", value_name = "relative_abundance")
# calculate abundance and add column to data frame
final_df = melted_df.groupby("taxa")["relative_abundance"].mean().reset_index(name="avg_relative_abundance")
return final_df
Calculate the abundance of the O and M horizon data frames¶
Using the function defined above, the counts_m and counts_o data frames returned from iterating over the TSV files are used as input into the function, where the average relative abundance calculations are returned as data frames. We then concatenate the two data frames together, creating a new column for soil_horizon, where the value is either O or M, depending on which data frame it originally came from.
# caculate abundance for each soil horizon type and get top 25 taxa, grouping the rest
m_final = taxa_abundance(m_df)
o_final = taxa_abundance(o_df)
# combine data frames
o_final["soil_horizon"] = "O"
m_final["soil_horizon"] = "M"
abundance_df = pd.concat([o_final, m_final])
abundance_df
| taxa | avg_relative_abundance | soil_horizon | |
|---|---|---|---|
| 0 | Acidimicrobiia | 0.298787 | O |
| 1 | Acidithiobacillia | 0.031719 | O |
| 2 | Aconoidasida | 0.004912 | O |
| 3 | Actinomycetes | 24.717448 | O |
| 4 | Actinopteri | 0.000884 | O |
| ... | ... | ... | ... |
| 298 | unclassified Zoopagomycota | 0.000373 | M |
| 299 | unclassified candidate division NC10 | 0.104233 | M |
| 300 | unclassified candidate division Zixibacteria | 0.009093 | M |
| 301 | unclassified dsDNA viruses, no RNA stage | 0.000104 | M |
| 302 | unclassified viruses | 0.000901 | M |
617 rows × 3 columns
Plot the taxa abundance of M vs. O horizon soil samples¶
Using the plotly library, the percent abundance of the taxa is plotted as a bar chart - each bar representing the soil horizon and the colors representing the taxa.
# Plot the taxa abundance of each soil type
fig = px.bar(abundance_df, x="soil_horizon", y="avg_relative_abundance", color="taxa",
title = "% Abundance of phylum-level taxa in M and O horizon soil samples in Colorado",
labels = {"soil_horizon": "Soil Horizon", "avg_relative_abundance": "% Abundance"})
fig.update_layout(height=600)
fig.show()
Write a function to calculate the abundance per location¶
This is a function to use with the m_df and o_df outputs from the TSV iteration to calculate the % abundance for each geo_loc_name. It also groups the taxa after the top 5 for each loaction into "other".
def loc_abund(df):
df = df.drop_duplicates(subset=['data_object_id', 'taxa'])
# pivot the table to find all combos of biosample and taxa - set NAs to 0 for relative abundance
wide_df = df.pivot(index = "data_object_id", columns = "taxa", values = "relative_abundance")
wide_df = wide_df.fillna(0)
wide_df.reset_index(inplace=True)
# Add geo_loc_name column to wide_df
wide_df = pd.merge(wide_df, df[['data_object_id', 'geo_loc_name']], on='data_object_id', how='left')
# convert wide_df back with relative_abundances set to 0 for samples that were missing taxa
melted_df = pd.melt(wide_df, id_vars=["data_object_id", "geo_loc_name"], var_name="taxa", value_name="relative_abundance")
final_df = melted_df.groupby(["geo_loc_name", "taxa"])["relative_abundance"].mean().reset_index(name="avg_relative_abundance")
return final_df
Calculate the abundance of the location data frames¶
Using the function defined above, the m_df and the o_df data frames returned from iterating over the TSV files are used as input into the function, where the final abundance calculations and top 5 taxa are returned as data frames. We do the calculation by grouping by geo_loc_name in order to calculate abundances per location. We then concatenate the two data frames together, creating a new column for soil_horizon, where the value is either O or M, depending on which data frame it originally came from.
# caculate abundance for each soil horizon type and get top 5 taxa, grouping the rest
m_loc = loc_abund(m_df)
o_loc = loc_abund(o_df)
# combine data frames
o_loc["soil_horizon"] = "O"
m_loc["soil_horizon"] = "M"
loc_abund_df = pd.concat([o_loc, m_loc])
# Extract only region names from geo_loc_name
loc_abund_df["location"] = loc_abund_df["geo_loc_name"].str.extract(r'Colorado, (.*)')
loc_abund_df
| geo_loc_name | taxa | avg_relative_abundance | soil_horizon | location | |
|---|---|---|---|---|---|
| 0 | USA: Colorado, Niwot Ridge | Acidimicrobiia | 0.304653 | O | Niwot Ridge |
| 1 | USA: Colorado, Niwot Ridge | Acidithiobacillia | 0.027184 | O | Niwot Ridge |
| 2 | USA: Colorado, Niwot Ridge | Aconoidasida | 0.009272 | O | Niwot Ridge |
| 3 | USA: Colorado, Niwot Ridge | Actinomycetes | 25.186204 | O | Niwot Ridge |
| 4 | USA: Colorado, Niwot Ridge | Actinopteri | 0.001133 | O | Niwot Ridge |
| ... | ... | ... | ... | ... | ... |
| 904 | USA: Colorado, North Sterling | unclassified Zoopagomycota | 0.000000 | M | North Sterling |
| 905 | USA: Colorado, North Sterling | unclassified candidate division NC10 | 0.210047 | M | North Sterling |
| 906 | USA: Colorado, North Sterling | unclassified candidate division Zixibacteria | 0.009053 | M | North Sterling |
| 907 | USA: Colorado, North Sterling | unclassified dsDNA viruses, no RNA stage | 0.000000 | M | North Sterling |
| 908 | USA: Colorado, North Sterling | unclassified viruses | 0.004527 | M | North Sterling |
1537 rows × 5 columns
Plot the taxa abundance of M and O horizon soil samples for each location¶
Using the plotly library, the percent abundance of the taxa is plotted as a bar chart for each geo location and faceted by soil horizon.
geo_fig = px.bar(loc_abund_df, x = "soil_horizon", y="avg_relative_abundance", color = "taxa",
facet_col = "location",
facet_col_spacing = 0.1,
title = "% Abundance of phylum-level taxa in M and O horizon samples for each Colorado location",
labels = {"geo_loc_name": "Location", "avg_relative_abundance": "% Abundance"},
height = 600)
# update figure to remove "location=" from facet column labels
geo_fig.for_each_annotation(lambda a: a.update(text=a.text.replace("location=", "")))
# show figure
geo_fig.show()