How does the taxonomic distribution of contigs differ by soil layer (mineral vs organic) in Colorado?¶

This notebook uses the nmdc_api_utilities package (as of March 2025) to explore how the taxononomic distribution of contigs differ by the mineral and organic soil layers in Colorado. It involves using nmdc_api_utilites objects to make NMDC API requests to reach the scaffold lineage TSV data objects in order to analyze the taxanomic distribution. Iterating through the TSV files includes 350+ API calls to get the necessary taxonomic counts and is time consuming.

In [3]:
import requests
import pandas as pd
from io import StringIO
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import nmdc_api_utilities

1. Get all biosamples where soil_horizon exists and the geo_loc_name has "Colorado" in the name¶

The first step in answering how the taxonomic distribution of contigs differ by soil layer is to get a list of all the biosamples that have metadata for soil_horizon and a string matching "Colorado, Rocky Moutains" for the geo_loc_name. Using the Python package 'nmdc_api_utilities', we can use the get_record_by_filter function to do this. We first need create a BiosampleSearch object to search across the "biosample_set" collections. More information regarding the nmdc_api_utilities package can be found here. We then create a mongo-like filter of {"soil_horizon":{"$exists": true}, "geo_loc_name.has_raw_value": {"$regex": "Colorado"}}, a maximum page size of 100, and specifying that we want three fields returned id, soil_horizon, and geo_loc_name. Note that id is returned no matter what. Since we will be joining the results of multiple API requests with a field of id for different collections, we can change the name of the id key to be more explicit - calling it biosample_id instead. Finally, we convert the biosample results to a dataframe called biosample_df. Note that about 517 biosamples are returned.

In [4]:
from nmdc_api_utilities.biosample_search import BiosampleSearch
from nmdc_api_utilities.data_processing import DataProcessing
# Create a BiosampleSearch object
bs_client = BiosampleSearch(env=ENV)
# create a DataProcessing object
dp_client = DataProcessing()
# define the filter
filter = '{"soil_horizon":{"$exists": true}, "geo_loc_name.has_raw_value": {"$regex": "Colorado"}}'
# get the results
bs_results = bs_client.get_record_by_filter(filter=filter, fields="id,soil_horizon,geo_loc_name", max_page_size=100, all_pages=True)
# clarify names
for biosample in bs_results:
    biosample["biosample_id"] = biosample.pop("id")

# convert to df
biosample_df = dp_client.convert_to_df(bs_results)

# Adjust geo_loc_name to not be a dictionary
biosample_df["geo_loc_name"] = biosample_df["geo_loc_name"].apply(lambda x: x.get("has_raw_value"))
biosample_df
Out[4]:
soil_horizon geo_loc_name biosample_id
0 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-00m15h97
1 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-06ta8e31
2 O horizon USA: Colorado, Rocky Mountains nmdc:bsm-11-06tgpb52
3 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-0asn5d63
4 M horizon USA: Colorado, North Sterling nmdc:bsm-11-0djp2e45
... ... ... ...
513 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-zhrzwh12
514 M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zhzner35
515 O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zjsrkd21
516 O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zk6h3328
517 M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-znvc3c66

518 rows × 3 columns

2. Get all DataObjects that are related to the biosamples found in step 1.¶

We'll do this using the get_linked_instances_and_associate_ids function from the nmdc_api_utilities package.

In [5]:
# Now get all DataObjects linked to these biosamples
biosample_dataobject_dictionary = bs_client.get_linked_instances_and_associate_ids(
    ids=biosample_df["biosample_id"].tolist(),          # list of biosample ids
    types =["nmdc:DataObject"]                          # specify we want DataObjects only
    )

# The result is a dictionary with biosample ids as keys and list of linked DataObjects as values
print(f"Retrieved DataObjects for {len(biosample_dataobject_dictionary)} biosamples.")
Retrieved DataObjects for 506 biosamples.

3. Get records for all DataObjects found in step 2.¶

To do this, we'll collect all the DataObject ids found in step 2 and use the get_batch_records function from the nmdc_api_utilities package. This function allows us to get records in batches, which is useful when dealing with a large number of ids. We'll specify that we want to retrieve the id, data_object_type, and url fields for each DataObject. The results will be stored in a dataframe called dataobject_df.

In [6]:
# Gather all DataObject ids into a single list (from the dictionary of lists), then get their records
dojs = [item for sublist in biosample_dataobject_dictionary.values() for item in sublist]

# Now get records for all DataObjects records and convert to dataframe
from nmdc_api_utilities.data_object_search import DataObjectSearch
do_client = DataObjectSearch()
do_results = do_client.get_batch_records(
    id_list=dojs,
    search_field="id",
    fields="id,data_object_type,url"
)
data_object_df = dp_client.convert_to_df(do_results)
# Rename id column to data_object_id for clarity
data_object_df = data_object_df.rename(columns={"id": "data_object_id"})

# Finally, filter these DataObjects to only include those with `data_object_type` of "Scaffold Lineage tsv"
filtered_data_object_df = data_object_df[data_object_df["data_object_type"] == "Scaffold Lineage tsv"]
filtered_data_object_df.head(10)
Out[6]:
data_object_id data_object_type url
5 nmdc:dobj-11-1akcce43 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...
54 nmdc:dobj-11-f2bmdw54 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...
94 nmdc:dobj-11-xt9amn82 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...
101 nmdc:dobj-11-02mx2423 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...
148 nmdc:dobj-11-f1pg9z40 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...
165 nmdc:dobj-11-k1vrrb83 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...
200 nmdc:dobj-11-05e7k522 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns...
312 nmdc:dobj-11-605rmv44 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...
355 nmdc:dobj-11-k6xn1112 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns...
376 nmdc:dobj-11-r9jyj090 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns...

4. Associate DataObjects back to Biosamples¶

We'll use the id field from the biosample_df dataframe (which corresponds to ids of Biosample records) and the id field from the dataobject_df dataframe to associate DataObjects back to their respective Biosamples. This will allow us to link the taxonomic data in the DataObjects to the soil horizon information in the Biosamples. The biosample_dataobject_dictionary created in step 2 will be used to map the relationships.

In [7]:
# First define a function to get biosample_id from data_object_id
def get_biosample_id(data_object_id, biosample_dataobject_dict):
    for biosample_id, data_object_list in biosample_dataobject_dict.items():
        if data_object_id in data_object_list:
            return biosample_id
    return None

# Apply the function to add biosample_id column
filtered_data_object_df["biosample_id"] = filtered_data_object_df["data_object_id"].apply(lambda x: get_biosample_id(x, biosample_dataobject_dictionary))

# Merge with biosample_df to get soil_horizon and geo_loc_name
merged_df = pd.merge(filtered_data_object_df, biosample_df, on="biosample_id", how="left")
merged_df.head(10)
/tmp/ipykernel_2260/1847364105.py:9: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Out[7]:
data_object_id data_object_type url biosample_id soil_horizon geo_loc_name
0 nmdc:dobj-11-1akcce43 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-acqtab53 M horizon USA: Colorado, Niwot Ridge
1 nmdc:dobj-11-f2bmdw54 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-4nb6s749 M horizon USA: Colorado, Central Plains Experimental Range
2 nmdc:dobj-11-xt9amn82 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-0gmd9f09 M horizon USA: Colorado, Niwot Ridge
3 nmdc:dobj-11-02mx2423 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-13rvbb54 M horizon USA: Colorado, Central Plains Experimental Range
4 nmdc:dobj-11-f1pg9z40 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-6vhkk902 M horizon USA: Colorado, Central Plains Experimental Range
5 nmdc:dobj-11-k1vrrb83 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-1fsf9z28 M horizon USA: Colorado, North Sterling
6 nmdc:dobj-11-05e7k522 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... nmdc:bsm-11-6fy8p792 M horizon USA: Colorado, Central Plains Experimental Range
7 nmdc:dobj-11-605rmv44 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-dtngbp38 M horizon USA: Colorado, North Sterling
8 nmdc:dobj-11-k6xn1112 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... nmdc:bsm-11-4wtjyp10 O horizon USA: Colorado, Rocky Mountains
9 nmdc:dobj-11-r9jyj090 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... nmdc:bsm-11-bqwhr666 O horizon USA: Colorado, Niwot Ridge

5. Dereplicate the DataFrame based on data_object_id to ensure uniqueness¶

Prepare a final DataFrame that contains unique entries for each DataObject, ensuring that there are no duplicate records. This step is crucial for accurate analysis and reporting, as it prevents any potential bias or errors that could arise from counting the same DataObject multiple times. This scenario (multiple DataObjects associated with a sinlge Biosample) can result if several biosamples are pooled together for a single metagenome assembly.

In [8]:
# Prepare a final DataFrame that contains unique entries for each DataObject, ensuring that there are no duplicate records. This step is crucial for accurate analysis and reporting, as it prevents any potential bias or errors that could arise from counting the same DataObject multiple times which can result if several biosamples are pooled together in a single metagenome assembly and share the same DataObject.
final_df = merged_df[["data_object_id", "data_object_type", "url","soil_horizon", "geo_loc_name"]].drop_duplicates()
final_df.head(10)
Out[8]:
data_object_id data_object_type url soil_horizon geo_loc_name
0 nmdc:dobj-11-1akcce43 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... M horizon USA: Colorado, Niwot Ridge
1 nmdc:dobj-11-f2bmdw54 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... M horizon USA: Colorado, Central Plains Experimental Range
2 nmdc:dobj-11-xt9amn82 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... M horizon USA: Colorado, Niwot Ridge
3 nmdc:dobj-11-02mx2423 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... M horizon USA: Colorado, Central Plains Experimental Range
4 nmdc:dobj-11-f1pg9z40 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... M horizon USA: Colorado, Central Plains Experimental Range
5 nmdc:dobj-11-k1vrrb83 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... M horizon USA: Colorado, North Sterling
6 nmdc:dobj-11-05e7k522 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... M horizon USA: Colorado, Central Plains Experimental Range
7 nmdc:dobj-11-605rmv44 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... M horizon USA: Colorado, North Sterling
8 nmdc:dobj-11-k6xn1112 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... O horizon USA: Colorado, Rocky Mountains
9 nmdc:dobj-11-r9jyj090 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... O horizon USA: Colorado, Niwot Ridge

Show how many results have M horizon vs. O horizon¶

The soil_horizon column can be counted using the value_counts() functionality. There are many more M horizon samples than O horizon.

In [9]:
# Show unique soil horizons:
soil_horizons = final_df['soil_horizon'].value_counts()
print(soil_horizons)
soil_horizon
M horizon    177
O horizon     47
Name: count, dtype: int64

Randomly select a subset of these datasets for which to pull information¶

In [10]:
# randomly select 15 data sets in each horizon
n = 15

#list the different types
list_type=soil_horizons.index.tolist()

#for each type, randomly horizon n data sets and save them into list
random_subset=[]
for type in list_type:
    #each data object ID and horizon type
    sample_type=final_df[['data_object_id','soil_horizon']].drop_duplicates()
    #filter to current horizon type
    sample_type=sample_type[sample_type['soil_horizon']==type]
    #randomly horizon n data object IDs in current horizon type
    sample_type=sample_type.sample(n=n, random_state=2)
    #save
    random_subset.append(sample_type)

#resave list as dataframe
random_subset=pd.concat(random_subset).reset_index(drop=True)

#remerge rest of the data for the sampled data sets
final_df=random_subset.merge(final_df,on=['data_object_id','soil_horizon'],how="left")

final_df
Out[10]:
data_object_id soil_horizon data_object_type url geo_loc_name
0 nmdc:dobj-11-mrd6vh74 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Niwot Ridge
1 nmdc:dobj-11-yxf1ea97 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Central Plains Experimental Range
2 nmdc:dobj-11-nzmgqh66 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
3 nmdc:dobj-11-q23jn377 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Central Plains Experimental Range
4 nmdc:dobj-11-d9b81x02 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Central Plains Experimental Range
5 nmdc:dobj-11-vebnze02 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
6 nmdc:dobj-11-yf8a2568 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
7 nmdc:dobj-11-vr52v577 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, North Sterling
8 nmdc:dobj-11-1mwrks28 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
9 nmdc:dobj-11-t93rat81 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Niwot Ridge
10 nmdc:dobj-11-r371w335 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, North Sterling
11 nmdc:dobj-11-02mx2423 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Central Plains Experimental Range
12 nmdc:dobj-11-05e7k522 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Central Plains Experimental Range
13 nmdc:dobj-11-bzzn7310 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, North Sterling
14 nmdc:dobj-11-f0shbt75 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Central Plains Experimental Range
15 nmdc:dobj-11-bxdpkq28 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
16 nmdc:dobj-11-s7dphe48 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Rocky Mountains
17 nmdc:dobj-11-wkznj445 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
18 nmdc:dobj-11-cwcas077 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Niwot Ridge
19 nmdc:dobj-11-f42v2x33 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
20 nmdc:dobj-11-a2g14p83 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Niwot Ridge
21 nmdc:dobj-11-jbv50k17 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Rocky Mountains
22 nmdc:dobj-11-g97pjb32 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
23 nmdc:dobj-11-parcbj38 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
24 nmdc:dobj-11-k6xn1112 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Rocky Mountains
25 nmdc:dobj-11-mmv19z03 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Rocky Mountains
26 nmdc:dobj-11-t3ckzq60 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Rocky Mountains
27 nmdc:dobj-11-qzb4kt50 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
28 nmdc:dobj-11-pe7v1f12 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
29 nmdc:dobj-11-gargwe62 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge

Example of what the TSV contig taxa file looks like¶

A snippet of the TSV file we need to iterate over to get the taxa abundance for the contigs is shown below. The third column is the initial count for the taxa, where each row is 1.0. However, there are duplicate rows of taxa, meaning there are actually more than 1.0 for several taxa (though they appear as duplicate rows with 1.0 as the count`). We will take this into consideration when we calculate the relative abundance for each taxa.

In [11]:
tsv_ex_url = final_df.iloc[0]["url"]

response = requests.get(tsv_ex_url)
tsv_data = StringIO(response.text)

tsv_ex_df = pd.read_csv(tsv_data, delimiter="\t")
tsv_data.close()

# Give columns names
tsv_ex_df.columns = ["contig_id", "taxa", "initial_count"]

# sort by taxa
tsv_sorted = tsv_ex_df.sort_values(by="taxa")

# print first 10 rows
tsv_sorted[:10]
Out[11]:
contig_id taxa initial_count
198544 nmdc:wfmgan-11-v2j2sd82.1_226793 Archaea;Candidatus Bathyarchaeota;unclassified... 1.0
178664 nmdc:wfmgan-11-v2j2sd82.1_202043 Archaea;Candidatus Bathyarchaeota;unclassified... 1.0
188766 nmdc:wfmgan-11-v2j2sd82.1_214509 Archaea;Candidatus Bathyarchaeota;unclassified... 1.0
82572 nmdc:wfmgan-11-v2j2sd82.1_088319 Archaea;Candidatus Bathyarchaeota;unclassified... 1.0
167330 nmdc:wfmgan-11-v2j2sd82.1_188140 Archaea;Candidatus Diapherotrites;unclassified... 1.0
90275 nmdc:wfmgan-11-v2j2sd82.1_097035 Archaea;Candidatus Korarchaeota;unclassified C... 1.0
247594 nmdc:wfmgan-11-v2j2sd82.1_289263 Archaea;Candidatus Korarchaeota;unclassified C... 1.0
144321 nmdc:wfmgan-11-v2j2sd82.1_160177 Archaea;Candidatus Thermoplasmatota;Candidatus... 1.0
219385 nmdc:wfmgan-11-v2j2sd82.1_253052 Archaea;Candidatus Thermoplasmatota;Candidatus... 1.0
38159 nmdc:wfmgan-11-v2j2sd82.1_039573 Archaea;Candidatus Thermoplasmatota;Thermoplas... 1.0

Iterate throught the TSVs to get the contig taxa information¶

Using the Python requests library and the StringIO library, the TSV urls can be iterated over gathering the taxa information. The TSVs are converted into dataframes where they are manipulated to suit the data structure needed. The columns are given names and the taxa column is split into a proper list (instead of a string of items separated by a semicolon ;). The third element from the list of taxa is retrieved to get only the phylum level information of the taxa. A grouping function is performed on the taxa column and the Pandas size() functionality is used to calculate the count for how many times each taxa occurs, which is then used to calculate the relative abundance of each taxa for each biosample. After iterating through all of the TSVs, two final taxa dfs are created by concatenating the list of data frames (o_df and m_df).

Any errors in requesting the TSV urls are collected as a dictionary, so we can either try to query them again, or look into why they were not able to be collected.

In [12]:
o_horizon = []
m_horizon = []
errors = []

iteration_counter = 0


for index, row in final_df.iterrows():
    
    iteration_counter += 1

    # print an update for every 50 iterations
    if iteration_counter % 50 == 0:
        print(f"Processed {iteration_counter} rows")

    url = row["url"]
    horizon = row["soil_horizon"]
    dataobj = row["data_object_id"]
    geo_loc = row["geo_loc_name"]
    data_object_id = row["data_object_id"]

    try:
        response = requests.get(url)
        tsv_data = StringIO(response.text)
    
        tsv_df = pd.read_csv(tsv_data, delimiter="\t")
        tsv_data.close()
    
        # Give columns names
        tsv_df.columns = ["contig_id", "taxa", "initial_count"]
    
        # split taxa column into a list where a semicolon (;) is the delimeter
        tsv_df["taxa"] = tsv_df["taxa"].str.split(";")

        # Get only the third element of the list of taxa (the phylum), add "Unknown" it it does not include phylum level, and add
        # "Unkown" if the taxa value is empty.
        tsv_df["taxa"] = tsv_df["taxa"].apply(lambda x: str(x[2]) if isinstance(x, list) and len(x) >= 3 
                                              else str(" ".join(x) + " Unknown") if isinstance(x, list) else "Unknown")


        # Get relative abundance for the tsv_df
        tsv_df = tsv_df.groupby("taxa").size().reset_index(name="count")
        total_count = tsv_df["count"].sum()
        tsv_df["relative_abundance"] = (tsv_df["count"] / total_count) * 100

        # Add geo location to data frame
        tsv_df["geo_loc_name"] = geo_loc

        # Add biosample id to data frame
        tsv_df["data_object_id"] = dataobj
        tsv_df["tsv_url"] = url

        # append tsv_df to list depending on the soil horizon type
        if horizon == "O horizon":
            o_horizon.append(tsv_df)
        else:
            m_horizon.append(tsv_df)

    except Exception as e:
        print(f"An error occurred: {e}")
        errors.append({
            "data_ob_id": dataobj,
            "url": url,
            "horizon": horizon,
            "geo_loc_name": geo_loc
                                    })
        continue

# concatenate list of dfs
o_df = pd.concat(o_horizon)
m_df = pd.concat(m_horizon)

m_df
Out[12]:
taxa count relative_abundance geo_loc_name data_object_id tsv_url
0 Acidimicrobiia 958 0.375239 USA: Colorado, Niwot Ridge nmdc:dobj-11-mrd6vh74 https://data.microbiomedata.org/data/nmdc:dgns...
1 Acidithiobacillia 49 0.019193 USA: Colorado, Niwot Ridge nmdc:dobj-11-mrd6vh74 https://data.microbiomedata.org/data/nmdc:dgns...
2 Aconoidasida 1 0.000392 USA: Colorado, Niwot Ridge nmdc:dobj-11-mrd6vh74 https://data.microbiomedata.org/data/nmdc:dgns...
3 Actinomycetes 102364 40.094946 USA: Colorado, Niwot Ridge nmdc:dobj-11-mrd6vh74 https://data.microbiomedata.org/data/nmdc:dgns...
4 Actinopteri 2 0.000783 USA: Colorado, Niwot Ridge nmdc:dobj-11-mrd6vh74 https://data.microbiomedata.org/data/nmdc:dgns...
... ... ... ... ... ... ...
274 unclassified Verrucomicrobiota 950 0.039821 USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-f0shbt75 https://data.microbiomedata.org/data/nmdc:ompr...
275 unclassified Zoopagomycota 5 0.000210 USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-f0shbt75 https://data.microbiomedata.org/data/nmdc:ompr...
276 unclassified candidate division NC10 2859 0.119839 USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-f0shbt75 https://data.microbiomedata.org/data/nmdc:ompr...
277 unclassified candidate division Zixibacteria 373 0.015635 USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-f0shbt75 https://data.microbiomedata.org/data/nmdc:ompr...
278 unclassified dsDNA viruses, no RNA stage 1 0.000042 USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-f0shbt75 https://data.microbiomedata.org/data/nmdc:ompr...

2969 rows × 6 columns

Look into any errors that occurred from the TSV requests¶

Any TSVs that could not be requested were added to an errors dictionary.

In [13]:
print(errors)
[]

Define a function to calculate abundance¶

A function is defined that takes an input of a dataframe and calculates the average relative abundance of each taxa.

In [14]:
def taxa_abundance(df):

    df = df.drop_duplicates(subset=['data_object_id', 'taxa'])

    # pivot the table to find all combos of biosample and taxa - set NAs to 0 for relative abundance
    wide_df = df.pivot(index = "data_object_id", columns = "taxa", values = "relative_abundance")
    wide_df = wide_df.fillna(0)
    wide_df.reset_index(inplace=True)
    
    # convert wide_df back with relative_abundances set to 0 for samples that were missing taxa
    melted_df = pd.melt(wide_df, id_vars = "data_object_id", var_name = "taxa", value_name = "relative_abundance")

    # calculate abundance and add column to data frame
    final_df = melted_df.groupby("taxa")["relative_abundance"].mean().reset_index(name="avg_relative_abundance")

    return final_df

Calculate the abundance of the O and M horizon data frames¶

Using the function defined above, the counts_m and counts_o data frames returned from iterating over the TSV files are used as input into the function, where the average relative abundance calculations are returned as data frames. We then concatenate the two data frames together, creating a new column for soil_horizon, where the value is either O or M, depending on which data frame it originally came from.

In [15]:
# caculate abundance for each soil horizon type and get top 25 taxa, grouping the rest
m_final = taxa_abundance(m_df)
o_final = taxa_abundance(o_df)

# combine data frames
o_final["soil_horizon"] = "O"
m_final["soil_horizon"] = "M"
abundance_df = pd.concat([o_final, m_final])

abundance_df
Out[15]:
taxa avg_relative_abundance soil_horizon
0 Acidimicrobiia 0.298787 O
1 Acidithiobacillia 0.031719 O
2 Aconoidasida 0.004912 O
3 Actinomycetes 24.717448 O
4 Actinopteri 0.000884 O
... ... ... ...
298 unclassified Zoopagomycota 0.000373 M
299 unclassified candidate division NC10 0.104233 M
300 unclassified candidate division Zixibacteria 0.009093 M
301 unclassified dsDNA viruses, no RNA stage 0.000104 M
302 unclassified viruses 0.000901 M

617 rows × 3 columns

Plot the taxa abundance of M vs. O horizon soil samples¶

Using the plotly library, the percent abundance of the taxa is plotted as a bar chart - each bar representing the soil horizon and the colors representing the taxa.

In [16]:
# Plot the taxa abundance of each soil type
fig = px.bar(abundance_df, x="soil_horizon", y="avg_relative_abundance", color="taxa", 
             title = "% Abundance of phylum-level taxa in M and O horizon soil samples in Colorado", 
             labels = {"soil_horizon": "Soil Horizon", "avg_relative_abundance": "% Abundance"})
    
fig.update_layout(height=600)
fig.show()          

Write a function to calculate the abundance per location¶

This is a function to use with the m_df and o_df outputs from the TSV iteration to calculate the % abundance for each geo_loc_name. It also groups the taxa after the top 5 for each loaction into "other".

In [17]:
def loc_abund(df):

    df = df.drop_duplicates(subset=['data_object_id', 'taxa'])

    # pivot the table to find all combos of biosample and taxa - set NAs to 0 for relative abundance
    wide_df = df.pivot(index = "data_object_id", columns = "taxa", values = "relative_abundance")
    wide_df = wide_df.fillna(0)
    wide_df.reset_index(inplace=True)

    # Add geo_loc_name column to wide_df
    wide_df = pd.merge(wide_df, df[['data_object_id', 'geo_loc_name']], on='data_object_id', how='left')
    
    # convert wide_df back with relative_abundances set to 0 for samples that were missing taxa
    melted_df = pd.melt(wide_df, id_vars=["data_object_id", "geo_loc_name"], var_name="taxa", value_name="relative_abundance")

    final_df = melted_df.groupby(["geo_loc_name", "taxa"])["relative_abundance"].mean().reset_index(name="avg_relative_abundance")

    return final_df

Calculate the abundance of the location data frames¶

Using the function defined above, the m_df and the o_df data frames returned from iterating over the TSV files are used as input into the function, where the final abundance calculations and top 5 taxa are returned as data frames. We do the calculation by grouping by geo_loc_name in order to calculate abundances per location. We then concatenate the two data frames together, creating a new column for soil_horizon, where the value is either O or M, depending on which data frame it originally came from.

In [18]:
# caculate abundance for each soil horizon type and get top 5 taxa, grouping the rest
m_loc = loc_abund(m_df)
o_loc = loc_abund(o_df)

# combine data frames
o_loc["soil_horizon"] = "O"
m_loc["soil_horizon"] = "M"
loc_abund_df = pd.concat([o_loc, m_loc])

# Extract only region names from geo_loc_name
loc_abund_df["location"] = loc_abund_df["geo_loc_name"].str.extract(r'Colorado, (.*)')

loc_abund_df
Out[18]:
geo_loc_name taxa avg_relative_abundance soil_horizon location
0 USA: Colorado, Niwot Ridge Acidimicrobiia 0.304653 O Niwot Ridge
1 USA: Colorado, Niwot Ridge Acidithiobacillia 0.027184 O Niwot Ridge
2 USA: Colorado, Niwot Ridge Aconoidasida 0.009272 O Niwot Ridge
3 USA: Colorado, Niwot Ridge Actinomycetes 25.186204 O Niwot Ridge
4 USA: Colorado, Niwot Ridge Actinopteri 0.001133 O Niwot Ridge
... ... ... ... ... ...
904 USA: Colorado, North Sterling unclassified Zoopagomycota 0.000000 M North Sterling
905 USA: Colorado, North Sterling unclassified candidate division NC10 0.210047 M North Sterling
906 USA: Colorado, North Sterling unclassified candidate division Zixibacteria 0.009053 M North Sterling
907 USA: Colorado, North Sterling unclassified dsDNA viruses, no RNA stage 0.000000 M North Sterling
908 USA: Colorado, North Sterling unclassified viruses 0.004527 M North Sterling

1537 rows × 5 columns

Plot the taxa abundance of M and O horizon soil samples for each location¶

Using the plotly library, the percent abundance of the taxa is plotted as a bar chart for each geo location and faceted by soil horizon.

In [19]:
geo_fig = px.bar(loc_abund_df, x = "soil_horizon", y="avg_relative_abundance", color = "taxa", 
                 facet_col = "location",
                 facet_col_spacing = 0.1,
                 title = "% Abundance of phylum-level taxa in M and O horizon samples for each Colorado location", 
                 labels = {"geo_loc_name": "Location", "avg_relative_abundance": "% Abundance"},
                 height = 600)
# update figure to remove "location=" from facet column labels
geo_fig.for_each_annotation(lambda a: a.update(text=a.text.replace("location=", "")))

# show figure
geo_fig.show()