How does the taxonomic distribution of contigs differ by soil layer (mineral vs organic) in Colorado?¶

This notebook uses the nmdc_api_utilities package (as of March 2025) to explore how the taxononomic distribution of contigs differ by the mineral and organic soil layers in Colorado. It involves using nmdc_api_utilites objects to make NMDC API requests to reach the scaffold lineage TSV data objects in order to analyze the taxanomic distribution. Iterating through the TSV files includes 350+ API calls to get the necessary taxonomic counts and is time consuming.

In [3]:
import requests
import pandas as pd
from io import StringIO
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import nmdc_api_utilities

1. Get all biosamples where soil_horizon exists and the geo_loc_name has "Colorado" in the name¶

The first step in answering how the taxonomic distribution of contigs differ by soil layer is to get a list of all the biosamples that have metadata for soil_horizon and a string matching "Colorado, Rocky Moutains" for the geo_loc_name. Using the Python package 'nmdc_api_utilities', we can use the get_record_by_filter function to do this. We first need create a BiosampleSearch object to search across the "biosample_set" collections. More information regarding the nmdc_api_utilities package can be found here. We then create a mongo-like filter of {"soil_horizon":{"$exists": true}, "geo_loc_name.has_raw_value": {"$regex": "Colorado"}}, a maximum page size of 100, and specifying that we want three fields returned id, soil_horizon, and geo_loc_name. Note that id is returned no matter what. Since we will be joining the results of multiple API requests with a field of id for different collections, we can change the name of the id key to be more explicit - calling it biosample_id instead. Finally, we convert the biosample results to a dataframe called biosample_df. Note that about 517 biosamples are returned.

In [4]:
from nmdc_api_utilities.biosample_search import BiosampleSearch
from nmdc_api_utilities.data_processing import DataProcessing
# Create a BiosampleSearch object
bs_client = BiosampleSearch(env=ENV)
# create a DataProcessing object
dp_client = DataProcessing()
# define the filter
filter = '{"soil_horizon":{"$exists": true}, "geo_loc_name.has_raw_value": {"$regex": "Colorado"}}'
# get the results
bs_results = bs_client.get_record_by_filter(filter=filter, fields="id,soil_horizon,geo_loc_name", max_page_size=100, all_pages=True)
# clarify names
for biosample in bs_results:
    biosample["biosample_id"] = biosample.pop("id")

# convert to df
biosample_df = dp_client.convert_to_df(bs_results)

# Adjust geo_loc_name to not be a dictionary
biosample_df["geo_loc_name"] = biosample_df["geo_loc_name"].apply(lambda x: x.get("has_raw_value"))
biosample_df
Out[4]:
soil_horizon geo_loc_name biosample_id
0 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-00m15h97
1 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-06ta8e31
2 O horizon USA: Colorado, Rocky Mountains nmdc:bsm-11-06tgpb52
3 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-0asn5d63
4 M horizon USA: Colorado, North Sterling nmdc:bsm-11-0djp2e45
... ... ... ...
522 M horizon USA: Colorado, Central Plains Experimental Range nmdc:bsm-11-zhrzwh12
523 M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zhzner35
524 O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zjsrkd21
525 O horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-zk6h3328
526 M horizon USA: Colorado, Niwot Ridge nmdc:bsm-11-znvc3c66

527 rows × 3 columns

2. Get all DataObjects that are related to the biosamples found in step 1.¶

We'll do this using the get_linked_instances_and_associate_ids function from the nmdc_api_utilities package.

In [5]:
# Now get all DataObjects linked to these biosamples
biosample_dataobject_dictionary = bs_client.get_linked_instances_and_associate_ids(
    ids=biosample_df["biosample_id"].tolist(),          # list of biosample ids
    types =["nmdc:DataObject"]                          # specify we want DataObjects only
    )

# The result is a dictionary with biosample ids as keys and list of linked DataObjects as values
print(f"Retrieved DataObjects for {len(biosample_dataobject_dictionary)} biosamples.")
Retrieved DataObjects for 506 biosamples.

3. Get records for all DataObjects found in step 2.¶

To do this, we'll collect all the DataObject ids found in step 2 and use the get_batch_records function from the nmdc_api_utilities package. This function allows us to get records in batches, which is useful when dealing with a large number of ids. We'll specify that we want to retrieve the id, data_object_type, and url fields for each DataObject. The results will be stored in a dataframe called dataobject_df.

In [6]:
# Gather all DataObject ids into a single list (from the dictionary of lists), then get their records
dojs = [item for sublist in biosample_dataobject_dictionary.values() for item in sublist]

# Now get records for all DataObjects records and convert to dataframe
from nmdc_api_utilities.data_object_search import DataObjectSearch
do_client = DataObjectSearch()
do_results = do_client.get_batch_records(
    id_list=dojs,
    search_field="id",
    fields="id,data_object_type,url"
)
data_object_df = dp_client.convert_to_df(do_results)
# Rename id column to data_object_id for clarity
data_object_df = data_object_df.rename(columns={"id": "data_object_id"})

# Finally, filter these DataObjects to only include those with `data_object_type` of "Scaffold Lineage tsv"
filtered_data_object_df = data_object_df[data_object_df["data_object_type"] == "Scaffold Lineage tsv"]
filtered_data_object_df.head(10)
Out[6]:
data_object_id data_object_type url
6 nmdc:dobj-11-24dkan34 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...
113 nmdc:dobj-11-60bj8051 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...
114 nmdc:dobj-11-63rwtd28 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...
141 nmdc:dobj-11-f77hjz36 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...
155 nmdc:dobj-11-jjeye060 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns...
218 nmdc:dobj-11-31bhfp21 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns...
441 nmdc:dobj-11-bxdpkq28 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...
466 nmdc:dobj-11-mjk8t717 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...
522 nmdc:dobj-11-7qxt8k38 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...
558 nmdc:dobj-11-jg6awk48 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr...

4. Associate DataObjects back to Biosamples¶

We'll use the id field from the biosample_df dataframe (which corresponds to ids of Biosample records) and the id field from the dataobject_df dataframe to associate DataObjects back to their respective Biosamples. This will allow us to link the taxonomic data in the DataObjects to the soil horizon information in the Biosamples. The biosample_dataobject_dictionary created in step 2 will be used to map the relationships.

In [7]:
# First define a function to get biosample_id from data_object_id
def get_biosample_id(data_object_id, biosample_dataobject_dict):
    for biosample_id, data_object_list in biosample_dataobject_dict.items():
        if data_object_id in data_object_list:
            return biosample_id
    return None

# Apply the function to add biosample_id column
filtered_data_object_df["biosample_id"] = filtered_data_object_df["data_object_id"].apply(lambda x: get_biosample_id(x, biosample_dataobject_dictionary))

# Merge with biosample_df to get soil_horizon and geo_loc_name
merged_df = pd.merge(filtered_data_object_df, biosample_df, on="biosample_id", how="left")
merged_df.head(10)
Out[7]:
data_object_id data_object_type url biosample_id soil_horizon geo_loc_name
0 nmdc:dobj-11-24dkan34 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-758yrs33 M horizon USA: Colorado, Rocky Mountains
1 nmdc:dobj-11-60bj8051 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-n2j65n33 M horizon USA: Colorado, Niwot Ridge
2 nmdc:dobj-11-63rwtd28 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-qdrkx092 M horizon USA: Colorado, Central Plains Experimental Range
3 nmdc:dobj-11-f77hjz36 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-3nqemc54 M horizon USA: Colorado, Central Plains Experimental Range
4 nmdc:dobj-11-jjeye060 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... nmdc:bsm-11-3dmqhk09 M horizon USA: Colorado, Niwot Ridge
5 nmdc:dobj-11-31bhfp21 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... nmdc:bsm-11-62mwwc29 M horizon USA: Colorado, Central Plains Experimental Range
6 nmdc:dobj-11-bxdpkq28 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-16yk2m32 O horizon USA: Colorado, Niwot Ridge
7 nmdc:dobj-11-mjk8t717 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-7wqz4q59 M horizon USA: Colorado, Central Plains Experimental Range
8 nmdc:dobj-11-7qxt8k38 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-8f1c6514 M horizon USA: Colorado, Niwot Ridge
9 nmdc:dobj-11-jg6awk48 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-11-00m15h97 M horizon USA: Colorado, Central Plains Experimental Range

5. Dereplicate the DataFrame based on data_object_id to ensure uniqueness¶

Prepare a final DataFrame that contains unique entries for each DataObject, ensuring that there are no duplicate records. This step is crucial for accurate analysis and reporting, as it prevents any potential bias or errors that could arise from counting the same DataObject multiple times. This scenario (multiple DataObjects associated with a sinlge Biosample) can result if several biosamples are pooled together for a single metagenome assembly.

In [8]:
# Prepare a final DataFrame that contains unique entries for each DataObject, ensuring that there are no duplicate records. This step is crucial for accurate analysis and reporting, as it prevents any potential bias or errors that could arise from counting the same DataObject multiple times which can result if several biosamples are pooled together in a single metagenome assembly and share the same DataObject.
final_df = merged_df[["data_object_id", "data_object_type", "url","soil_horizon", "geo_loc_name"]].drop_duplicates()
final_df.head(10)
Out[8]:
data_object_id data_object_type url soil_horizon geo_loc_name
0 nmdc:dobj-11-24dkan34 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... M horizon USA: Colorado, Rocky Mountains
1 nmdc:dobj-11-60bj8051 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... M horizon USA: Colorado, Niwot Ridge
2 nmdc:dobj-11-63rwtd28 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... M horizon USA: Colorado, Central Plains Experimental Range
3 nmdc:dobj-11-f77hjz36 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... M horizon USA: Colorado, Central Plains Experimental Range
4 nmdc:dobj-11-jjeye060 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... M horizon USA: Colorado, Niwot Ridge
5 nmdc:dobj-11-31bhfp21 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... M horizon USA: Colorado, Central Plains Experimental Range
6 nmdc:dobj-11-bxdpkq28 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... O horizon USA: Colorado, Niwot Ridge
7 nmdc:dobj-11-mjk8t717 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... M horizon USA: Colorado, Central Plains Experimental Range
8 nmdc:dobj-11-7qxt8k38 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... M horizon USA: Colorado, Niwot Ridge
9 nmdc:dobj-11-jg6awk48 Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... M horizon USA: Colorado, Central Plains Experimental Range

Show how many results have M horizon vs. O horizon¶

The soil_horizon column can be counted using the value_counts() functionality. There are many more M horizon samples than O horizon.

In [9]:
# Show unique soil horizons:
soil_horizons = final_df['soil_horizon'].value_counts()
print(soil_horizons)
soil_horizon
M horizon    177
O horizon     47
Name: count, dtype: int64

Randomly select a subset of these datasets for which to pull information¶

In [10]:
# randomly select 15 data sets in each horizon
n = 15

#list the different types
list_type=soil_horizons.index.tolist()

#for each type, randomly horizon n data sets and save them into list
random_subset=[]
for type in list_type:
    #each data object ID and horizon type
    sample_type=final_df[['data_object_id','soil_horizon']].drop_duplicates()
    #filter to current horizon type
    sample_type=sample_type[sample_type['soil_horizon']==type]
    #randomly horizon n data object IDs in current horizon type
    sample_type=sample_type.sample(n=n, random_state=2)
    #save
    random_subset.append(sample_type)

#resave list as dataframe
random_subset=pd.concat(random_subset).reset_index(drop=True)

#remerge rest of the data for the sampled data sets
final_df=random_subset.merge(final_df,on=['data_object_id','soil_horizon'],how="left")

final_df
Out[10]:
data_object_id soil_horizon data_object_type url geo_loc_name
0 nmdc:dobj-11-xe3ayr72 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, North Sterling
1 nmdc:dobj-11-9ntp4y50 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Rocky Mountains
2 nmdc:dobj-11-94bcvb82 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Central Plains Experimental Range
3 nmdc:dobj-11-np43c928 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Niwot Ridge
4 nmdc:dobj-11-bn6nzj80 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, North Sterling
5 nmdc:dobj-11-yn322e86 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Niwot Ridge
6 nmdc:dobj-11-t8asqx28 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Central Plains Experimental Range
7 nmdc:dobj-11-gz9tyw27 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Central Plains Experimental Range
8 nmdc:dobj-11-tjezch87 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Central Plains Experimental Range
9 nmdc:dobj-11-x5gf9468 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Central Plains Experimental Range
10 nmdc:dobj-11-t96svj25 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Niwot Ridge
11 nmdc:dobj-11-f77hjz36 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Central Plains Experimental Range
12 nmdc:dobj-11-mjk8t717 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Central Plains Experimental Range
13 nmdc:dobj-11-kh9wyj03 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, North Sterling
14 nmdc:dobj-11-byrh0203 M horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Central Plains Experimental Range
15 nmdc:dobj-11-jbv50k17 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Rocky Mountains
16 nmdc:dobj-11-jp45gr33 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Rocky Mountains
17 nmdc:dobj-11-s7dphe48 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Rocky Mountains
18 nmdc:dobj-11-n6ek3222 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
19 nmdc:dobj-11-0fzrkr83 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
20 nmdc:dobj-11-zjhawc26 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Rocky Mountains
21 nmdc:dobj-11-5s6wt245 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Niwot Ridge
22 nmdc:dobj-11-ven5zv88 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
23 nmdc:dobj-11-ffa16a78 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Niwot Ridge
24 nmdc:dobj-11-bxdpkq28 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Niwot Ridge
25 nmdc:dobj-11-n7hhax28 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Rocky Mountains
26 nmdc:dobj-11-05tb2m57 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Niwot Ridge
27 nmdc:dobj-11-1kspk380 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Niwot Ridge
28 nmdc:dobj-11-n8d4sw75 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:ompr... USA: Colorado, Rocky Mountains
29 nmdc:dobj-11-t3ckzq60 O horizon Scaffold Lineage tsv https://data.microbiomedata.org/data/nmdc:dgns... USA: Colorado, Rocky Mountains

Example of what the TSV contig taxa file looks like¶

A snippet of the TSV file we need to iterate over to get the taxa abundance for the contigs is shown below. The third column is the initial count for the taxa, where each row is 1.0. However, there are duplicate rows of taxa, meaning there are actually more than 1.0 for several taxa (though they appear as duplicate rows with 1.0 as the count`). We will take this into consideration when we calculate the relative abundance for each taxa.

In [11]:
tsv_ex_url = final_df.iloc[0]["url"]

response = requests.get(tsv_ex_url)
tsv_data = StringIO(response.text)

tsv_ex_df = pd.read_csv(tsv_data, delimiter="\t")
tsv_data.close()

# Give columns names
tsv_ex_df.columns = ["contig_id", "taxa", "initial_count"]

# sort by taxa
tsv_sorted = tsv_ex_df.sort_values(by="taxa")

# print first 10 rows
tsv_sorted[:10]
Out[11]:
contig_id taxa initial_count
12115 nmdc:wfmgas-11-t453yc57.1_scf_469_c1 Archaea 1.0
12881 nmdc:wfmgas-11-t453yc57.1_scf_5515_c1 Archaea;Candidatus Bathyarchaeota;unclassified... 1.0
4596 nmdc:wfmgas-11-t453yc57.1_scf_15286_c1 Archaea;Candidatus Diapherotrites;unclassified... 1.0
16613 nmdc:wfmgas-11-t453yc57.1_scf_9588_c1 Archaea;Candidatus Thermoplasmatota;Thermoplas... 1.0
3558 nmdc:wfmgas-11-t453yc57.1_scf_14097_c1 Archaea;Candidatus Thermoplasmatota;Thermoplas... 1.0
16304 nmdc:wfmgas-11-t453yc57.1_scf_9244_c1 Archaea;Candidatus Thermoplasmatota;Thermoplas... 1.0
9863 nmdc:wfmgas-11-t453yc57.1_scf_235_c1 Archaea;Candidatus Thermoplasmatota;Thermoplas... 1.0
225 nmdc:wfmgas-11-t453yc57.1_scf_10265_c1 Archaea;Candidatus Thermoplasmatota;Thermoplas... 1.0
11792 nmdc:wfmgas-11-t453yc57.1_scf_4358_c1 Archaea;Euryarchaeota;Archaeoglobi;Archaeoglob... 1.0
9537 nmdc:wfmgas-11-t453yc57.1_scf_21186_c1 Archaea;Euryarchaeota;Archaeoglobi;Archaeoglob... 1.0

Iterate throught the TSVs to get the contig taxa information¶

Using the Python requests library and the StringIO library, the TSV urls can be iterated over gathering the taxa information. The TSVs are converted into dataframes where they are manipulated to suit the data structure needed. The columns are given names and the taxa column is split into a proper list (instead of a string of items separated by a semicolon ;). The third element from the list of taxa is retrieved to get only the phylum level information of the taxa. A grouping function is performed on the taxa column and the Pandas size() functionality is used to calculate the count for how many times each taxa occurs, which is then used to calculate the relative abundance of each taxa for each biosample. After iterating through all of the TSVs, two final taxa dfs are created by concatenating the list of data frames (o_df and m_df).

Any errors in requesting the TSV urls are collected as a dictionary, so we can either try to query them again, or look into why they were not able to be collected.

In [12]:
o_horizon = []
m_horizon = []
errors = []

iteration_counter = 0


for index, row in final_df.iterrows():
    
    iteration_counter += 1

    # print an update for every 50 iterations
    if iteration_counter % 50 == 0:
        print(f"Processed {iteration_counter} rows")

    url = row["url"]
    horizon = row["soil_horizon"]
    dataobj = row["data_object_id"]
    geo_loc = row["geo_loc_name"]
    data_object_id = row["data_object_id"]

    try:
        response = requests.get(url)
        tsv_data = StringIO(response.text)
    
        tsv_df = pd.read_csv(tsv_data, delimiter="\t")
        tsv_data.close()
    
        # Give columns names
        tsv_df.columns = ["contig_id", "taxa", "initial_count"]
    
        # split taxa column into a list where a semicolon (;) is the delimeter
        tsv_df["taxa"] = tsv_df["taxa"].str.split(";")

        # Get only the third element of the list of taxa (the phylum), add "Unknown" it it does not include phylum level, and add
        # "Unkown" if the taxa value is empty.
        tsv_df["taxa"] = tsv_df["taxa"].apply(lambda x: str(x[2]) if isinstance(x, list) and len(x) >= 3 
                                              else str(" ".join(x) + " Unknown") if isinstance(x, list) else "Unknown")


        # Get relative abundance for the tsv_df
        tsv_df = tsv_df.groupby("taxa").size().reset_index(name="count")
        total_count = tsv_df["count"].sum()
        tsv_df["relative_abundance"] = (tsv_df["count"] / total_count) * 100

        # Add geo location to data frame
        tsv_df["geo_loc_name"] = geo_loc

        # Add biosample id to data frame
        tsv_df["data_object_id"] = dataobj
        tsv_df["tsv_url"] = url

        # append tsv_df to list depending on the soil horizon type
        if horizon == "O horizon":
            o_horizon.append(tsv_df)
        else:
            m_horizon.append(tsv_df)

    except Exception as e:
        print(f"An error occurred: {e}")
        errors.append({
            "data_ob_id": dataobj,
            "url": url,
            "horizon": horizon,
            "geo_loc_name": geo_loc
                                    })
        continue

# concatenate list of dfs
o_df = pd.concat(o_horizon)
m_df = pd.concat(m_horizon)

m_df
Out[12]:
taxa count relative_abundance geo_loc_name data_object_id tsv_url
0 Acidimicrobiia 202 1.189285 USA: Colorado, North Sterling nmdc:dobj-11-xe3ayr72 https://data.microbiomedata.org/data/nmdc:ompr...
1 Acidithiobacillia 2 0.011775 USA: Colorado, North Sterling nmdc:dobj-11-xe3ayr72 https://data.microbiomedata.org/data/nmdc:ompr...
2 Actinomycetes 5449 32.081248 USA: Colorado, North Sterling nmdc:dobj-11-xe3ayr72 https://data.microbiomedata.org/data/nmdc:ompr...
3 Agaricomycetes 2 0.011775 USA: Colorado, North Sterling nmdc:dobj-11-xe3ayr72 https://data.microbiomedata.org/data/nmdc:ompr...
4 Alphaproteobacteria 2144 12.622903 USA: Colorado, North Sterling nmdc:dobj-11-xe3ayr72 https://data.microbiomedata.org/data/nmdc:ompr...
... ... ... ... ... ... ...
186 unclassified Synergistota 2 0.004763 USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-byrh0203 https://data.microbiomedata.org/data/nmdc:ompr...
187 unclassified Thermoproteota 1 0.002381 USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-byrh0203 https://data.microbiomedata.org/data/nmdc:ompr...
188 unclassified Verrucomicrobiota 43 0.102400 USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-byrh0203 https://data.microbiomedata.org/data/nmdc:ompr...
189 unclassified candidate division NC10 17 0.040484 USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-byrh0203 https://data.microbiomedata.org/data/nmdc:ompr...
190 unclassified candidate division Zixibacteria 7 0.016670 USA: Colorado, Central Plains Experimental Range nmdc:dobj-11-byrh0203 https://data.microbiomedata.org/data/nmdc:ompr...

2370 rows × 6 columns

Look into any errors that occurred from the TSV requests¶

Any TSVs that could not be requested were added to an errors dictionary.

In [13]:
print(errors)
[]

Define a function to calculate abundance¶

A function is defined that takes an input of a dataframe and calculates the average relative abundance of each taxa.

In [14]:
def taxa_abundance(df):

    df = df.drop_duplicates(subset=['data_object_id', 'taxa'])

    # pivot the table to find all combos of biosample and taxa - set NAs to 0 for relative abundance
    wide_df = df.pivot(index = "data_object_id", columns = "taxa", values = "relative_abundance")
    wide_df = wide_df.fillna(0)
    wide_df.reset_index(inplace=True)
    
    # convert wide_df back with relative_abundances set to 0 for samples that were missing taxa
    melted_df = pd.melt(wide_df, id_vars = "data_object_id", var_name = "taxa", value_name = "relative_abundance")

    # calculate abundance and add column to data frame
    final_df = melted_df.groupby("taxa")["relative_abundance"].mean().reset_index(name="avg_relative_abundance")

    return final_df

Calculate the abundance of the O and M horizon data frames¶

Using the function defined above, the counts_m and counts_o data frames returned from iterating over the TSV files are used as input into the function, where the average relative abundance calculations are returned as data frames. We then concatenate the two data frames together, creating a new column for soil_horizon, where the value is either O or M, depending on which data frame it originally came from.

In [15]:
# caculate abundance for each soil horizon type and get top 25 taxa, grouping the rest
m_final = taxa_abundance(m_df)
o_final = taxa_abundance(o_df)

# combine data frames
o_final["soil_horizon"] = "O"
m_final["soil_horizon"] = "M"
abundance_df = pd.concat([o_final, m_final])

abundance_df
Out[15]:
taxa avg_relative_abundance soil_horizon
0 Acidimicrobiia 0.403785 O
1 Acidithiobacillia 0.030016 O
2 Aconoidasida 0.004107 O
3 Actinomycetes 22.405088 O
4 Actinopteri 0.001641 O
... ... ... ...
280 unclassified Zoopagomycota 0.000249 M
281 unclassified candidate division NC10 0.088085 M
282 unclassified candidate division Zixibacteria 0.010473 M
283 unclassified dsDNA viruses, no RNA stage 0.000338 M
284 unclassified viruses 0.000100 M

601 rows × 3 columns

Plot the taxa abundance of M vs. O horizon soil samples¶

Using the plotly library, the percent abundance of the taxa is plotted as a bar chart - each bar representing the soil horizon and the colors representing the taxa.

In [16]:
# Plot the taxa abundance of each soil type
fig = px.bar(abundance_df, x="soil_horizon", y="avg_relative_abundance", color="taxa", 
             title = "% Abundance of phylum-level taxa in M and O horizon soil samples in Colorado", 
             labels = {"soil_horizon": "Soil Horizon", "avg_relative_abundance": "% Abundance"})
    
fig.update_layout(height=600)
fig.show()          

Write a function to calculate the abundance per location¶

This is a function to use with the m_df and o_df outputs from the TSV iteration to calculate the % abundance for each geo_loc_name. It also groups the taxa after the top 5 for each loaction into "other".

In [17]:
def loc_abund(df):

    df = df.drop_duplicates(subset=['data_object_id', 'taxa'])

    # pivot the table to find all combos of biosample and taxa - set NAs to 0 for relative abundance
    wide_df = df.pivot(index = "data_object_id", columns = "taxa", values = "relative_abundance")
    wide_df = wide_df.fillna(0)
    wide_df.reset_index(inplace=True)

    # Add geo_loc_name column to wide_df
    wide_df = pd.merge(wide_df, df[['data_object_id', 'geo_loc_name']], on='data_object_id', how='left')
    
    # convert wide_df back with relative_abundances set to 0 for samples that were missing taxa
    melted_df = pd.melt(wide_df, id_vars=["data_object_id", "geo_loc_name"], var_name="taxa", value_name="relative_abundance")

    final_df = melted_df.groupby(["geo_loc_name", "taxa"])["relative_abundance"].mean().reset_index(name="avg_relative_abundance")

    return final_df

Calculate the abundance of the location data frames¶

Using the function defined above, the m_df and the o_df data frames returned from iterating over the TSV files are used as input into the function, where the final abundance calculations and top 5 taxa are returned as data frames. We do the calculation by grouping by geo_loc_name in order to calculate abundances per location. We then concatenate the two data frames together, creating a new column for soil_horizon, where the value is either O or M, depending on which data frame it originally came from.

In [18]:
# caculate abundance for each soil horizon type and get top 5 taxa, grouping the rest
m_loc = loc_abund(m_df)
o_loc = loc_abund(o_df)

# combine data frames
o_loc["soil_horizon"] = "O"
m_loc["soil_horizon"] = "M"
loc_abund_df = pd.concat([o_loc, m_loc])

# Extract only region names from geo_loc_name
loc_abund_df["location"] = loc_abund_df["geo_loc_name"].str.extract(r'Colorado, (.*)')

loc_abund_df
Out[18]:
geo_loc_name taxa avg_relative_abundance soil_horizon location
0 USA: Colorado, Niwot Ridge Acidimicrobiia 0.438931 O Niwot Ridge
1 USA: Colorado, Niwot Ridge Acidithiobacillia 0.031287 O Niwot Ridge
2 USA: Colorado, Niwot Ridge Aconoidasida 0.008996 O Niwot Ridge
3 USA: Colorado, Niwot Ridge Actinomycetes 20.892101 O Niwot Ridge
4 USA: Colorado, Niwot Ridge Actinopteri 0.001932 O Niwot Ridge
... ... ... ... ... ...
1135 USA: Colorado, Rocky Mountains unclassified Zoopagomycota 0.000000 M Rocky Mountains
1136 USA: Colorado, Rocky Mountains unclassified candidate division NC10 0.095782 M Rocky Mountains
1137 USA: Colorado, Rocky Mountains unclassified candidate division Zixibacteria 0.023120 M Rocky Mountains
1138 USA: Colorado, Rocky Mountains unclassified dsDNA viruses, no RNA stage 0.000000 M Rocky Mountains
1139 USA: Colorado, Rocky Mountains unclassified viruses 0.000000 M Rocky Mountains

1772 rows × 5 columns

Plot the taxa abundance of M and O horizon soil samples for each location¶

Using the plotly library, the percent abundance of the taxa is plotted as a bar chart for each geo location and faceted by soil horizon.

In [19]:
geo_fig = px.bar(loc_abund_df, x = "soil_horizon", y="avg_relative_abundance", color = "taxa", 
                 facet_col = "location",
                 facet_col_spacing = 0.1,
                 title = "% Abundance of phylum-level taxa in M and O horizon samples for each Colorado location", 
                 labels = {"geo_loc_name": "Location", "avg_relative_abundance": "% Abundance"},
                 height = 600)
# update figure to remove "location=" from facet column labels
geo_fig.for_each_annotation(lambda a: a.update(text=a.text.replace("location=", "")))

# show figure
geo_fig.show()