How does the taxonomic distribution of contigs differ by soil layer (mineral vs organic) in Colorado?¶

This notebook uses the NMDC API endpoints to explore how the taxonomic distribution of metagenome contigs differ by the mineral and organic soil layers in Colorado. It involves 9 API requests to reach the scaffold lineage TSV data objects in order to analyze the taxanomic distribution. Downloading the taxonomic results files may be time consuming.

In [1]:
# Load essential libraries
library(jsonlite, warn.conflicts=FALSE)
library(dplyr, warn.conflicts=FALSE)
library(tidyr, warn.conflicts=FALSE)
library(readr, warn.conflicts=FALSE)
library(ggplot2, warn.conflicts=FALSE)

if(Sys.getenv("COLAB_BACKEND_VERSION") == "") source("../../utility_functions.R")

if(Sys.getenv("COLAB_BACKEND_VERSION") != "") source("http://raw.githubusercontent.com/microbiomedata/nmdc_notebooks/refs/heads/main/utility_functions.R")

1. Get all biosamples where soil_horizon exists and the geo_loc_name has "Colorado" in the name¶

The first step in answering how the taxonomic distribution of contigs differ by soil layer is to get a list of all the biosamples that have metadata for soil_horizon and a string matching "Colorado, Rocky Mountains" for the geo_loc_name. We use the get_all_results function (defined in utility_functions.R) to do this. We query the biosample_set collection with a mongo-like filter of {"soil_horizon":{"$exists": true}, "geo_loc_name.has_raw_value": {"$regex": "Colorado"}}, a maximum page size of 100, and specifying that we want three fields returned id, soil_horizon, and geo_loc_name. Note that id is always returned. Since we will be joining the results of multiple API requests with a field of id for different collections, we can change the name of the id key to be more explicit - calling it biosample_id instead.

In [2]:
# Get biosamples using get_all_results function
biosample_df <- get_all_results(
    collection = 'biosample_set', 
    filter_text = '{"soil_horizon":{"$exists": true}, "geo_loc_name.has_raw_value": {"$regex": "Colorado"}}', 
    max_page_size = 100, 
    fields = 'id,soil_horizon,geo_loc_name'
    )

# Clarify the column names
biosample_df <- biosample_df %>%
    unnest(cols = geo_loc_name, names_sep = "_") %>% 
    rename(biosample_id = id,
           geo_loc_name = geo_loc_name_has_raw_value) %>%
    distinct()
head(biosample_df)
A tibble: 6 × 4
biosample_idsoil_horizongeo_loc_namegeo_loc_name_type
<chr><chr><chr><chr>
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValue
nmdc:bsm-11-06ta8e31M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValue
nmdc:bsm-11-06tgpb52O horizonUSA: Colorado, Rocky Mountains nmdc:TextValue
nmdc:bsm-11-0asn5d63M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValue
nmdc:bsm-11-0djp2e45M horizonUSA: Colorado, North Sterling nmdc:TextValue
nmdc:bsm-11-0f43ab20M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValue

2. Get all Pooling results where the Pooling has_input are the biosample ids¶

We use the get_results_by_id function to get a list of all pooling results whose has_input values are any of the biosample_ids we retrieved in step 1. After, the pooling results are unnested to a flat data frame, and the names are cleaned up so it is clear which collection the results are from. Because Pooling is a subclass of MaterialProcessing the pooling records are found in the material_processing_set.

In [3]:
pooling_df <- get_results_by_id(
    collection = 'material_processing_set',
    match_id_field = 'has_input',
    id_list = biosample_df$biosample_id,
    fields = 'id,has_input,has_output',
    max_page_size = 20
)

# Unnest the has_input and has_output columns, get unique results, and rename the columns.
pooling_df2 <- pooling_df %>%
    unnest(cols = c(has_input, has_output), names_sep = "_") %>%
    distinct() %>%
    rename(pooling_id = id,
           biosample_id = has_input,
           pooling_has_output = has_output) %>%
    filter(grepl(pattern = "poolp", x = pooling_id, fixed = TRUE))
head(pooling_df2)
A tibble: 6 × 3
pooling_idbiosample_idpooling_has_output
<chr><chr><chr>
nmdc:poolp-11-0xm4qn65nmdc:bsm-11-0yw1rj05nmdc:procsm-11-vqjw0h14
nmdc:poolp-11-0xm4qn65nmdc:bsm-11-5wwh4520nmdc:procsm-11-vqjw0h14
nmdc:poolp-11-0xm4qn65nmdc:bsm-11-asgy0756nmdc:procsm-11-vqjw0h14
nmdc:poolp-11-2rdwfn64nmdc:bsm-11-1x3q4h45nmdc:procsm-11-8g7d3a43
nmdc:poolp-11-2rdwfn64nmdc:bsm-11-xtdee386nmdc:procsm-11-8g7d3a43
nmdc:poolp-11-2rdwfn64nmdc:bsm-11-1dxe8825nmdc:procsm-11-8g7d3a43

Merge the biosample and pooling dataframes together to get a dataframe with biosample and pooling data.

In [4]:
biosample_df2 <- left_join(biosample_df, pooling_df2, by = 'biosample_id') %>%
    filter(!is.na(pooling_id))
head(biosample_df2)
A tibble: 6 × 6
biosample_idsoil_horizongeo_loc_namegeo_loc_name_typepooling_idpooling_has_output
<chr><chr><chr><chr><chr><chr>
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-gxv2dy50nmdc:procsm-11-ytthx235
nmdc:bsm-11-06ta8e31M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-5e2asm75nmdc:procsm-11-5s07gt34
nmdc:bsm-11-06tgpb52O horizonUSA: Colorado, Rocky Mountains nmdc:TextValuenmdc:poolp-11-qq41ss20nmdc:procsm-11-ez7edj21
nmdc:bsm-11-0asn5d63M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-pak1ws91nmdc:procsm-11-y8w3sk61
nmdc:bsm-11-0djp2e45M horizonUSA: Colorado, North Sterling nmdc:TextValuenmdc:poolp-11-vfkwpy98nmdc:procsm-11-258vbz70
nmdc:bsm-11-0f43ab20M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-ay38nw70nmdc:procsm-11-mr5hf033

From the pooling_has_output IDs (all starting with "nmdc:procsm"), we can tell that the output from the pooling step is a ProcessedSample. Therefore, we will rename the column from pooling_has_output to processed_sample_id.

In [5]:
biosample_df3 <- biosample_df2 %>%
    rename(processed_sample_id = pooling_has_output) 
head(biosample_df3)
A tibble: 6 × 6
biosample_idsoil_horizongeo_loc_namegeo_loc_name_typepooling_idprocessed_sample_id
<chr><chr><chr><chr><chr><chr>
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-gxv2dy50nmdc:procsm-11-ytthx235
nmdc:bsm-11-06ta8e31M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-5e2asm75nmdc:procsm-11-5s07gt34
nmdc:bsm-11-06tgpb52O horizonUSA: Colorado, Rocky Mountains nmdc:TextValuenmdc:poolp-11-qq41ss20nmdc:procsm-11-ez7edj21
nmdc:bsm-11-0asn5d63M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-pak1ws91nmdc:procsm-11-y8w3sk61
nmdc:bsm-11-0djp2e45M horizonUSA: Colorado, North Sterling nmdc:TextValuenmdc:poolp-11-vfkwpy98nmdc:procsm-11-258vbz70
nmdc:bsm-11-0f43ab20M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-ay38nw70nmdc:procsm-11-mr5hf033

3. Get Extraction records where processed_sample_id identifier is the has_input to the Extraction¶

The get_id_results function is used, again (you can see the pattern), again querying the material_processing_set (because Extraction records are also a subclass of MaterialProcessing using the processed_sample1 identifier as the has_input for the material_processing_set. The resulting dataframe is unnested and the names are adjusted to make it clear which set the inputs, outputs, and ids are from.

In [6]:
extraction_df <- get_results_by_id(
    collection = 'material_processing_set',
    match_id_field = 'has_input',
    id_list = unique(biosample_df3$processed_sample_id),
    fields = 'id,has_input,has_output',
    max_page_size = 20
)

extraction_df <- extraction_df %>%
    unnest(cols = c(has_input, has_output), names_sep = "_") %>%
    distinct() %>%
    rename(extraction_id = id,
           processed_sample_id = has_input,
           extraction_has_output = has_output)
head(extraction_df)
A tibble: 6 × 3
processed_sample_idextraction_has_outputextraction_id
<chr><chr><chr>
nmdc:procsm-11-49bwy122nmdc:procsm-11-kwaaah42nmdc:extrp-11-fsv8td81
nmdc:procsm-11-kngzyt90nmdc:procsm-11-h9s7h174nmdc:extrp-11-v25scb12
nmdc:procsm-11-mr5hf033nmdc:procsm-11-7qy2y664nmdc:extrp-11-gnvf5s35
nmdc:procsm-11-33n4p085nmdc:procsm-11-6xc6vy98nmdc:extrp-11-j5qc7973
nmdc:procsm-11-2fxf0e98nmdc:procsm-11-x763xr38nmdc:extrp-11-y5ewyv43
nmdc:procsm-11-y8w3sk61nmdc:procsm-11-q086v208nmdc:extrp-11-9qd5ke92

Merge the extraction data with the biosample, pooling, and processed sample data

In [7]:
biosample_df4 <- biosample_df3 %>%
    left_join(extraction_df, by = join_by(processed_sample_id))
head(biosample_df4)
A tibble: 6 × 8
biosample_idsoil_horizongeo_loc_namegeo_loc_name_typepooling_idprocessed_sample_idextraction_has_outputextraction_id
<chr><chr><chr><chr><chr><chr><chr><chr>
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-gxv2dy50nmdc:procsm-11-ytthx235nmdc:procsm-11-cd8pg312nmdc:extrp-11-c0kyyp83
nmdc:bsm-11-06ta8e31M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-5e2asm75nmdc:procsm-11-5s07gt34nmdc:procsm-11-edpstj65nmdc:extrp-11-76s2tz21
nmdc:bsm-11-06tgpb52O horizonUSA: Colorado, Rocky Mountains nmdc:TextValuenmdc:poolp-11-qq41ss20nmdc:procsm-11-ez7edj21nmdc:procsm-11-p09rbx50nmdc:extrp-11-faz6a041
nmdc:bsm-11-0asn5d63M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-pak1ws91nmdc:procsm-11-y8w3sk61nmdc:procsm-11-q086v208nmdc:extrp-11-9qd5ke92
nmdc:bsm-11-0djp2e45M horizonUSA: Colorado, North Sterling nmdc:TextValuenmdc:poolp-11-vfkwpy98nmdc:procsm-11-258vbz70nmdc:procsm-11-nrrknt87nmdc:extrp-11-qg3zf244
nmdc:bsm-11-0f43ab20M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-ay38nw70nmdc:procsm-11-mr5hf033nmdc:procsm-11-7qy2y664nmdc:extrp-11-gnvf5s35

From the extraction_has_output IDs (all starting with "nmdc:procsm"), we can tell that the output from the extraction step is a ProcessedSample. Therefore, we will rename the column from extraction_has_output to processed_sample_id2.

In [8]:
biosample_df5 <- biosample_df4 %>%
    rename(processed_sample_id2 = extraction_has_output)
head(biosample_df5)
A tibble: 6 × 8
biosample_idsoil_horizongeo_loc_namegeo_loc_name_typepooling_idprocessed_sample_idprocessed_sample_id2extraction_id
<chr><chr><chr><chr><chr><chr><chr><chr>
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-gxv2dy50nmdc:procsm-11-ytthx235nmdc:procsm-11-cd8pg312nmdc:extrp-11-c0kyyp83
nmdc:bsm-11-06ta8e31M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-5e2asm75nmdc:procsm-11-5s07gt34nmdc:procsm-11-edpstj65nmdc:extrp-11-76s2tz21
nmdc:bsm-11-06tgpb52O horizonUSA: Colorado, Rocky Mountains nmdc:TextValuenmdc:poolp-11-qq41ss20nmdc:procsm-11-ez7edj21nmdc:procsm-11-p09rbx50nmdc:extrp-11-faz6a041
nmdc:bsm-11-0asn5d63M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-pak1ws91nmdc:procsm-11-y8w3sk61nmdc:procsm-11-q086v208nmdc:extrp-11-9qd5ke92
nmdc:bsm-11-0djp2e45M horizonUSA: Colorado, North Sterling nmdc:TextValuenmdc:poolp-11-vfkwpy98nmdc:procsm-11-258vbz70nmdc:procsm-11-nrrknt87nmdc:extrp-11-qg3zf244
nmdc:bsm-11-0f43ab20M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-ay38nw70nmdc:procsm-11-mr5hf033nmdc:procsm-11-7qy2y664nmdc:extrp-11-gnvf5s35

4. Get the LibraryPreparation records¶

Using the processed_sample_id2 identifiers from the last query as the has_input again for the the material_processing_set (because LibraryPreparation is a subclass of MaterialProcessing), we get a new batch of results, returning the library preparation identifiers, inputs and outputs. The results is unnested and names are clarified to demonstrate they are associated with LibraryPreparation records.

In [9]:
library_prep_df <- get_results_by_id(
    collection = 'material_processing_set',
    match_id_field = 'has_input',
    id_list = unique(biosample_df5$processed_sample_id2),
    fields = 'id,has_input,has_output',
    max_page_size = 20
)

library_prep_df <- library_prep_df %>%
    unnest(cols = c(has_input,has_output), names_sep = "_") %>%
    distinct() %>%
    rename(library_preparation_id = id,
           processed_sample_id2 = has_input,
           library_preparation_has_output = has_output)
head(library_prep_df)
A tibble: 6 × 3
processed_sample_id2library_preparation_has_outputlibrary_preparation_id
<chr><chr><chr>
nmdc:procsm-11-kwaaah42nmdc:procsm-11-kfkbxp22nmdc:libprp-11-k5j44e20
nmdc:procsm-11-7qy2y664nmdc:procsm-11-wd4s5f38nmdc:libprp-11-wv6p0032
nmdc:procsm-11-x763xr38nmdc:procsm-11-9ghwha16nmdc:libprp-11-gasf6t26
nmdc:procsm-11-h9s7h174nmdc:procsm-11-4z512838nmdc:libprp-11-t70f6032
nmdc:procsm-11-6xc6vy98nmdc:procsm-11-44e5ds31nmdc:libprp-11-8bzn7n07
nmdc:procsm-11-q086v208nmdc:procsm-11-jkvhv341nmdc:libprp-11-24s1rh35

Merge the library preparation data with the biosample, pooling, processed sample, extraction, and processed sample data

In [10]:
biosample_df6 <- biosample_df5 %>%
    left_join(library_prep_df, by = join_by(processed_sample_id2))
head(biosample_df6)
A tibble: 6 × 10
biosample_idsoil_horizongeo_loc_namegeo_loc_name_typepooling_idprocessed_sample_idprocessed_sample_id2extraction_idlibrary_preparation_has_outputlibrary_preparation_id
<chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-gxv2dy50nmdc:procsm-11-ytthx235nmdc:procsm-11-cd8pg312nmdc:extrp-11-c0kyyp83nmdc:procsm-11-jrykhg31nmdc:libprp-11-2szbj346
nmdc:bsm-11-06ta8e31M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-5e2asm75nmdc:procsm-11-5s07gt34nmdc:procsm-11-edpstj65nmdc:extrp-11-76s2tz21nmdc:procsm-11-tq69qx97nmdc:libprp-11-pqjwcw60
nmdc:bsm-11-06tgpb52O horizonUSA: Colorado, Rocky Mountains nmdc:TextValuenmdc:poolp-11-qq41ss20nmdc:procsm-11-ez7edj21nmdc:procsm-11-p09rbx50nmdc:extrp-11-faz6a041nmdc:procsm-11-v7s6qh96nmdc:libprp-11-4qwse385
nmdc:bsm-11-0asn5d63M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-pak1ws91nmdc:procsm-11-y8w3sk61nmdc:procsm-11-q086v208nmdc:extrp-11-9qd5ke92nmdc:procsm-11-jkvhv341nmdc:libprp-11-24s1rh35
nmdc:bsm-11-0djp2e45M horizonUSA: Colorado, North Sterling nmdc:TextValuenmdc:poolp-11-vfkwpy98nmdc:procsm-11-258vbz70nmdc:procsm-11-nrrknt87nmdc:extrp-11-qg3zf244nmdc:procsm-11-t397mj03nmdc:libprp-11-p07zpd31
nmdc:bsm-11-0f43ab20M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-ay38nw70nmdc:procsm-11-mr5hf033nmdc:procsm-11-7qy2y664nmdc:extrp-11-gnvf5s35nmdc:procsm-11-wd4s5f38nmdc:libprp-11-wv6p0032

Again, from the library_preparation_has_output IDs (all starting with "nmdc:procsm"), we can tell that the output to the library preparation step is a ProcessedSample. Therefore, we will rename the column from library_preparation_has_output to processed_sample_id3. |

In [11]:
biosample_df7 <- biosample_df6 %>%
    rename(processed_sample_id3 = library_preparation_has_output)
head(biosample_df7)
A tibble: 6 × 10
biosample_idsoil_horizongeo_loc_namegeo_loc_name_typepooling_idprocessed_sample_idprocessed_sample_id2extraction_idprocessed_sample_id3library_preparation_id
<chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-gxv2dy50nmdc:procsm-11-ytthx235nmdc:procsm-11-cd8pg312nmdc:extrp-11-c0kyyp83nmdc:procsm-11-jrykhg31nmdc:libprp-11-2szbj346
nmdc:bsm-11-06ta8e31M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-5e2asm75nmdc:procsm-11-5s07gt34nmdc:procsm-11-edpstj65nmdc:extrp-11-76s2tz21nmdc:procsm-11-tq69qx97nmdc:libprp-11-pqjwcw60
nmdc:bsm-11-06tgpb52O horizonUSA: Colorado, Rocky Mountains nmdc:TextValuenmdc:poolp-11-qq41ss20nmdc:procsm-11-ez7edj21nmdc:procsm-11-p09rbx50nmdc:extrp-11-faz6a041nmdc:procsm-11-v7s6qh96nmdc:libprp-11-4qwse385
nmdc:bsm-11-0asn5d63M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-pak1ws91nmdc:procsm-11-y8w3sk61nmdc:procsm-11-q086v208nmdc:extrp-11-9qd5ke92nmdc:procsm-11-jkvhv341nmdc:libprp-11-24s1rh35
nmdc:bsm-11-0djp2e45M horizonUSA: Colorado, North Sterling nmdc:TextValuenmdc:poolp-11-vfkwpy98nmdc:procsm-11-258vbz70nmdc:procsm-11-nrrknt87nmdc:extrp-11-qg3zf244nmdc:procsm-11-t397mj03nmdc:libprp-11-p07zpd31
nmdc:bsm-11-0f43ab20M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-ay38nw70nmdc:procsm-11-mr5hf033nmdc:procsm-11-7qy2y664nmdc:extrp-11-gnvf5s35nmdc:procsm-11-wd4s5f38nmdc:libprp-11-wv6p0032

5. Get NucleotideSequencing records from the processed sample identifiers¶

Using the third batch of processed sample identifiers, we query the data_generation_set on the has_input field to retreive NucleotideSequencing record (because NucleotideSequencing is a subclass of DataGeneration). The id and has_input field names are changed to specify that they came from the NucleotideSequencing records.

In [12]:
data_generation_df <- get_results_by_id(
    collection = 'data_generation_set',
    match_id_field = 'has_input',
    id_list = unique(biosample_df7$processed_sample_id3),
    fields = 'id,has_input',
    max_page_size = 20
)

data_generation_df <- data_generation_df %>%
    unnest(cols = c(has_input), names_sep = "_") %>%
    rename(data_generation_id = id,
           processed_sample_id3 = has_input)
head(data_generation_df)
A tibble: 6 × 2
processed_sample_id3data_generation_id
<chr><chr>
nmdc:procsm-11-3s5m9a70nmdc:omprc-11-2937gz63
nmdc:procsm-11-43n6yz70nmdc:omprc-11-g1n61y55
nmdc:procsm-11-44e5ds31nmdc:dgns-11-ekte1238
nmdc:procsm-11-4jj6k690nmdc:omprc-11-yt96hb84
nmdc:procsm-11-4z512838nmdc:omprc-11-afejca38
nmdc:procsm-11-7cpyc435nmdc:omprc-11-by9r5p41

Merge the data generation records with the biosample, pooling, processed sample, extraction, processed sample, library preparation, and processed sample records.

In [13]:
biosample_df8 <- biosample_df7 %>%
    left_join(data_generation_df, by = join_by(processed_sample_id3)) %>%
    filter(!is.na(data_generation_id))
head(biosample_df8)
A tibble: 6 × 11
biosample_idsoil_horizongeo_loc_namegeo_loc_name_typepooling_idprocessed_sample_idprocessed_sample_id2extraction_idprocessed_sample_id3library_preparation_iddata_generation_id
<chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-gxv2dy50nmdc:procsm-11-ytthx235nmdc:procsm-11-cd8pg312nmdc:extrp-11-c0kyyp83nmdc:procsm-11-jrykhg31nmdc:libprp-11-2szbj346nmdc:omprc-11-63ajbd04
nmdc:bsm-11-06ta8e31M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-5e2asm75nmdc:procsm-11-5s07gt34nmdc:procsm-11-edpstj65nmdc:extrp-11-76s2tz21nmdc:procsm-11-tq69qx97nmdc:libprp-11-pqjwcw60nmdc:omprc-11-769ab655
nmdc:bsm-11-06tgpb52O horizonUSA: Colorado, Rocky Mountains nmdc:TextValuenmdc:poolp-11-qq41ss20nmdc:procsm-11-ez7edj21nmdc:procsm-11-p09rbx50nmdc:extrp-11-faz6a041nmdc:procsm-11-v7s6qh96nmdc:libprp-11-4qwse385nmdc:omprc-11-597mc608
nmdc:bsm-11-0asn5d63M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-pak1ws91nmdc:procsm-11-y8w3sk61nmdc:procsm-11-q086v208nmdc:extrp-11-9qd5ke92nmdc:procsm-11-jkvhv341nmdc:libprp-11-24s1rh35nmdc:dgns-11-j3w06312
nmdc:bsm-11-0djp2e45M horizonUSA: Colorado, North Sterling nmdc:TextValuenmdc:poolp-11-vfkwpy98nmdc:procsm-11-258vbz70nmdc:procsm-11-nrrknt87nmdc:extrp-11-qg3zf244nmdc:procsm-11-t397mj03nmdc:libprp-11-p07zpd31nmdc:dgns-11-jxh3ht55
nmdc:bsm-11-0gmd9f09M horizonUSA: Colorado, Niwot Ridge nmdc:TextValuenmdc:poolp-11-j2a8me71nmdc:procsm-11-ajxtnq50nmdc:procsm-11-9txgw141nmdc:extrp-11-q6h50527nmdc:procsm-11-ftq9ab63nmdc:libprp-11-qxxjk032nmdc:omprc-11-gaptm502

6. Get the MetagenomeAnnotation ids using the DataGeneration identifiers¶

The workflow_execution_set is queried to retrieve MetagenomeAnnotation records (because MetagenomeAnnotation is a subclass of WorkflowExecution) using the identifiers obtained from the data generation to match with the was_informed_by field in the workflow_execution_set. Field names are clarified, once again to specify the collection they came from.

In [14]:
metagenome_annotation_df <- get_results_by_id(
    collection = 'workflow_execution_set',
    match_id_field = 'was_informed_by',
    id_list = unique(biosample_df8$data_generation_id),
    fields = 'id,was_informed_by,has_output,type,version',
    max_page_size = 20
    ) 
    
head(metagenome_annotation_df)
A data.frame: 6 × 5
idtypehas_outputwas_informed_byversion
<chr><chr><list><list><chr>
1nmdc:wfmag-11-2araz898.1nmdc:MagsAnalysisnmdc:dobj-11-qcjne029, nmdc:dobj-11-v6v4az06, nmdc:dobj-11-99ze1q07, nmdc:dobj-11-82mxmg53, nmdc:dobj-11-tdk66274, nmdc:dobj-11-x0k49271, nmdc:dobj-11-hv9we043, nmdc:dobj-11-hbmfj707, nmdc:dobj-11-970hbm36nmdc:omprc-11-63ajbd04v1.3.16
2nmdc:wfmag-11-3pdsac32.1nmdc:MagsAnalysisnmdc:dobj-11-bjarz597, nmdc:dobj-11-5sa1hc37, nmdc:dobj-11-ypaebx18, nmdc:dobj-11-f4v00f74, nmdc:dobj-11-pgtn0f33, nmdc:dobj-11-ftexk841, nmdc:dobj-11-v80h9572, nmdc:dobj-11-rcrgkr35, nmdc:dobj-11-13a4c730nmdc:dgns-11-j3w06312v1.3.16
3nmdc:wfmag-11-6beb5p13.1nmdc:MagsAnalysisnmdc:dobj-11-te4k2925, nmdc:dobj-11-7n3mbm30, nmdc:dobj-11-804v6747, nmdc:dobj-11-rfr5yn37, nmdc:dobj-11-7mb5fc28, nmdc:dobj-11-9tmx4w60, nmdc:dobj-11-6b2ytn94, nmdc:dobj-11-meg6gk11, nmdc:dobj-11-m78rp150nmdc:dgns-11-jxh3ht55v1.3.16
4nmdc:wfmag-11-7628dd79.1nmdc:MagsAnalysisnmdc:dobj-11-y1es6k19, nmdc:dobj-11-2542ex33, nmdc:dobj-11-8zz56n07, nmdc:dobj-11-av2mf726, nmdc:dobj-11-8bega838, nmdc:dobj-11-t551g482, nmdc:dobj-11-vkd6b435, nmdc:dobj-11-5zkkqt82, nmdc:dobj-11-rrn00m11nmdc:dgns-11-ekte1238v1.3.16
5nmdc:wfmag-11-7w0dea36.1nmdc:MagsAnalysisnmdc:dobj-11-72x3pe97, nmdc:dobj-11-1357qx92, nmdc:dobj-11-vqbe9z28, nmdc:dobj-11-pzxbjr81, nmdc:dobj-11-md40ag65, nmdc:dobj-11-a259cq41, nmdc:dobj-11-52va9x46, nmdc:dobj-11-jv4sz961, nmdc:dobj-11-gf8d1x07nmdc:omprc-11-sz2d4412v1.3.16
6nmdc:wfmag-11-94b5df26.2nmdc:MagsAnalysisnmdc:dobj-11-xs2pey73, nmdc:dobj-11-fmh8pk94, nmdc:dobj-11-4wav0343, nmdc:dobj-11-t18a6s46, nmdc:dobj-11-ab31vd45, nmdc:dobj-11-edmb7126, nmdc:dobj-11-mw936884, nmdc:dobj-11-jqxqpv95, nmdc:dobj-11-3xkcbw58nmdc:omprc-11-by9r5p41v1.3.16

This query resulted in all the workflows associated with the DataGeneration identifiers. We can investigate the different types of workflows associated below.

In [15]:
unique(metagenome_annotation_df$type)
  1. 'nmdc:MagsAnalysis'
  2. 'nmdc:MetagenomeAnnotation'
  3. 'nmdc:MetagenomeAssembly'
  4. 'nmdc:ReadBasedTaxonomyAnalysis'
  5. 'nmdc:ReadQcAnalysis'

We are only interested in the outputs of the MetagenomeAnnotation workflow, so we'll filter our dataframe to only these results and clean it up for the next step

In [16]:
metagenome_annotation_df <- metagenome_annotation_df %>%
    filter(type == "nmdc:MetagenomeAnnotation") %>%
    unnest(
        cols = c(
            was_informed_by,
            has_output
        ), names_sep = "_") %>%
    rename(metagenome_annotation_id = id,
           data_generation_id = was_informed_by,
           matagenome_annotation_has_output = has_output,
           workflow_type = type) %>%
    distinct()
head(metagenome_annotation_df)
nrow(metagenome_annotation_df)
A tibble: 6 × 5
metagenome_annotation_idworkflow_typematagenome_annotation_has_outputdata_generation_idversion
<chr><chr><chr><chr><chr>
nmdc:wfmgan-11-05cdqw41.1nmdc:MetagenomeAnnotationnmdc:dobj-11-ndsyd761nmdc:dgns-11-ekte1238v1.1.5
nmdc:wfmgan-11-05cdqw41.1nmdc:MetagenomeAnnotationnmdc:dobj-11-ss1k0e30nmdc:dgns-11-ekte1238v1.1.5
nmdc:wfmgan-11-05cdqw41.1nmdc:MetagenomeAnnotationnmdc:dobj-11-m5g68x38nmdc:dgns-11-ekte1238v1.1.5
nmdc:wfmgan-11-05cdqw41.1nmdc:MetagenomeAnnotationnmdc:dobj-11-22449330nmdc:dgns-11-ekte1238v1.1.5
nmdc:wfmgan-11-05cdqw41.1nmdc:MetagenomeAnnotationnmdc:dobj-11-t33sqd79nmdc:dgns-11-ekte1238v1.1.5
nmdc:wfmgan-11-05cdqw41.1nmdc:MetagenomeAnnotationnmdc:dobj-11-ykw2tv02nmdc:dgns-11-ekte1238v1.1.5
4509

Merge the metagenome annotation data with the biosample, pooling, processed sample, extraction, processed sample, library preparation, processed sample, data generation, and processed sample data.

In [17]:
biosample_df9 <- biosample_df8 %>%
    left_join(metagenome_annotation_df, by = join_by(data_generation_id), relationship = "many-to-many") %>%
    distinct() %>%
    filter(!is.na(metagenome_annotation_id))
head(biosample_df9)
A tibble: 6 × 15
biosample_idsoil_horizongeo_loc_namegeo_loc_name_typepooling_idprocessed_sample_idprocessed_sample_id2extraction_idprocessed_sample_id3library_preparation_iddata_generation_idmetagenome_annotation_idworkflow_typematagenome_annotation_has_outputversion
<chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-gxv2dy50nmdc:procsm-11-ytthx235nmdc:procsm-11-cd8pg312nmdc:extrp-11-c0kyyp83nmdc:procsm-11-jrykhg31nmdc:libprp-11-2szbj346nmdc:omprc-11-63ajbd04nmdc:wfmgan-11-canf7544.1nmdc:MetagenomeAnnotationnmdc:dobj-11-0nmx9476v1.1.5
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-gxv2dy50nmdc:procsm-11-ytthx235nmdc:procsm-11-cd8pg312nmdc:extrp-11-c0kyyp83nmdc:procsm-11-jrykhg31nmdc:libprp-11-2szbj346nmdc:omprc-11-63ajbd04nmdc:wfmgan-11-canf7544.1nmdc:MetagenomeAnnotationnmdc:dobj-11-jqtxmn44v1.1.5
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-gxv2dy50nmdc:procsm-11-ytthx235nmdc:procsm-11-cd8pg312nmdc:extrp-11-c0kyyp83nmdc:procsm-11-jrykhg31nmdc:libprp-11-2szbj346nmdc:omprc-11-63ajbd04nmdc:wfmgan-11-canf7544.1nmdc:MetagenomeAnnotationnmdc:dobj-11-xhrek759v1.1.5
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-gxv2dy50nmdc:procsm-11-ytthx235nmdc:procsm-11-cd8pg312nmdc:extrp-11-c0kyyp83nmdc:procsm-11-jrykhg31nmdc:libprp-11-2szbj346nmdc:omprc-11-63ajbd04nmdc:wfmgan-11-canf7544.1nmdc:MetagenomeAnnotationnmdc:dobj-11-hcgzc591v1.1.5
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-gxv2dy50nmdc:procsm-11-ytthx235nmdc:procsm-11-cd8pg312nmdc:extrp-11-c0kyyp83nmdc:procsm-11-jrykhg31nmdc:libprp-11-2szbj346nmdc:omprc-11-63ajbd04nmdc:wfmgan-11-canf7544.1nmdc:MetagenomeAnnotationnmdc:dobj-11-er6q4n72v1.1.5
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-gxv2dy50nmdc:procsm-11-ytthx235nmdc:procsm-11-cd8pg312nmdc:extrp-11-c0kyyp83nmdc:procsm-11-jrykhg31nmdc:libprp-11-2szbj346nmdc:omprc-11-63ajbd04nmdc:wfmgan-11-canf7544.1nmdc:MetagenomeAnnotationnmdc:dobj-11-jg6awk48v1.1.5

7. Get data objects from the metagenome activity result outputs¶

We query the data_object_set using the metagenome_annotation_has_output identifiers to match the id field in the data objects. We then filter the results for only those results with a data_object_type of Scaffold Lineage tsv (since this has contig taxonomy results). Note that the url is a new field returned that contains the tsvs we will need for the final analysis.

In [18]:
data_object_df <- get_results_by_id(
    collection = 'data_object_set',
    match_id_field = 'id',
    id_list = unique(biosample_df9$matagenome_annotation_has_output),
    fields = 'id,data_object_type,url',
    max_page_size = 50
)

# Filter the data object results to only include the Scaffold Lineage tsv files
data_object_df <- data_object_df %>%
    rename(data_object_id = id) %>%
    filter(data_object_type == 'Scaffold Lineage tsv')
head(data_object_df)
A data.frame: 6 × 3
data_object_iddata_object_typeurl
<chr><chr><chr>
1nmdc:dobj-11-f0shbt75Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:omprc-11-769ab655/nmdc:wfmgan-11-zk5aec25.1/nmdc_wfmgan-11-zk5aec25.1_scaffold_lineage.tsv
2nmdc:dobj-11-jg6awk48Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:omprc-11-63ajbd04/nmdc:wfmgan-11-canf7544.1/nmdc_wfmgan-11-canf7544.1_scaffold_lineage.tsv
3nmdc:dobj-11-jp45gr33Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:omprc-11-597mc608/nmdc:wfmgan-11-3s20yk38.1/nmdc_wfmgan-11-3s20yk38.1_scaffold_lineage.tsv
4nmdc:dobj-11-mmv19z03Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:omprc-11-597mc608/nmdc:wfmgan-11-3s20yk38.2/nmdc_wfmgan-11-3s20yk38.2_scaffold_lineage.tsv
5nmdc:dobj-11-bzzn7310Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:dgns-11-jxh3ht55/nmdc:wfmgan-11-nk9mgk94.1/nmdc_wfmgan-11-nk9mgk94.1_scaffold_lineage.tsv
6nmdc:dobj-11-zv2esq19Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:dgns-11-j3w06312/nmdc:wfmgan-11-5bvrev52.1/nmdc_wfmgan-11-5bvrev52.1_scaffold_lineage.tsv

Merge the data object data with the biosample, pooling, processed sample, extraction, processed sample, library preparation, processed sample, data generation, processed sample, and metagenome annotation data

In [19]:
biosample_df10 <- biosample_df9 %>%
    rename(data_object_id = matagenome_annotation_has_output) %>%
    left_join(data_object_df, by = join_by(data_object_id)) %>%
    distinct() %>%
    filter(!is.na(url))
head(biosample_df10)
A tibble: 6 × 17
biosample_idsoil_horizongeo_loc_namegeo_loc_name_typepooling_idprocessed_sample_idprocessed_sample_id2extraction_idprocessed_sample_id3library_preparation_iddata_generation_idmetagenome_annotation_idworkflow_typedata_object_idversiondata_object_typeurl
<chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
nmdc:bsm-11-00m15h97M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-gxv2dy50nmdc:procsm-11-ytthx235nmdc:procsm-11-cd8pg312nmdc:extrp-11-c0kyyp83nmdc:procsm-11-jrykhg31nmdc:libprp-11-2szbj346nmdc:omprc-11-63ajbd04nmdc:wfmgan-11-canf7544.1nmdc:MetagenomeAnnotationnmdc:dobj-11-jg6awk48v1.1.5Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:omprc-11-63ajbd04/nmdc:wfmgan-11-canf7544.1/nmdc_wfmgan-11-canf7544.1_scaffold_lineage.tsv
nmdc:bsm-11-06ta8e31M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-5e2asm75nmdc:procsm-11-5s07gt34nmdc:procsm-11-edpstj65nmdc:extrp-11-76s2tz21nmdc:procsm-11-tq69qx97nmdc:libprp-11-pqjwcw60nmdc:omprc-11-769ab655nmdc:wfmgan-11-zk5aec25.1nmdc:MetagenomeAnnotationnmdc:dobj-11-f0shbt75v1.1.5Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:omprc-11-769ab655/nmdc:wfmgan-11-zk5aec25.1/nmdc_wfmgan-11-zk5aec25.1_scaffold_lineage.tsv
nmdc:bsm-11-06tgpb52O horizonUSA: Colorado, Rocky Mountains nmdc:TextValuenmdc:poolp-11-qq41ss20nmdc:procsm-11-ez7edj21nmdc:procsm-11-p09rbx50nmdc:extrp-11-faz6a041nmdc:procsm-11-v7s6qh96nmdc:libprp-11-4qwse385nmdc:omprc-11-597mc608nmdc:wfmgan-11-3s20yk38.1nmdc:MetagenomeAnnotationnmdc:dobj-11-jp45gr33v1.0.4Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:omprc-11-597mc608/nmdc:wfmgan-11-3s20yk38.1/nmdc_wfmgan-11-3s20yk38.1_scaffold_lineage.tsv
nmdc:bsm-11-06tgpb52O horizonUSA: Colorado, Rocky Mountains nmdc:TextValuenmdc:poolp-11-qq41ss20nmdc:procsm-11-ez7edj21nmdc:procsm-11-p09rbx50nmdc:extrp-11-faz6a041nmdc:procsm-11-v7s6qh96nmdc:libprp-11-4qwse385nmdc:omprc-11-597mc608nmdc:wfmgan-11-3s20yk38.2nmdc:MetagenomeAnnotationnmdc:dobj-11-mmv19z03v1.0.5Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:omprc-11-597mc608/nmdc:wfmgan-11-3s20yk38.2/nmdc_wfmgan-11-3s20yk38.2_scaffold_lineage.tsv
nmdc:bsm-11-0asn5d63M horizonUSA: Colorado, Central Plains Experimental Rangenmdc:TextValuenmdc:poolp-11-pak1ws91nmdc:procsm-11-y8w3sk61nmdc:procsm-11-q086v208nmdc:extrp-11-9qd5ke92nmdc:procsm-11-jkvhv341nmdc:libprp-11-24s1rh35nmdc:dgns-11-j3w06312 nmdc:wfmgan-11-5bvrev52.1nmdc:MetagenomeAnnotationnmdc:dobj-11-zv2esq19v1.1.5Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:dgns-11-j3w06312/nmdc:wfmgan-11-5bvrev52.1/nmdc_wfmgan-11-5bvrev52.1_scaffold_lineage.tsv
nmdc:bsm-11-0djp2e45M horizonUSA: Colorado, North Sterling nmdc:TextValuenmdc:poolp-11-vfkwpy98nmdc:procsm-11-258vbz70nmdc:procsm-11-nrrknt87nmdc:extrp-11-qg3zf244nmdc:procsm-11-t397mj03nmdc:libprp-11-p07zpd31nmdc:dgns-11-jxh3ht55 nmdc:wfmgan-11-nk9mgk94.1nmdc:MetagenomeAnnotationnmdc:dobj-11-bzzn7310v1.1.5Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:dgns-11-jxh3ht55/nmdc:wfmgan-11-nk9mgk94.1/nmdc_wfmgan-11-nk9mgk94.1_scaffold_lineage.tsv

Clean up the combined results¶

In the final step of retrieving and cleaning the data, we clean up the final merged data frame by removing all of the "joining columns" that are not needed in our final analysis. Because some of the biosamples were pooled, we only retain unique url results (and drop the biosample_id column). The only columns we retain are the soil_horizon, geo_loc_name, and the url to the tsv. The final_df is displayed. We also filter to a single workflow version to avoid including duplicate data from the same samples that were rerun with workflow releases.

In [20]:
biosample_df_final <- biosample_df10 %>%
    select(biosample_id, soil_horizon, geo_loc_name, data_object_id, data_object_type, url, version) %>%
    distinct() %>%
    filter(!is.na(url)) %>%
    # Filter to a single workflow version
    filter(version == "v1.0.4")
head(biosample_df_final)
A tibble: 6 × 7
biosample_idsoil_horizongeo_loc_namedata_object_iddata_object_typeurlversion
<chr><chr><chr><chr><chr><chr><chr>
nmdc:bsm-11-06tgpb52O horizonUSA: Colorado, Rocky Mountainsnmdc:dobj-11-jp45gr33Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:omprc-11-597mc608/nmdc:wfmgan-11-3s20yk38.1/nmdc_wfmgan-11-3s20yk38.1_scaffold_lineage.tsvv1.0.4
nmdc:bsm-11-0gmd9f09M horizonUSA: Colorado, Niwot Ridge nmdc:dobj-11-xt9amn82Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:omprc-11-gaptm502/nmdc:wfmgan-11-me7h8h69.1/nmdc_wfmgan-11-me7h8h69.1_scaffold_lineage.tsvv1.0.4
nmdc:bsm-11-0hz4rd27O horizonUSA: Colorado, Niwot Ridge nmdc:dobj-11-8ybd1f87Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:omprc-11-t2fdcy08/nmdc:wfmgan-11-2a0ap078.1/nmdc_wfmgan-11-2a0ap078.1_scaffold_lineage.tsvv1.0.4
nmdc:bsm-11-0qa78w81M horizonUSA: Colorado, North Sterling nmdc:dobj-11-wn5g7j41Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:omprc-11-yt96hb84/nmdc:wfmgan-11-hv3nyk36.1/nmdc_wfmgan-11-hv3nyk36.1_scaffold_lineage.tsvv1.0.4
nmdc:bsm-11-0yw1rj05M horizonUSA: Colorado, North Sterling nmdc:dobj-11-b6yhf780Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:omprc-11-r5w9js52/nmdc:wfmgan-11-k5a19412.1/nmdc_wfmgan-11-k5a19412.1_scaffold_lineage.tsvv1.0.4
nmdc:bsm-11-13kvnw34M horizonUSA: Colorado, Niwot Ridge nmdc:dobj-11-wh0g6g42Scaffold Lineage tsvhttps://data.microbiomedata.org/data/nmdc:omprc-11-qxq15411/nmdc:wfmgan-11-9v130b74.1/nmdc_wfmgan-11-9v130b74.1_scaffold_lineage.tsvv1.0.4

Show how many results have M horizon vs. O horizon¶

The soil_horizon column can be counted using the count() functionality. There are many more samples with associated Scaffold Lineage tsvs from a M horizon biosample than a O horizon.

In [21]:
biosample_df_final %>%
    count(soil_horizon)
A tibble: 2 × 2
soil_horizonn
<chr><int>
M horizon198
O horizon 43

Example of what the TSV contig taxa file looks like¶

A snippet of the TSV file we need to iterate over to get the taxa abundance for the contigs is shown below. The third column is the initial count for the taxa, where each row is 1.0. However, there are duplicate rows of taxa, meaning there are actually more than 1.0 for several taxa (though they appear as duplicate rows with 1.0 as the count`). We will take this into consideration when we calculate the relative abundance for each taxa.

In [22]:
url <- biosample_df_final$url[1]

# Read the TSV file
contig_taxa_df <- read_tsv(url, col_names = FALSE, show_col_types = FALSE)

# Add column names 
colnames(contig_taxa_df) <- c('contig_id', 'taxa', 'initial_count')

# Show the first few rows
head(contig_taxa_df)
A tibble: 6 × 3
contig_idtaxainitial_count
<chr><chr><dbl>
nmdc:wfmgas-11-qdbye406.1_scf_10000_c1Bacteria;Pseudomonadota;Alphaproteobacteria;Hyphomicrobiales;Nitrobacteraceae;Bradyrhizobium;Bradyrhizobium sp. KBS0725;Bradyrhizobium sp. KBS0725 1
nmdc:wfmgas-11-qdbye406.1_scf_10001_c1Bacteria;Pseudomonadota;Alphaproteobacteria;Hyphomicrobiales;Propylenellaceae;Propylenella;Propylenella binzhouense;Propylenella binzhouense L72 1
nmdc:wfmgas-11-qdbye406.1_scf_10002_c1Bacteria;Pseudomonadota;Alphaproteobacteria;Hyphomicrobiales;Nitrobacteraceae;Tardiphaga;Tardiphaga robiniae;Tardiphaga robiniae 1155 1
nmdc:wfmgas-11-qdbye406.1_scf_10003_c1Bacteria;Pseudomonadota;Alphaproteobacteria;Hyphomicrobiales;Nitrobacteraceae;Bradyrhizobium;Bradyrhizobium lablabi;Bradyrhizobium lablabi GAS165 1
nmdc:wfmgas-11-qdbye406.1_scf_10004_c1Bacteria;Pseudomonadota;Alphaproteobacteria;Hyphomicrobiales;Nitrobacteraceae;Bradyrhizobium;Bradyrhizobium sp. SRL28;Bradyrhizobium sp. SRL28 1
nmdc:wfmgas-11-qdbye406.1_scf_10005_c1Bacteria;Pseudomonadota;Alphaproteobacteria;Hyphomicrobiales;Nitrobacteraceae;Bradyrhizobium;Bradyrhizobium sp. AUGA SZCCT0283;Bradyrhizobium sp. AUGA SZCCT02831

Randomly select a subset of samples for which to download taxonomy files¶

As more data is added to NMDC, downloading all of the results that meet the criteria for this notebook becomes a larger and slower task. Here we will randomly sample a set number of datasets from each horizon to use for the rest of the notebook to improve performance.

In [23]:
set.seed(413)
urls <- biosample_df_final %>%
  distinct(url, .keep_all = TRUE) %>%
  group_by(soil_horizon) %>%
  slice_sample(n = 15) %>%
  ungroup() %>%
  pull(url)

Iterate throught the TSVs to get the contig taxa information¶

Using the readr's read_tsv function, the TSV urls can be iterated over gathering the taxa information. The TSVs are converted into dataframes where they are manipulated to suit the data structure needed. The columns are given names and the taxa column is split into a proper list (instead of a string of items separated by a semicolon ;). The third element from the list of taxa is retrieved to get only the phylum level information of the taxa (or unknown to the highest available taxon). A grouping function is performed on the taxa column and the count() functionality is used to calculate the count for how many times each taxa occurs, which is then used to calculate the relative abundance of each taxa for each sample.

Any errors in requesting the TSV urls are collected as a dictionary, so we can either try to query them again, or look into why they were not able to be collected.

In [24]:
#urls <- unique(biosample_df_final$url)
results_list <- c()
error_dict <- list()

for (i in 1:length(urls)) {
    # if i a factor of 100, print the progress
    if (i %% 10 == 0) {
        print(paste('Processing', i, 'of', length(urls)))
    }
    url <- urls[i]
    tryCatch({
        contig_taxa_df <- read_tsv(url, col_names = FALSE, show_col_types = FALSE)
        colnames(contig_taxa_df) <- c('contig_id', 'taxa', 'initial_count')
        
        # Clean up the taxa column and deal with unknown taxa
        contig_taxa_df$taxa_new <- contig_taxa_df$taxa
        contig_taxa_df$taxa_new <- sapply(strsplit(contig_taxa_df$taxa_new, ';'), function(x) x[3])
        contig_taxa_df$taxa_new <- ifelse(
            is.na(contig_taxa_df$taxa_new), 
            paste('Unknown', sapply(strsplit(contig_taxa_df$taxa, ';'), function(x) x[2])), 
            contig_taxa_df$taxa_new)
        contig_taxa_df$taxa_new <- ifelse(
            contig_taxa_df$taxa_new == "Unknown NA", 
            paste('Unknown', sapply(strsplit(contig_taxa_df$taxa, ';'), function(x) x[1])), 
            contig_taxa_df$taxa_new)
        contig_taxa_df$taxa <- contig_taxa_df$taxa_new

        contig_taxa_df <- contig_taxa_df %>%
            group_by(taxa) %>%
            summarise(count = n()) %>%
            mutate(relative_abundance = count / sum(count))

        # Add the queried url to the dataframe for later joining
        contig_taxa_df$url <- url
        results_list[[i]] <- contig_taxa_df

    }, error = function(e) {
        error_dict[[i]] <- e
    })
}

# Combine results into single dataframe
contig_df <- bind_rows(results_list) 

head(contig_df)
[1] "Processing 10 of 30"
[1] "Processing 20 of 30"
[1] "Processing 30 of 30"
A tibble: 6 × 4
taxacountrelative_abundanceurl
<chr><int><dbl><chr>
Acidimicrobiia 450.0024050024https://data.microbiomedata.org/data/nmdc:omprc-11-qxq15411/nmdc:wfmgan-11-9v130b74.1/nmdc_wfmgan-11-9v130b74.1_scaffold_lineage.tsv
Acidithiobacillia 100.0005344450https://data.microbiomedata.org/data/nmdc:omprc-11-qxq15411/nmdc:wfmgan-11-9v130b74.1/nmdc_wfmgan-11-9v130b74.1_scaffold_lineage.tsv
Actinomycetes 30570.1633798300https://data.microbiomedata.org/data/nmdc:omprc-11-qxq15411/nmdc:wfmgan-11-9v130b74.1/nmdc_wfmgan-11-9v130b74.1_scaffold_lineage.tsv
Agaricomycetes 70.0003741115https://data.microbiomedata.org/data/nmdc:omprc-11-qxq15411/nmdc:wfmgan-11-9v130b74.1/nmdc_wfmgan-11-9v130b74.1_scaffold_lineage.tsv
Alphaproteobacteria18210.0973224307https://data.microbiomedata.org/data/nmdc:omprc-11-qxq15411/nmdc:wfmgan-11-9v130b74.1/nmdc_wfmgan-11-9v130b74.1_scaffold_lineage.tsv
Anaerolineae 230.0012292235https://data.microbiomedata.org/data/nmdc:omprc-11-qxq15411/nmdc:wfmgan-11-9v130b74.1/nmdc_wfmgan-11-9v130b74.1_scaffold_lineage.tsv

Clean up the relative abundance data to fill in NAs with 0 for unobserved taxa¶

In [25]:
# First merge to get the url for geo_loc_name and soil_horizon
biosample_taxa_df <- biosample_df_final %>%
    select(soil_horizon, geo_loc_name, url) %>%
    distinct() %>%
    right_join(contig_df, by = join_by(url))

# Then pivot the table to fill in the relative abundance as zero for un-observed taxa
biosample_taxa_df_wide <- biosample_taxa_df %>%
    pivot_wider(id_cols = c(url, soil_horizon, geo_loc_name),
        names_from = taxa, values_from = relative_abundance) %>%
    replace(is.na(.), 0)

# And unpivot the table to get the taxa relative abundance for each biosample
biosample_taxa_df <- biosample_taxa_df_wide %>%
    pivot_longer(cols = -c(url, soil_horizon, geo_loc_name), names_to = 'taxa', values_to = 'relative_abundance')

Plot the average taxa abundance for all M and O horizon soil samples¶

First calculate the average relative abundance for each taxa in each soil horizon. Next, we'll pull out the top ten taxa and lump all others into an "Other" category for plotting purposes using the forcats::fct_other function. Then we'll calculate the mean relative abundance of each taxa for each soil horizon. Finally, we'll choose an appropriate color palette for the plot, and plot the relative abundance of each taxa for each soil horizon at each location.

In [26]:
options(dplyr.summarise.inform = FALSE)
horizon_taxa <- biosample_taxa_df %>%
    group_by(soil_horizon, taxa) %>%
    summarise(mean_relative_abundance = mean(relative_abundance))%>%
    arrange(mean_relative_abundance) %>%
    mutate(taxa = factor(taxa, levels = rev(unique(taxa)))) %>%
    mutate(taxa_lump = forcats::fct_other(taxa, keep = levels(taxa)[1:15], other_level = 'Other')) 
           
# Make color palette that is 9 colors long, and followed with grey
color_pal <- c(RColorBrewer::brewer.pal(8, 'Set1'), RColorBrewer::brewer.pal(7, 'Set3'), 'grey')
g <- ggplot(horizon_taxa, aes(x = soil_horizon, y = mean_relative_abundance, fill = taxa_lump)) +
    geom_bar(stat = 'identity', color = NA) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(title = 'Taxa abundance of M and O horizon soil samples',
         x = 'Soil Horizon', y = 'Mean relative abundance', fill = NULL) +
    scale_fill_manual(values = color_pal) +
    theme_minimal() 
options(repr.plot.width = 7, repr.plot.height = 7, repr.plot.res = 250)
g
No description has been provided for this image

Plot the taxa abundance of M and O horizon soil samples for each location¶

First we'll pull out the top ten taxa and lump all others into an "Other" category for plotting purposes using the forcats::fct_other function. Then we'll calculate the mean relative abundance of each taxa for each soil horizon for each location. Finally, we'll plot the relative abundance of each taxa for each soil horizon at each location (using the same color palette as above).

In [27]:
geo_taxa <- biosample_taxa_df %>%
    group_by(geo_loc_name, soil_horizon, taxa) %>%
    summarise(mean_relative_abundance = mean(relative_abundance)) %>%
    arrange(mean_relative_abundance) %>%
    mutate(taxa = factor(taxa, levels = rev(unique(taxa)))) %>%
    mutate(taxa_lump = forcats::fct_other(taxa, keep = levels(taxa)[1:15], other_level = 'Other')) %>%
    mutate(soil_horizon = factor(soil_horizon, levels = c('M horizon', 'O horizon'), labels = c('M', 'O')))

g <- ggplot(geo_taxa, aes(x = soil_horizon, y = mean_relative_abundance, fill = taxa_lump)) +
    geom_bar(stat = 'identity', color = NA) +
    facet_wrap(~geo_loc_name, nrow = 1,labeller =  label_wrap_gen(width = 20, multi_line = TRUE)) +
    labs(title = 'Taxa abundance of M and O horizon soil samples for each location',
         x = 'Soil Horizon', y = 'Mean relative abundance', fill = NULL) +
    scale_fill_manual(values = color_pal) +
    theme_minimal()+
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          legend.position = "bottom") 
options(repr.plot.width = 9, repr.plot.height = 7, repr.plot.res = 250)
g
No description has been provided for this image