Bio-scales metadata biogeochemical exploration and visualization¶
This notebook uses metadata from the Bio-scales study in the NMDC Data Portal to look at how soil biogeochemical properties from a Botanical Garden ecosystem_subtype are related.
Get study IDs associated with Bio-Scales sites using API¶
Using the Python requests library and NMDC studies find endpoint, we can retrieve the study ID for the Bio-Scales study. In this instance we use a .search in the filter parameter to look for studies with "Bio-Scales" in the title. More information regarding the API can be found here
base_url = "https://api.microbiomedata.org"
url = paste0(base_url, "/studies?filter=title.search:Bio-Scales")
response = fromJSON(url)
study_ids = response[["results"]][["id"]]
print(study_ids)
[1] "nmdc:sty-11-r2h77870"
Using the study ids, pull out bio sample IDs¶
Note that we are pulling 2000 records at a time until we have retrieved all biosamples for the three study ids above, place the data retrieved for each bio sample into a tibble.
study_id = study_ids[1]
# Prepare a tibble to hold results
dat_all = tibble()
# Set up query
per_page = 2000 # number of records to retrieve per page
filt = paste0("associated_studies:", study_id) # filter to only get biosamples from the study of interest
get_more = TRUE # flag to indicate whether we need to get more records
page = 1 # page number to retrieve
while (get_more){
# construct the url for the query
url = paste0(
base_url,
"/biosamples?filter=",
filt,
"&per_page=",
per_page,
"&page=",
page)
# get the data
data = fromJSON(url)
data_results = data[['results']] %>% as.data.frame()
# add the data to the tibble
dat_all = bind_rows(dat_all, data_results)
# check if we need to get more records
if (nrow(dat_all) < data[['meta']]['count']){
page = page +1
} else { get_more = FALSE}
}
glimpse(dat_all)
Rows: 416 Columns: 43 $ id <chr> "nmdc:bsm-11-6zd5nb38", "nmdc:bsm-11-ahpvvb… $ name <chr> "Rhizosphere soil microbial communities fro… $ description <chr> "Rhizosphere soil microbial communities fro… $ env_broad_scale <df[,3]> <data.frame[26 x 3]> $ env_local_scale <df[,3]> <data.frame[26 x 3]> $ env_medium <df[,3]> <data.frame[26 x 3]> $ host_taxid <df[,3]> <data.frame[26 x 3]> $ collected_from <chr> "nmdc:frsite-11-xckqtm38", "nmdc:frsite-… $ type <chr> "nmdc:Biosample", "nmdc:Biosample", "nmd… $ img_identifiers <list> "img.taxon:3300046665", "img.taxon:33000… $ samp_name <chr> "BESC-904-Co3_16_51 rhizosphere", "GW-95… $ gold_biosample_identifiers <list> "gold:Gb0291773", "gold:Gb0291786", "gold:G… $ elev <dbl> 62, 1, 62, 62, 62, 1, 62, 62, 1, 62, 62, 62… $ geo_loc_name <df[,2]> <data.frame[26 x 2]> $ lat_lon <df[,4]> <data.frame[26 x 4]> $ samp_taxon_id <df[,3]> <data.frame[26 x 3]> $ ecosystem <chr> "Host-associated", "Host-associated", "Hos… $ ecosystem_category <chr> "Plants", "Plants", "Plants", "Plants", "Pl… $ ecosystem_type <chr> "Roots", "Roots", "Roots", "Roots", "Roots… $ ecosystem_subtype <chr> "Rhizosphere", "Rhizosphere", "Rhizosphere"… $ specific_ecosystem <chr> "Soil", "Soil", "Soil", "Soil", "Soil", … $ add_date <chr> "2021-05-03T00:00:00", "2021-05-03T00:00… $ habitat <chr> "Rhizosphere soil", "Rhizosphere soil", … $ host_name <chr> "Populus", "Populus", "Populus", "Populus",… $ location <chr> "USA", "USA", "USA", "USA", "USA", "USA", "… $ mod_date <chr> "2021-05-03T00:00:00", "2021-05-03T00:00:00… $ ncbi_taxonomy_name <chr> "rhizosphere metagenome", "rhizosphere meta… $ sample_collection_site <chr> "Rhizosphere Soil", "Rhizosphere Soil", "Rh… $ collection_date <df[,2]> <data.frame[26 x 2]> $ associated_studies <list> "nmdc:sty-11-r2h77870", "nmdc:sty-11-r2h778… $ depth <df[,4]> <data.frame[26 x 4]> $ ph <dbl> NA, NA, NA, NA, NA, 5.41, 6.95, 6.23, 5.14,… $ calcium <df[,4]> <data.frame[26 x 4]> $ magnesium <df[,4]> <data.frame[26 x 4]> $ potassium <df[,4]> <data.frame[26 x 4]> $ tot_nitro <df[,4]> <data.frame[26 x 4]> $ lbc_thirty <df[,4]> <data.frame[26 x 4]> $ lbceq <df[,4]> <data.frame[26 x 4]> $ manganese <df[,4]> <data.frame[26 x 4]> $ zinc <df[,4]> <data.frame[26 x 4]> $ ammonium_nitrogen <df[,4]> <data.frame[26 x 4]> $ nitrate_nitrogen <df[,4]> <data.frame[26 x 4]> $ nitrite_nitrogen <df[,4]> <data.frame[26 x 4]>
Clean up results¶
Pull out biosample_id and associated chemical metadata; unnest as needed
df <- dat_all %>%
select(
id, ecosystem_subtype,
calcium, magnesium, potassium, nitrate_nitrogen, manganese, zinc
) %>%
unnest(
cols = c(
ecosystem_subtype,
calcium, magnesium, potassium, nitrate_nitrogen, manganese, zinc
), names_sep = "_") %>%
select(id, ecosystem_subtype,
contains("has_numeric_value"),
contains("has_unit"))
glimpse(df)
Rows: 416 Columns: 14 $ id <chr> "nmdc:bsm-11-6zd5nb38", "nmdc:bsm-1… $ ecosystem_subtype <chr> "Rhizosphere", "Rhizosphere", "Rhiz… $ calcium_has_numeric_value <dbl> NA, NA, NA, NA, NA, 2774.35, 2511.0… $ magnesium_has_numeric_value <dbl> NA, NA, NA, NA, NA, 578.148, 502.77… $ potassium_has_numeric_value <dbl> NA, NA, NA, NA, NA, 168.200, 315.91… $ nitrate_nitrogen_has_numeric_value <dbl> NA, NA, NA, NA, NA, 20.4010, 0.0000… $ manganese_has_numeric_value <dbl> NA, NA, NA, NA, NA, 48.0908, 26.617… $ zinc_has_numeric_value <dbl> NA, NA, NA, NA, NA, 13.1545, 4.0774… $ calcium_has_unit <chr> NA, NA, NA, NA, NA, "mg/kg", "mg/kg… $ magnesium_has_unit <chr> NA, NA, NA, NA, NA, "mg/kg", "mg/kg… $ potassium_has_unit <chr> NA, NA, NA, NA, NA, "mg/kg", "mg/kg… $ nitrate_nitrogen_has_unit <chr> NA, NA, NA, NA, NA, "mg/kg", "mg/kg… $ manganese_has_unit <chr> NA, NA, NA, NA, NA, "mg/kg", "mg/kg… $ zinc_has_unit <chr> NA, NA, NA, NA, NA, "mg/kg", "mg/kg…
Unit check¶
Check that all units are the same before dropping from dataframe
# Check that all units are the same before dropping from dataframe
unit_check <- df %>%
select(contains("has_unit")) %>%
distinct()
glimpse(unit_check)
Rows: 2 Columns: 6 $ calcium_has_unit <chr> NA, "mg/kg" $ magnesium_has_unit <chr> NA, "mg/kg" $ potassium_has_unit <chr> NA, "mg/kg" $ nitrate_nitrogen_has_unit <chr> NA, "mg/kg" $ manganese_has_unit <chr> NA, "mg/kg" $ zinc_has_unit <chr> NA, "mg/kg"
More dataframe cleaning¶
Since all units are the same, drop from dataframe; rename columns for easier reading and plotting
df <- df %>%
select(-contains("has_unit")) %>%
rename(
calcium = calcium_has_numeric_value,
magnesium = magnesium_has_numeric_value,
potassium = potassium_has_numeric_value,
nitrate = nitrate_nitrogen_has_numeric_value,
manganese = manganese_has_numeric_value,
zinc = zinc_has_numeric_value
)
glimpse(df)
Rows: 416 Columns: 8 $ id <chr> "nmdc:bsm-11-6zd5nb38", "nmdc:bsm-11-ahpvvb55", "nmd… $ ecosystem_subtype <chr> "Rhizosphere", "Rhizosphere", "Rhizosphere", "Rhizos… $ calcium <dbl> NA, NA, NA, NA, NA, 2774.35, 2511.03, 2596.00, 1841.… $ magnesium <dbl> NA, NA, NA, NA, NA, 578.148, 502.777, 456.241, 331.5… $ potassium <dbl> NA, NA, NA, NA, NA, 168.200, 315.919, 154.895, 113.5… $ nitrate <dbl> NA, NA, NA, NA, NA, 20.4010, 0.0000, 2.1000, 12.8740… $ manganese <dbl> NA, NA, NA, NA, NA, 48.0908, 26.6171, 25.9704, 34.86… $ zinc <dbl> NA, NA, NA, NA, NA, 13.1545, 4.0774, 2.1120, 5.2868,…
Plot chemical data in a correlation matrix¶
Create paired correlation matrix using GGally package's ggpairs function
# Drop rows with NAs before plotting to avoid NA warning
df_complete <- na.omit(df)
g <- ggpairs(df_complete,
columns = c(3:7),
title = "Correlation Matrix of Chemicals in Bio-Scales Data",
lower = list(continuous = wrap("points", alpha = 0.5, size = 0.7)),
upper = list(continuous = wrap("cor", size = 3))) +
theme_bw() +
labs(
x = "Chemical Concentration (mg/kg)",
y = "Chemical Concentration (mg/kg)"
)
options(repr.plot.width = 7, repr.plot.height = 7, repr.plot.res = 250)
g