Bio-scales metadata biogeochemical exploration and visualization¶

This notebook uses metadata from the Bio-scales study in the NMDC Data Portal to look at how soil biogeochemical properties from a Botanical Garden ecosystem_subtype are related.

Import Python libraries¶

In [3]:
import requests
import pandas as pd
import plotly.express as px
import nmdc_api_utilities

Get the study ID for the Bio-Scales study¶

Using the nmdc_api_utilties we can retrieve the study ID for the Bio-Scales study. In this instance we use the StudySearch get_record_by_attribute function to look for studies with "Bio-Scales" in the title.

In [4]:
from nmdc_api_utilities.study_search import StudySearch
# Create a StudySearch object
ss_client = StudySearch(env=ENV)
# Get the study by title
study_json = ss_client.get_record_by_attribute(attribute_name="title", attribute_value="Bio-Scales")
# Get the study id from the study_json
study = study_json[0]["id"]
print(study)
nmdc:sty-11-r2h77870

Get the biosamples associated with the Bio-Scales study¶

Using the study ID obtained above, and BiosampleSearch module, the biosamples can be filtered by study ID using the associated_studies field. The information returned can be limited to the fields we are interested in. In this case, the fields are ph, magnesium, nitrite_nitrogen, etc. We pass in the parameter all_pages=True to tell the module to return all available results. Here, the total number of biosamples associated with the Bio-Scales study is printed, along with one of the biosamples with all the given fields.

In [5]:
from nmdc_api_utilities.biosample_search import BiosampleSearch
# create a BiosampleSearch object
bs_client = BiosampleSearch(env=ENV)
per_page = 2000 
fields = "ph,calcium,magnesium,potassium,tot_nitro,manganese,zinc,ammonium_nitrogen,nitrate_nitrogen,nitrite_nitrogen,ecosystem_subtype,habitat"
# get the biosample by associated study, setting all_pages to True to ensure we get all biosamples with the given study
biosample_json = bs_client.get_record_by_attribute(attribute_name="associated_studies", attribute_value=study, fields=fields, max_page_size=per_page, all_pages=True)
print(f"Total number of biosamples: {len(biosample_json)}")
# find biosample nmdc:bsm-11-08mamh62
for biosample in biosample_json:
    if biosample["id"] == "nmdc:bsm-11-08mamh62":
        print(biosample)
        break
Total number of biosamples: 416
{'id': 'nmdc:bsm-11-08mamh62', 'ecosystem_subtype': 'Botanical garden', 'habitat': 'Soil', 'ph': 6.64, 'calcium': {'has_raw_value': '1945.98 mg/kg', 'has_numeric_value': 1945.98, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}, 'magnesium': {'has_raw_value': '351.875 mg/kg', 'has_numeric_value': 351.875, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}, 'potassium': {'has_raw_value': '304.965 mg/kg', 'has_numeric_value': 304.965, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}, 'tot_nitro': {'has_raw_value': '0.242 Percent', 'has_numeric_value': 0.242, 'has_unit': '%', 'type': 'nmdc:QuantityValue'}, 'manganese': {'has_raw_value': '17.688 mg/kg', 'has_numeric_value': 17.688, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}, 'zinc': {'has_raw_value': '2.6576 mg/kg', 'has_numeric_value': 2.6576, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}, 'ammonium_nitrogen': {'has_raw_value': '1.8015 mg/kg', 'has_numeric_value': 1.8015, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}, 'nitrate_nitrogen': {'has_raw_value': '1.698 mg/kg', 'has_numeric_value': 1.698, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}, 'nitrite_nitrogen': {'has_raw_value': '0 mg/kg', 'has_numeric_value': 0.0, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}}

Explore and understand the results¶

Since all the biosamples retrieved do not contain all of the fields we are interested in (listed in the filt variable), creating a list of the fields to loop through allows us to count how many biosamples contain those fields. The number of biosamples for each field is printed.

In [6]:
# convert string of fields from request above to dictionary with values set to 0
fields_list = fields.split(',')
field_counts = {field: 0 for field in fields_list}

# Loop through the list of fields and the results to count the presence of each field
for field in field_counts.keys():
    for samp in biosample_json:
        if field in samp:
            field_counts[field] += 1
            
print(field_counts)
{'ph': 103, 'calcium': 103, 'magnesium': 103, 'potassium': 103, 'tot_nitro': 103, 'manganese': 103, 'zinc': 103, 'ammonium_nitrogen': 103, 'nitrate_nitrogen': 103, 'nitrite_nitrogen': 103, 'ecosystem_subtype': 416, 'habitat': 416}

Drop the rows missing biogeochemical fields¶

Using a list comprehension, we can create a list of filtered_results that return only the biosamples that include the complete biogeochemical metadata we are interested in. The number of filtered results is printed along with the first few results.

In [7]:
filtered_results = [biosamp for biosamp in biosample_json if all(field in biosamp for field in fields_list)]
print(f"Total results after filtering for all fields: {len(filtered_results)}")
print(filtered_results[2])
Total results after filtering for all fields: 103
{'id': 'nmdc:bsm-11-bdn1fa14', 'ecosystem_subtype': 'Botanical garden', 'habitat': 'Soil', 'ph': 6.23, 'calcium': {'has_raw_value': '2596 mg/kg', 'has_numeric_value': 2596.0, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}, 'magnesium': {'has_raw_value': '456.241 mg/kg', 'has_numeric_value': 456.241, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}, 'potassium': {'has_raw_value': '154.895 mg/kg', 'has_numeric_value': 154.895, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}, 'tot_nitro': {'has_raw_value': '0.172 Percent', 'has_numeric_value': 0.172, 'has_unit': '%', 'type': 'nmdc:QuantityValue'}, 'manganese': {'has_raw_value': '25.9704 mg/kg', 'has_numeric_value': 25.9704, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}, 'zinc': {'has_raw_value': '2.112 mg/kg', 'has_numeric_value': 2.112, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}, 'ammonium_nitrogen': {'has_raw_value': '4.844 mg/kg', 'has_numeric_value': 4.844, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}, 'nitrate_nitrogen': {'has_raw_value': '2.1 mg/kg', 'has_numeric_value': 2.1, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}, 'nitrite_nitrogen': {'has_raw_value': '0.4245 mg/kg', 'has_numeric_value': 0.4245, 'has_unit': 'mg/kg', 'type': 'nmdc:QuantityValue'}}

Transform results¶

The filtered_results are converted to a data frame while also transforming the field values into the desired data types. For example, all decimal values are converted to floats. Looking at the structure of the results above, we can see that most of the biogeochemical fields have a nested value structure that include has_raw_value, has_numeric_value and has_unit. Four fields do not include this nested structure: ecosystem_subtype, habitat, id, and ph. Separating the two structures into different lists will help clarify the desired data types. We can also extract out the units using the has_unit sub-field of some of the fields, which will come in handy when we create graphs. The resulting data frame is shown.

In [8]:
# Make list of all fields with "has_raw_value" sub-field and remove the rest
other_fields = ["ecosystem_subtype", "habitat", "id", "ph"]
raw_value_fields = list(filter(lambda field: field not in other_fields, fields_list))

# create dictionary of raw_value_fields with values as empty lists
units = {field: [] for field in raw_value_fields}

# Add filtered results to a new list of dictionaries with desired data types, extract out units into a separate dictionary 
df_inp = []
for biosamp in filtered_results:
    rec = {}
    for field in raw_value_fields:
        rec[field] = float(biosamp[field]["has_numeric_value"])
        units[field].append(biosamp[field]["has_unit"])
    for other_field in other_fields:
        if other_field == "ph":
            rec[other_field] = float(biosamp[other_field])
        else:
            rec[other_field] = biosamp[other_field]
    df_inp.append(rec)

# Convert list of results dictionaries to a data frame
df = pd.DataFrame(df_inp)
df 
Out[8]:
calcium magnesium potassium tot_nitro manganese zinc ammonium_nitrogen nitrate_nitrogen nitrite_nitrogen ecosystem_subtype habitat id ph
0 2774.35 578.148 168.200 0.598 48.0908 13.1545 14.5245 20.4010 0.0000 Botanical garden Soil nmdc:bsm-11-c9458s26 5.41
1 2511.03 502.777 315.919 0.131 26.6171 4.0774 2.6825 0.0000 0.0000 Botanical garden Soil nmdc:bsm-11-r7rgv593 6.95
2 2596.00 456.241 154.895 0.172 25.9704 2.1120 4.8440 2.1000 0.4245 Botanical garden Soil nmdc:bsm-11-bdn1fa14 6.23
3 1841.93 331.500 113.561 0.470 34.8639 5.2868 9.5715 12.8740 0.0000 Botanical garden Soil nmdc:bsm-11-ftte8s50 5.14
4 2320.83 458.940 489.767 0.326 20.1285 2.6380 2.3290 2.8040 0.0000 Botanical garden Soil nmdc:bsm-11-01teww33 6.81
... ... ... ... ... ... ... ... ... ... ... ... ... ...
98 1846.09 291.535 168.937 0.513 23.2667 5.1776 16.3295 14.3950 0.0000 Botanical garden Soil nmdc:bsm-11-nk3r8t63 4.84
99 2449.89 450.242 255.944 0.374 26.1187 4.4023 3.9165 2.0535 0.0000 Botanical garden Soil nmdc:bsm-11-2mb94m91 6.37
100 2059.81 356.661 192.340 0.619 26.1871 4.9097 14.0990 5.7905 0.0000 Botanical garden Soil nmdc:bsm-11-9k1chb38 5.03
101 1764.86 312.300 222.738 0.250 14.6228 3.4332 2.1955 0.8360 0.0000 Botanical garden Soil nmdc:bsm-11-ewmanm69 6.48
102 2032.06 305.976 268.098 0.417 43.5453 4.2006 10.4270 10.2510 0.0000 Botanical garden Soil nmdc:bsm-11-ygtdv867 5.00

103 rows × 13 columns

Understand the units¶

Using Python's set() built-in function, we can iterate through the units list and remove redundant units to understand which units are used for each field, ultimately creating a dictionary, printed below

In [9]:
units = {field: set(unit_list) for field, unit_list in units.items()}
print(f"Units for each applicable measurement: {units}")
Units for each applicable measurement: {'calcium': {'mg/kg'}, 'magnesium': {'mg/kg'}, 'potassium': {'mg/kg'}, 'tot_nitro': {'%'}, 'manganese': {'mg/kg'}, 'zinc': {'mg/kg'}, 'ammonium_nitrogen': {'mg/kg'}, 'nitrate_nitrogen': {'mg/kg'}, 'nitrite_nitrogen': {'mg/kg'}}

Visualize potassium and pH¶

Using Python's Plotly library, a scatter plot of how potassium changes with ph for the biosamples is created. Setting the trendline equal to ols, a linear regression is added.

In [10]:
# Look at potassium vs. ph
fig = px.scatter(df, x="potassium", y="ph", trendline = "ols")
fig.show()
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/kaleido/_sync_server.py:11: UserWarning:



Warning: You have Plotly version 5.18.0, which is not compatible with this version of Kaleido (1.2.0).

This means that static image generation (e.g. `fig.write_image()`) will not work.

Please upgrade Plotly to version 6.1.1 or greater, or downgrade Kaleido to version 0.2.1.


Scatter plot of ammonium nitrogen vs. nitrate nitrogen¶

A scatter plot using the same methods above is used again, but to display how ammonium_nitrogen changes with nitrate_nitrogen. Setting log_x to True displays the results on a logarithmic scale.

In [11]:
fig = px.scatter(df, x="ammonium_nitrogen", y="nitrate_nitrogen", trendline = "ols", log_x=True)
fig.show()

Scatter matrix of all the biogeochemicals¶

A scatter matrix can be used to visuzlize all of the biogeochemical fields relationship to each other at once. To fit the labels nicely on the graph, the element's symbol is used. We also concatenate the units to the corresponding element onto the labels.

In [12]:
# Add shortened names of elements and applicable units to a label_mapping dictionary
label_mapping = {"calcium": "Ca", "magnesium": "Mg", "manganese": "Mn", "zinc": "Zn", "potassium": "K"}
for elem, unit in units.items():
    if elem in label_mapping:
        label_mapping[elem] = label_mapping[elem] + " " + (str(unit)).replace("{'","(").replace("'}", ")")


fig = px.scatter_matrix(df,
    dimensions = ["calcium", "magnesium", "manganese", "calcium", "zinc", "potassium"],
    title = "Scatter matrix of Bioscales' biogeochemicals",
    labels = label_mapping,
    color = "ecosystem_subtype",
    width = 800,
    height = 800)
fig.update_traces(diagonal_visible=False)
fig.show()