Data exploration and visualization for the NEON soil metagenomes study (DP1.10107.001)¶
This notebook explores how soil pH changes over time at the 47 sites that are part of the NEON study. It also examines how the average water_content changes by season at each site.
import requests
import json
import pandas as pd
import folium
import altair as alt
from altair import Chart
from datetime import datetime
import nmdc_api_utilities

## The search clients below take an environment name; "prod" (the public
## NMDC API) is assumed here, since ENV is used below but never otherwise set
ENV = "prod"
Get the Study ID for the NEON soil metagenome DP1.10107.001 project¶
Using the Python package 'nmdc_api_utilities', we can get the NMDC study ID based on the project name. More information regarding the nmdc_api_utilities package can be found here. We first create a client able to search across studies (using the "study_set" collection). Other collections of interest include the biosamples ("biosample_set" collection) and instruments ("instrument_set"). Then, we search for a study whose name includes our keyword of interest (DP1.10107.001), and we extract its study id.
keyword = "DP1.10107.001"

## Import and instantiate a client that is able to search studies
from nmdc_api_utilities.study_search import StudySearch

study_search_client = StudySearch(env=ENV)

## We know the name of the study contains "DP1.10107.001" plus other characters,
## so we match the keyword against the full name with a "$regex" filter.
## The "get_records" function lets us apply such a filter using MongoDB-style query syntax:
# query_results = study_search_client.get_records(filter=f'{{"name": {{"$regex":"{keyword}"}}}}')
## But we can also use "get_record_by_attribute", which likewise searches with a regular expression
print(f"Searching the collection of studies, looking for a study including {keyword} in its name")
query_results = study_search_client.get_record_by_attribute("name", keyword)

## We expect a single study to be returned, but verify that the result list indeed holds exactly one
n_results = len(query_results)
print(f"We got {n_results} study(ies) matching our query")

## Select our study and store its id in study_id
study = query_results[0]
study_id = study["id"]
print(f"The corresponding study_id is {study_id}")
Searching the collection of studies, looking for a study including DP1.10107.001 in its name
We got 1 study(ies) matching our query
The corresponding study_id is nmdc:sty-11-34xj1150
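For reference, the same lookup can be made against the NMDC runtime API directly with the requests and json libraries imported above. A minimal sketch, assuming the public endpoint at api.microbiomedata.org and its nmdcschema/study_set route:

# Hedged sketch of the equivalent direct REST query (assumed public endpoint)
url = "https://api.microbiomedata.org/nmdcschema/study_set"
params = {"filter": json.dumps({"name": {"$regex": keyword}})}
response = requests.get(url, params=params)
response.raise_for_status()
matches = response.json()["resources"]  # list of matching study records
print(matches[0]["id"])                 # expected to print nmdc:sty-11-34xj1150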
Get all the biosamples associated with the NEON soil metagenomes study by the NMDC study ID¶
We can use the NMDC study ID obtained above to search the biosample collection for every biosample that lists this study in its associated_studies field. (The NMDC biosamples find endpoint offers equivalent filtering in the data portal, with cursor pagination, at most 2,000 results per_page, and the option to limit results to fields of interest such as ph or water_content.) Here the total number of biosamples associated with the NEON study is printed, and a list of all the results (all_results) is created.
## As for the study search, we instantiate a client to search biosamples,
## then look for all samples carrying the corresponding study_id
from nmdc_api_utilities.biosample_search import BiosampleSearch

sample_search_client = BiosampleSearch(env=ENV)
print(f"Searching the collection of biosamples, looking for biosamples associated with study {study_id}")

## We pass "max_page_size=500" and "all_pages=True" to page through all results
## in a timely manner (the default would return only the first 20).
## This is a relatively big query, so it may take ~1 minute to return.
all_results = sample_search_client.get_record_by_attribute(
    "associated_studies", study_id, max_page_size=500, all_pages=True, exact_match=True
)
print(f"We found {len(all_results)} samples linked to study {study_id}")
## The return value is a list of biosample records (dictionaries), which we explore below
Searching the collection of biosamples, looking for biosamples associated with study nmdc:sty-11-34xj1150
We found 4475 samples linked to study nmdc:sty-11-34xj1150
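The client presumably pages through results until the API stops returning a continuation token. A hand-rolled sketch of that cursor pagination, against the same assumed nmdcschema route:

# Hedged sketch: manual cursor pagination (what all_pages=True automates for us)
url = "https://api.microbiomedata.org/nmdcschema/biosample_set"
params = {"filter": json.dumps({"associated_studies": study_id}), "max_page_size": 500}
records = []
while True:
    page = requests.get(url, params=params).json()
    records.extend(page["resources"])
    token = page.get("next_page_token")
    if token is None:  # no token means this was the last page
        break
    params["page_token"] = token
print(f"Fetched {len(records)} biosample records")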
Geo-location and collection date exploration¶
Using the returned results, we can explore the number of locations, or sites, in the data using the geo_loc_name field (a slot, in NMDC schema terms) and the lat_lon field, as well as the number of collection dates using the collection_date field. Here, we print the number of distinct coordinates, geo locations, and dates.
# Find the total number of NEON coordinates, the total number of geo locations
# (equivalent to NEON sites), and the total number of collection dates
coordinates = set()
geo_locs = set()
dates = set()
all_full_results = []
for samp in all_results:
    # Keep only samples that carry the primary metadata we need
    if set(["lat_lon", "geo_loc_name", "collection_date"]).issubset(samp):
        lat_lon = f"{samp['lat_lon']['latitude']},{samp['lat_lon']['longitude']}"
        coordinates.add(lat_lon)
        geo_locs.add(samp["geo_loc_name"]["has_raw_value"])
        dates.add(samp["collection_date"]["has_raw_value"])
        all_full_results.append(samp)
print(f"Number of NEON coordinates: {len(coordinates)}")
print(f"Number of geo locations (sites): {len(geo_locs)}")
print(f"Number of dates: {len(dates)}")
print(f"Total number of biosamples with sufficient metadata: {len(all_full_results)}")
Number of NEON coordinates: 495
Number of geo locations (sites): 47
Number of dates: 4242
Total number of biosamples with sufficient metadata: 4475
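Since the raw collection_date values are ISO-8601 strings, they sort chronologically, so a lexicographic min/max over the dates set shows the sampling window. A quick sketch, assuming all raw values share the same timestamp format:

# ISO-8601 strings sort chronologically, so min()/max() give the date range
# (assumes all raw values share the same timestamp format)
print(f"Sampling window: {min(dates)} to {max(dates)}")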
Print a returned result¶
Let's print the first returned result to see what the requested data looks like.
print(all_full_results[0])
{'analysis_type': ['metagenomics'], 'carb_nitro_ratio': {'has_numeric_value': 25.4, 'type': 'nmdc:QuantityValue', 'has_unit': '1'}, 'collection_date': {'has_raw_value': '2016-07-26T01:30Z', 'type': 'nmdc:TimestampValue'}, 'depth': {'has_maximum_numeric_value': 0.325, 'has_minimum_numeric_value': 0, 'has_unit': 'm', 'type': 'nmdc:QuantityValue'}, 'elev': 677.6, 'env_broad_scale': {'term': {'id': 'ENVO:00000446', 'name': 'terrestrial biome', 'type': 'nmdc:OntologyClass'}, 'type': 'nmdc:ControlledIdentifiedTermValue'}, 'env_local_scale': {'term': {'id': 'ENVO:01000861', 'name': 'area of dwarf scrub', 'type': 'nmdc:OntologyClass'}, 'type': 'nmdc:ControlledIdentifiedTermValue'}, 'env_medium': {'term': {'id': 'ENVO:00001998', 'name': 'soil', 'type': 'nmdc:OntologyClass'}, 'type': 'nmdc:ControlledIdentifiedTermValue'}, 'env_package': {'has_raw_value': 'soil', 'type': 'nmdc:TextValue'}, 'id': 'nmdc:bsm-11-002vgm56', 'name': 'HEAL_048-O-6.5-19.5-20160725', 'nitro': {'has_numeric_value': 1.02, 'has_unit': '%', 'type': 'nmdc:QuantityValue'}, 'org_carb': {'has_numeric_value': 25.94, 'has_unit': '%', 'type': 'nmdc:QuantityValue'}, 'ph': 6.04, 'samp_collec_device': 'corer', 'soil_horizon': 'O horizon', 'temp': {'has_numeric_value': 6.6, 'has_unit': 'Cel', 'type': 'nmdc:QuantityValue'}, 'type': 'nmdc:Biosample', 'water_content': ['2.667 g of water/g of dry soil'], 'geo_loc_name': {'has_raw_value': 'USA: Alaska, Healy', 'type': 'nmdc:TextValue'}, 'biosample_categories': ['NEON'], 'lat_lon': {'latitude': 63.875088, 'longitude': -149.210438, 'type': 'nmdc:GeolocationValue'}, 'gold_biosample_identifiers': ['gold:Gb0255739'], 'ecosystem': 'Environmental', 'ecosystem_category': 'Terrestrial', 'ecosystem_type': 'Soil', 'ecosystem_subtype': 'Unclassified', 'associated_studies': ['nmdc:sty-11-34xj1150'], 'specific_ecosystem': 'Unclassified', 'img_identifiers': ['img.taxon:3300039974']}
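Each record is a nested dictionary of slots such as lat_lon and collection_date. As an alternative to the manual loop in the next section, pandas.json_normalize can flatten such records into dotted column names; a minimal sketch:

# Hedged alternative: flatten the nested records with pandas.
# Nested keys become dotted column names, e.g. "lat_lon.latitude".
flat = pd.json_normalize(all_full_results)
flat[["id", "collection_date.has_raw_value", "lat_lon.latitude", "lat_lon.longitude"]].head()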
Convert results to a data frame and transform into desired format¶
We can use the pandas Python library to convert the requested results into a data frame. Looping through the results, we update the data types of the fields (e.g., converting collection_date to a date with the datetime library and dropping the time of day) and build a dictionary whose keys are the columns and whose values come from the record being processed, appending each dictionary to an initially empty list (df_inp). This allows for easy conversion to a data frame.
# Transform the results into the desired format and convert them to a data frame
df_inp = []
water_content_units = "g of water/g of dry soil"
for biosamp in all_full_results:
    # Keep only month, day, and year from collection_date (drop the time of day)
    date = datetime.strptime(biosamp["collection_date"]["has_raw_value"], "%Y-%m-%dT%H:%MZ")
    date = date.strftime("%Y-%m-%d")
    # Strip the units out of water_content and convert it to a float
    if "water_content" in biosamp:
        water_content = float("".join(biosamp["water_content"]).replace(water_content_units, ""))
    else:
        # Without this, a missing value would silently reuse the previous sample's reading
        water_content = None
    # Convert ph to a float
    if "ph" in biosamp:
        ph = float(biosamp["ph"])
    else:
        ph = None
    rec = {"id": biosamp["id"],
           "collection_date": date,
           "soil_horizon": biosamp["soil_horizon"],
           "water_content": water_content,
           "ph": ph,
           "elev": float(biosamp["elev"]),
           "location": biosamp["geo_loc_name"]["has_raw_value"],
           "latitude": biosamp["lat_lon"]["latitude"],
           "longitude": biosamp["lat_lon"]["longitude"]}
    df_inp.append(rec)
df = pd.DataFrame(df_inp)
df
| | id | collection_date | soil_horizon | water_content | ph | elev | location | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:bsm-11-002vgm56 | 2016-07-26 | O horizon | 2.667 | 6.04 | 677.6 | USA: Alaska, Healy | 63.875088 | -149.210438 |
| 1 | nmdc:bsm-11-00dkyf35 | 2019-03-13 | M horizon | 0.113 | 6.65 | 381.8 | USA: California, San Joaquin Experimental Range | 37.110011 | -119.735218 |
| 2 | nmdc:bsm-11-00hrxp98 | 2016-08-03 | O horizon | 0.992 | 3.90 | 199.7 | USA: Massachusetts, Harvard Forest & Quabbin W... | 42.427091 | -72.229737 |
| 3 | nmdc:bsm-11-00m15h97 | 2020-06-23 | M horizon | 0.032 | 7.07 | 1649.3 | USA: Colorado, Central Plains Experimental Range | 40.818371 | -104.746715 |
| 4 | nmdc:bsm-11-00yhef97 | 2016-07-26 | M horizon | 0.032 | 6.47 | 44.8 | USA: Georgia, The Jones Center At Ichauway | 31.189774 | -84.465861 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4470 | nmdc:bsm-11-zy2p7j24 | 2018-07-17 | M horizon | 0.302 | 7.01 | 405.7 | USA: Kansas, Konza Prairie Biological Station | 39.103068 | -96.563925 |
| 4471 | nmdc:bsm-11-zyh2rm11 | 2020-07-28 | M horizon | 0.399 | 6.24 | 405.4 | USA: Kansas, Konza Prairie Biological Station | 39.102192 | -96.561180 |
| 4472 | nmdc:bsm-11-zyhk8g66 | 2017-06-21 | M horizon | 0.306 | 7.15 | 272.5 | USA: Tennessee, Oak Ridge | 35.957646 | -84.261838 |
| 4473 | nmdc:bsm-11-zzdpcm17 | 2020-06-03 | M horizon | 0.283 | 5.07 | 260.1 | USA: Tennessee, Oak Ridge | 35.965875 | -84.230646 |
| 4474 | nmdc:bsm-11-zzrx6210 | 2018-06-20 | M horizon | 0.082 | 5.30 | 120.8 | USA: Alabama, Talladega National Forest | 32.893534 | -87.415333 |
4475 rows × 9 columns
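Before plotting, a couple of quick checks confirm that the parsing worked: the numeric columns should be floats, and samples lacking a measurement should show up as missing values. A small sketch:

# Sanity checks on the assembled data frame
print(df.dtypes)
print(df[["ph", "water_content"]].isna().sum())  # samples missing a measurement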
Find min/max and mid coordinates of the latitude and longitude coordinates¶
In order to create a map of the NEON sites, we need the minimum and maximum latitude and longitude coordinates. We calculate the midpoint by defining a function called find_square_midpoint and calling it with the min/max coordinates; the midpoint will serve as the center of the map.
# Find the middle coordinates to inform the map center
min_lat = df["latitude"].min()
max_lat = df["latitude"].max()
min_lon = df["longitude"].min()
max_lon = df["longitude"].max()

def find_square_midpoint(min_lat, max_lat, min_lon, max_lon):
    # Midpoint latitude is a simple average
    mid_lat = (min_lat + max_lat) / 2
    # Midpoint longitude is an average too, unless the span wraps around the globe
    if abs(max_lon - min_lon) <= 180:
        mid_lon = (min_lon + max_lon) / 2
    else:
        # If the span crosses the 180-degree meridian, wrap the midpoint longitude
        mid_lon = ((max_lon + min_lon + 360) % 360) / 2 - 180
    return int(round(mid_lat, 0)), int(round(mid_lon, 0))

mid_coords = find_square_midpoint(min_lat, max_lat, min_lon, max_lon)
print(f"min latitude: {min_lat}")
print(f"max latitude: {max_lat}")
print(f"min longitude: {min_lon}")
print(f"max longitude: {max_lon}")
print(f"midpoint coordinates: {mid_coords}")
min latitude: 17.963818
max latitude: 71.285604
min longitude: -156.648849
max longitude: -66.824633
midpoint coordinates: (45, -112)
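As a quick check of the wrap-around branch, hypothetical coordinates straddling the 180° meridian should give a midpoint longitude on the antimeridian rather than near 0°:

# Hypothetical points at 170°E and 170°W (not NEON data): the naive average
# would be 0°, but the wrap-around branch lands on the antimeridian instead
print(find_square_midpoint(10, 20, 170, -170))  # -> (15, -180)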
Create an interactive map of the NEON sites with scatter plots of ph vs. time¶
Using the folium Python library to create a map and the altair library to create charts of pH vs. time, we can build a clickable map of the NEON sites whose markers pop up the pH visualizations. We loop through the NEON sites by grouping our data frame by location and creating a chart for each location within the loop. Note that the mean latitude and longitude of each site's samples are used as the map marker coordinates.
## There is a warning coming from one of the sub-packages; silence it for clarity
import warnings
warnings.filterwarnings("ignore")

# Create a map of the NEON sites
m = folium.Map(
    tiles="https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}",
    attr="Tiles © Esri — Source: Esri, i-cubed, USDA, USGS, AEX, GeoEye, Getmapping, Aerogrid, IGN, IGP, UPR-EGP, and the GIS User Community",
    location=mid_coords,
    zoom_start=3,
    control_scale=True)

# Group the data frame by site (i.e. location)
grouped = df.groupby("location")
result_dfs = {}
for name, group_df in grouped:
    result_dfs[name] = group_df.reset_index(drop=True)

# Add a marker per location name (site), placed at the site's mean coordinates
for name, site_df in result_dfs.items():
    mean_lat = site_df["latitude"].mean()
    mean_lon = site_df["longitude"].mean()
    # Create a scatter plot of pH vs. time, colored by soil horizon
    # (":T" marks collection_date as temporal so altair draws a proper time axis)
    scatter = Chart(site_df).mark_circle().encode(x="collection_date:T", y="ph", color="soil_horizon:N")
    chart = scatter.properties(width=600, height=400, title=f"{name}: Change in soil pH over time")
    # Attach the chart as a popup on the site's map marker
    vega_lite = folium.VegaLite(chart, width="100%", height="100%")
    marker = folium.Marker(
        location=[mean_lat, mean_lon],
        tooltip="click me!")
    popup = folium.Popup()
    vega_lite.add_to(popup)
    popup.add_to(marker)
    marker.add_to(m)
m
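Outside a notebook the map object will not render inline; Folium can instead write it to a standalone HTML file (hypothetical filename):

# Save the interactive map as a self-contained HTML page (hypothetical name)
m.save("neon_soil_ph_map.html")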
Bar chart of the average water_content of each site and the season¶
To look at how the water_content of each NEON site changes with the seasons, we add a month column to the data frame and define a function that derives a season column from the month number. Finally, using the altair library, we aggregate the water_content of each location with the mean.
## There is a warning coming from one of the sub-packages; silence it for clarity
import warnings
warnings.filterwarnings("ignore")

# How does the average water_content change with the seasons per site?
df["collection_date"] = pd.to_datetime(df["collection_date"])
df["month"] = df["collection_date"].dt.month

def get_season(month):
    if month in [12, 1, 2]:
        return "Winter"
    elif month in [3, 4, 5]:
        return "Spring"
    elif month in [6, 7, 8]:
        return "Summer"
    else:
        return "Fall"

df["season"] = df["month"].apply(get_season)

bar = Chart(df).mark_bar().encode(
    x="location",
    y=alt.Y("mean(water_content)", title=f"mean water_content in {water_content_units}"),
    color="season",
    xOffset="season")
# mark_errorbar computes its extent (standard error by default) from the raw
# field, so we encode water_content itself rather than its mean
error_bar = Chart(df).mark_errorbar().encode(
    x="location",
    xOffset="season",
    y=alt.Y("water_content", title=f"mean water_content in {water_content_units}"))
chart = bar.properties(width=1000, height=400, title="Change in average water_content of NEON sites by season") + error_bar
chart
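To cross-check the chart, the same aggregation can be computed directly as a table (a small sketch):

# The per-site, per-season means behind the bars, as a plain table
seasonal_means = df.groupby(["location", "season"])["water_content"].mean().reset_index()
seasonal_means.head()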