Identifying relationships between annotated omics data in NMDCΒΆ
This notebook is an example of how different omics data types may be linked via commonly used annotation vocabularies and investigated together. In this notebook we explore biomolecules and KEGG pathways identified in a set of samples that have processed metagenomics, metaproteomics, and metabolomics data available in the NMDC Data Portal.
import requests
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import seaborn as sns
import numpy as np
import sys
from io import StringIO
from Bio.KEGG import REST
from pycirclize import Circos
import time
import importlib.util
import nmdc_api_utilities
Identifying relationships between annotated omics data in NMDCΒΆ
This notebook is an example of how different omics data types may be linked via commonly used annotation vocabularies and investigated together. In this notebook we explore biomolecules and KEGG pathways identified in a set of samples that have processed metagenomics, metaproteomics, and metabolomics data available in the NMDC Data Portal.
1. Retrieve data from the NMDC database using API endpointsΒΆ
Choose data to retrieveΒΆ
The NMDC data portal (https://data.microbiomedata.org/) allows us to filter data and samples according to many criteria. In this case, we use the Data Type filters (the upset plot under the interactive map widget) to identify samples that have metagenomics, metaproteomics, and metabolomics data. This returns samples from the study "Riverbed sediment microbial communities from the Columbia River, Washington, USA" (https://data.microbiomedata.org/details/study/nmdc:sty-11-aygzgv51).
Retrieve and filter data for Columbia River sediment studyΒΆ
The study page linked above has the NMDC study identifier in the URL: nmdc:sty-11-aygzgv51. We will use the nmdc_api_utilities package to access the data_objects/study endpoint from the NMDC Runtime API to retrieve all records that represent data. This includes URLs for downloading raw data files (e.g. FASTQ or mass spectra files) as well as processed data results output by the NMDC workflows.
pd.set_option("display.max_rows", 6)
from nmdc_api_utilities.data_object_search import DataObjectSearch
do_client = DataObjectSearch(env=ENV)
#get all data objects associated with this study id
data = do_client.get_data_objects_for_studies(study_id='nmdc:sty-11-aygzgv51')
data = pd.DataFrame(data)
#reformat data into dataframe
data_objects=[]
for index, row in data.iterrows():
bio_id = row['biosample_id']
row_out = pd.json_normalize(row['data_objects'])
row_out['biosample_id'] = bio_id
data_objects.append(row_out)
data_objects = pd.concat(data_objects).reset_index(drop=True)
display(data_objects)
del data, index, row, row_out, bio_id
| id | type | name | description | file_size_bytes | md5_checksum | data_object_type | url | data_category | was_generated_by | alternative_identifiers | in_manifest | biosample_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:dobj-11-31amy823 | nmdc:DataObject | nmdc_wfmgan-11-bnzfrb88.1_cath_funfam.gff | CATH FunFams for nmdc:wfmgan-11-bnzfrb88.1 | 40275912 | b4188599ca4603182ebfe16ba26205df | CATH FunFams (Functional Families) Annotation GFF | https://data.microbiomedata.org/data/nmdc:ompr... | processed_data | NaN | NaN | NaN | nmdc:bsm-13-0jw5n594 |
| 1 | nmdc:dobj-11-rsbcxm60 | nmdc:DataObject | nmdc_wfmag-11-2xsazh74.1_gtdbtk.ar122.summary.tsv | Archaeal Summary for nmdc:wfmag-11-2xsazh74.1 | 49 | 87d56b6d56bb79e45844847a53a054a7 | GTDBTK Archaeal Summary | https://data.microbiomedata.org/data/nmdc:ompr... | processed_data | nmdc:wfmag-11-2xsazh74.1 | NaN | NaN | nmdc:bsm-13-0jw5n594 |
| 2 | nmdc:dobj-11-vy3qsc25 | nmdc:DataObject | 511406_Froze_Core_2015_N1_0_10_13_Lip_POS_28Ju... | LC-DDA-MS/MS raw data for lipidomics data acqu... | 56565794 | 261b975cc92e1f7b37df31613f8df866 | LC-DDA-MS/MS Raw Data | https://nmdcdemo.emsl.pnnl.gov/lipidomics/steg... | instrument_data | nmdc:dgms-11-ebehx343 | NaN | NaN | nmdc:bsm-13-0jw5n594 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3512 | nmdc:dobj-13-nf2htz54 | nmdc:DataObject | output: Unground_SBR_Spring_2014_FC_N3_20-30_M... | High resolution MS spectra only | 30508405 | NaN | Direct Infusion FT ICR-MS Raw Data | NaN | instrument_data | NaN | [emsl:output_457928] | NaN | nmdc:bsm-13-zq681s85 |
| 3513 | nmdc:dobj-13-8eencq04 | nmdc:DataObject | output: SBR_FC_N3_20-30_H2Oext_13Oct15_Leopard... | High resolution MS spectra only | 30542330 | NaN | Direct Infusion FT ICR-MS Raw Data | NaN | instrument_data | NaN | [emsl:output_456484] | NaN | nmdc:bsm-13-zq681s85 |
| 3514 | nmdc:dobj-13-3r1cxb21 | nmdc:DataObject | SBR_FC_N3_20-30_H2Oext_13Oct15_Leopard_1_01_44... | EnviroMS FT ICR-MS natural organic matter work... | 20800 | c463c5f2768a32f55bc46a1b04229d21 | Direct Infusion FT-ICR MS Analysis Results | https://nmdcdemo.emsl.pnnl.gov/nom/results/SBR... | processed_data | nmdc:wfnom-13-9s4y5107.1 | NaN | NaN | nmdc:bsm-13-zq681s85 |
3515 rows Γ 13 columns
One way of further identifying a NMDC DataObject record is by looking at its slot data_object_type (https://microbiomedata.github.io/nmdc-schema/data_object_type/), which contains a value from FileTypeEnum (https://microbiomedata.github.io/nmdc-schema/FileTypeEnum/).
We want to look at the processed data results for our three omics types of interest in this notebook. Specifically, we want the files containing KEGG Orthology, KEGG Compound, and Enzyme Commission annotations. These annotations are available in separate files for the metagenomics data (in 'Annotation Enzyme Commission' and 'Annotation KEGG Orthology' data objects) and are found together in the metabolomics ('GC-Metabolomics Results') and metaproteomics ('Protein Report') results files.
We filter for results files with the following data_object_type values:
| Value | Description |
|---|---|
| Annotation Enzyme Commission | Tab delimited file for EC annotation |
| Annotation KEGG Orthology | Tab delimited file for KO annotation |
| GC-MS Metabolomics Results | GC-MS-based metabolite assignment results table |
| Protein Report | Filtered protein report file |
# Filter to biosamples that have metagenome EC annotations, metagenome KO
# annotations, metaproteomics results, and metabolomics results
data_objects = data_objects.groupby('biosample_id').filter(lambda x: all(data_type in x['data_object_type'].values for data_type in [
"Annotation Enzyme Commission",
"Annotation KEGG Orthology",
"GC-MS Metabolomics Results",
"Protein Report"
])).reset_index(drop=True).drop(columns=['alternative_identifiers', 'in_manifest'])
# Filter to the desired results file types
results_by_biosample = data_objects[data_objects['data_object_type'].isin([
"Annotation Enzyme Commission", "Annotation KEGG Orthology",
"GC-MS Metabolomics Results", "Protein Report"
])][['biosample_id', 'id','data_object_type', 'url']].rename(columns={'id':'output_id'}).reset_index(drop=True)
# Is there one output for each data object type for each biosample?
results_by_biosample['data_object_type'].value_counts().to_frame()
| count | |
|---|---|
| data_object_type | |
| Protein Report | 62 |
| GC-MS Metabolomics Results | 31 |
| Annotation Enzyme Commission | 31 |
| Annotation KEGG Orthology | 31 |
Choose results by workflow versionΒΆ
Notice that results_by_biosample contains more than one Protein Report results file for some samples. When NMDC releases a new workflow version, existing data may be rerun, which means that multiple versions of the results are available.
We will identify the versions of the results we want by looking up the corresponding WorkflowExecution records.
# Filter protein outputs to those associated with workflow v1.2.1 buy pulling the workflow records
# Define a function to split ids into chunks for the API query
def split_list(input_list, chunk_size=100):
result = []
for i in range(0, len(input_list), chunk_size):
result.append(input_list[i:i + chunk_size])
return result
#Instantiate instances for nmdc_api_utilities
from nmdc_api_utilities.workflow_execution_search import WorkflowExecutionSearch
from nmdc_api_utilities.data_processing import DataProcessing
# since we are querying the workflow collection, we need to create an instance of it
we_client = WorkflowExecutionSearch(env=ENV)
# create an instance of the data processing object
dp_client = DataProcessing()
# create a list of lists of ids to query workflow execution
chunked_list = split_list(results_by_biosample['output_id'].unique().tolist())
version_output = []
# loop through the chunked list of ids
for chunk in chunked_list:
# create the filter - query the WorkflowExecutionSearch object looking for records
filter_list = dp_client._string_mongo_list(chunk)
filter = f'{{"has_output": {{"$in": {filter_list}}}}}'
# get the results
version_output += we_client.get_record_by_filter(filter=filter,fields="has_output,version", max_page_size=100, all_pages=True)
#normalize the json columns and merge
results_by_biosample = pd.json_normalize(version_output).rename(columns={'id':'workflow_id','version':'workflow_version'}).explode('has_output').drop_duplicates().merge(results_by_biosample, left_on='has_output',right_on='output_id').drop('has_output',axis=1)
results_by_biosample = results_by_biosample[['biosample_id','workflow_id','workflow_version','output_id','data_object_type','url']]
results_by_biosample = results_by_biosample[~(results_by_biosample['data_object_type'].str.contains('Protein Report', na=False) & ~results_by_biosample['workflow_version'].str.contains('v1.2.1',na=False))] # ~[condition] translates to keep rows where either both conditions are met or neither is met (Protein Report with v1.2.1 or its another type of report)
results_by_biosample
| biosample_id | workflow_id | workflow_version | output_id | data_object_type | url | |
|---|---|---|---|---|---|---|
| 0 | nmdc:bsm-13-as680w10 | nmdc:wfmgan-11-fk6zc010.1 | v1.1.0 | nmdc:dobj-11-mrpk2420 | Annotation KEGG Orthology | https://data.microbiomedata.org/data/nmdc:ompr... |
| 1 | nmdc:bsm-13-as680w10 | nmdc:wfmgan-11-fk6zc010.1 | v1.1.0 | nmdc:dobj-11-0j3myn58 | Annotation Enzyme Commission | https://data.microbiomedata.org/data/nmdc:ompr... |
| 2 | nmdc:bsm-13-13145k83 | nmdc:wfmgan-11-dvvh0237.1 | v1.1.0 | nmdc:dobj-11-0q21fa03 | Annotation KEGG Orthology | https://data.microbiomedata.org/data/nmdc:ompr... |
| ... | ... | ... | ... | ... | ... | ... |
| 152 | nmdc:bsm-13-kcdh3w94 | nmdc:wfmb-13-03ye9f39.1 | NaN | nmdc:dobj-13-qm771m31 | GC-MS Metabolomics Results | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... |
| 153 | nmdc:bsm-13-peafgc08 | nmdc:wfmb-13-4ymdss58.1 | NaN | nmdc:dobj-13-s32awd29 | GC-MS Metabolomics Results | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... |
| 154 | nmdc:bsm-13-mvpx4z13 | nmdc:wfmb-13-z2dsr125.1 | NaN | nmdc:dobj-13-w697k984 | GC-MS Metabolomics Results | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... |
124 rows Γ 6 columns
Download selected results filesΒΆ
Now we can use the url slot from the filtered DataObject records to read in all of the files containing the annotations of interest.
# Create one URL column per data type
results_by_biosample = results_by_biosample[['biosample_id', 'data_object_type', 'url']].pivot(
index='biosample_id',
columns='data_object_type',
values='url'
).reset_index()
# Function to read TSV files
def read_tsv_from_url(url:str, column_names:None):
response = requests.get(url)
if column_names is None:
data = pd.read_csv(StringIO(response.text), sep='\t')
else:
data = pd.read_csv(StringIO(response.text), sep='\t', header=None, names=column_names)
return data
# Function to read CSV files
def read_csv_from_url(url):
response = requests.get(url)
data = pd.read_csv(StringIO(response.text))
return data
# Read EC results
ec_columns = ["gene_id", "img_ko_flag", "EC", "percent_identity",
"query_start", "query_end", "subj_start", "subj_end",
"evalue", "bit_score", "align_length"]
results_by_biosample['metag_ec_results'] = results_by_biosample['Annotation Enzyme Commission'].apply(
lambda x: read_tsv_from_url(x, ec_columns) if pd.notnull(x) else None
)
# Read KO results
ko_columns = ["gene_id", "img_ko_flag", "ko_term", "percent_identity",
"query_start", "query_end", "subj_start", "subj_end",
"evalue", "bit_score", "align_length"]
results_by_biosample['metag_ko_results'] = results_by_biosample['Annotation KEGG Orthology'].apply(
lambda x: read_tsv_from_url(x, ko_columns) if pd.notnull(x) else None
)
# Read Protein Report
results_by_biosample['metap_results'] = results_by_biosample['Protein Report'].apply(
lambda x: read_tsv_from_url(url=x,column_names=None) if pd.notnull(x) else None
)
# Read GC-MS Metabolomics Results
results_by_biosample['metab_results'] = results_by_biosample['GC-MS Metabolomics Results'].apply(
lambda x: read_csv_from_url(x) if pd.notnull(x) else None
)
results_by_biosample
| data_object_type | biosample_id | Annotation Enzyme Commission | Annotation KEGG Orthology | GC-MS Metabolomics Results | Protein Report | metag_ec_results | metag_ko_results | metap_results | metab_results |
|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:bsm-13-0jw5n594 | https://data.microbiomedata.org/data/nmdc:ompr... | https://data.microbiomedata.org/data/nmdc:ompr... | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | ge... | g... | Data... | Sample name Peak I... |
| 1 | nmdc:bsm-13-13145k83 | https://data.microbiomedata.org/data/nmdc:ompr... | https://data.microbiomedata.org/data/nmdc:ompr... | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | ge... | ge... | Data... | Sample name Peak I... |
| 2 | nmdc:bsm-13-1p0tct86 | https://data.microbiomedata.org/data/nmdc:ompr... | https://data.microbiomedata.org/data/nmdc:ompr... | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | g... | gen... | Dat... | Sample name Peak ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28 | nmdc:bsm-13-xvf06c81 | https://data.microbiomedata.org/data/nmdc:ompr... | https://data.microbiomedata.org/data/nmdc:ompr... | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | gene... | gene... | Dat... | Sample name Peak In... |
| 29 | nmdc:bsm-13-yjp2dx65 | https://data.microbiomedata.org/data/nmdc:ompr... | https://data.microbiomedata.org/data/nmdc:ompr... | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | gene_id ... | g... | Dat... | Sample name Peak ... |
| 30 | nmdc:bsm-13-zms1jq91 | https://data.microbiomedata.org/data/nmdc:ompr... | https://data.microbiomedata.org/data/nmdc:ompr... | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | gene... | g... | Dat... | Sample name Peak In... |
31 rows Γ 9 columns
Each of the downloaded data files contains detailed information including the KO, EC, or KEGG Compound identifiers.
# View a snippet of the metagenome KEGG Orthology annotations file
results_by_biosample['metag_ko_results'].iloc[0]
| gene_id | img_ko_flag | ko_term | percent_identity | query_start | query_end | subj_start | subj_end | evalue | bit_score | align_length | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:wfmgan-11-bnzfrb88.1_000001_5389_7344 | 2794162845 | KO:K03336 | 82.21 | 2 | 651 | 8 | 656 | 0.000000e+00 | 1210.0 | 650 |
| 1 | nmdc:wfmgan-11-bnzfrb88.1_000001_7341_8171 | 2527268703 | KO:K02510 | 78.12 | 13 | 268 | 11 | 265 | 7.400000e-139 | 453.0 | 256 |
| 2 | nmdc:wfmgan-11-bnzfrb88.1_000001_12306_12872 | 2995783332 | KO:K00010 | 78.31 | 1 | 189 | 5 | 193 | 5.000000e-104 | 345.0 | 189 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 100358 | nmdc:wfmgan-11-bnzfrb88.1_208488_20_277 | 2843963262 | KO:K01563 | 45.78 | 1 | 79 | 30 | 112 | 4.700000e-14 | 78.3 | 79 |
| 100359 | nmdc:wfmgan-11-bnzfrb88.1_208592_3_146 | 2686680187 | KO:K01938 | 87.23 | 1 | 47 | 514 | 560 | 2.500000e-22 | 97.7 | 47 |
| 100360 | nmdc:wfmgan-11-bnzfrb88.1_208762_147_263 | 2946693089 | KO:K00675 | 78.95 | 2 | 39 | 22 | 59 | 8.700000e-13 | 69.9 | 38 |
100361 rows Γ 11 columns
Extract unique identificationsΒΆ
From each results table we can extract the unique list of genes/proteins/metabolites identified in that sample.
We also create a vector for each biomolecule containing all unique identifications across all samples, to simplify API calls later.
# Metagenome annotations - KEGG Orthology
# Save a unique list of gene KO for each sample
results_by_biosample['metag_ko_unique'] = results_by_biosample['metag_ko_results'].apply(
lambda x: sorted(x['ko_term'].dropna().unique(),key=str)
)
# Save a unique list of ALL gene KO for searching later (removing nas that will mess with chunked API calls)
metag_ko_unique = list(set(results_by_biosample['metag_ko_unique'].explode().dropna()))
# Metagenome annotations - Enzyme Commission
# Save a unique vector of gene EC by sample
results_by_biosample['metag_ec_unique'] = results_by_biosample['metag_ec_results'].apply(
lambda x: sorted(x['EC'].dropna().unique(),key=str)
)
# Save a unique list of ALL gene EC for searching later
metag_ec_unique = list(set(results_by_biosample['metag_ec_unique'].explode().dropna()))
# Metaproteome annotations - KEGG Orthology
# Protein KO by sample
results_by_biosample['metap_ko_unique'] = results_by_biosample['metap_results'].apply(
lambda x: sorted(x['KO'].dropna().unique(),key=str)
)
# ALL protein KO
metap_ko_unique = list(set(results_by_biosample['metap_ko_unique'].explode().dropna()))
# Metaproteome annotations - Enzyme Commission
# Protein EC by sample
results_by_biosample['metap_ec_unique'] = results_by_biosample['metap_results'].apply(
lambda x: sorted(x['EC_Number'].dropna().str.split(',').explode().unique(),key=str)
)
# ALL protein EC
metap_ec_unique = list(set(results_by_biosample['metap_ec_unique'].explode().dropna()))
# Dataframe of metaproteome annotations across all samples
# Keep KO and EC results together
metap_combined_unique = pd.concat(results_by_biosample['metap_results'].tolist(), ignore_index=True)[['KO','EC_Number']].rename(columns = {'EC_Number':'ec_id', 'KO':'ko_id'})
# EC column may have multiple comma separated values
# Reform dataframe to have one EC per row
metap_combined_unique['ec_id'] = metap_combined_unique['ec_id'].str.split(',')
metap_combined_unique = metap_combined_unique.explode('ec_id').reset_index(drop=True).drop_duplicates()
# Remove rows that are all NA
metap_combined_unique.dropna(how='all',inplace=True)
# If KO id is present in more than one row, once with a mapping EC id and once with NA, remove the NA row
# Identify ko ids in more than one row
duplicate = metap_combined_unique['ko_id'].duplicated(keep=False)
# Remove the rows with NA values in ec_id for those groups present more than once
metap_combined_unique = metap_combined_unique[~(duplicate & metap_combined_unique['ec_id'].isna())]
# Metabolome annotations - KEGG Compound
# Compound IDs per sample (added white space removal because one CO had \t)
results_by_biosample['metab_co_unique'] = results_by_biosample['metab_results'].apply(
lambda x: sorted(x['Kegg Compound ID'].dropna().str.strip().unique(),key=str)
)
# ALL compound IDs
metab_co_unique = list(set(results_by_biosample['metab_co_unique'].explode().dropna()))
results_by_biosample
| data_object_type | biosample_id | Annotation Enzyme Commission | Annotation KEGG Orthology | GC-MS Metabolomics Results | Protein Report | metag_ec_results | metag_ko_results | metap_results | metab_results | metag_ko_unique | metag_ec_unique | metap_ko_unique | metap_ec_unique | metab_co_unique |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:bsm-13-0jw5n594 | https://data.microbiomedata.org/data/nmdc:ompr... | https://data.microbiomedata.org/data/nmdc:ompr... | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | ge... | g... | Data... | Sample name Peak I... | [KO:K00001, KO:K00003, KO:K00004, KO:K00008, K... | [EC:1.-, EC:1.1.-, EC:1.1.1.-, EC:1.1.1.1, EC:... | [KO:K01607, KO:K01992, KO:K01999, KO:K02058, K... | [EC:1.8.3.7, EC:3.4.21.-, EC:4.1.1.44, EC:7.4.... | [C00009, C00013, C00016, C00025, C00026, C0003... |
| 1 | nmdc:bsm-13-13145k83 | https://data.microbiomedata.org/data/nmdc:ompr... | https://data.microbiomedata.org/data/nmdc:ompr... | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | ge... | ge... | Data... | Sample name Peak I... | [KO:K00001, KO:K00002, KO:K00003, KO:K00004, K... | [EC:1.-, EC:1.1.-, EC:1.1.1.-, EC:1.1.1.1, EC:... | [KO:K00368, KO:K00958, KO:K01996, KO:K02058, K... | [EC:1.20.4.4, EC:1.7.2.1, EC:2.7.7.4, EC:6.5.1.8] | [C00009, C00013, C00026, C00037, C00041, C0004... |
| 2 | nmdc:bsm-13-1p0tct86 | https://data.microbiomedata.org/data/nmdc:ompr... | https://data.microbiomedata.org/data/nmdc:ompr... | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | g... | gen... | Dat... | Sample name Peak ... | [KO:K00001, KO:K00003, KO:K00004, KO:K00008, K... | [EC:1.-, EC:1.1.-, EC:1.1.1.-, EC:1.1.1.1, EC:... | [KO:K00023, KO:K00077, KO:K00114, KO:K00138, K... | [EC:1.1.1.169, EC:1.1.1.36, EC:1.1.2.10, EC:1.... | [C00009, C00013, C00016, C00020, C00026, C0003... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28 | nmdc:bsm-13-xvf06c81 | https://data.microbiomedata.org/data/nmdc:ompr... | https://data.microbiomedata.org/data/nmdc:ompr... | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | gene... | gene... | Dat... | Sample name Peak In... | [KO:K00001, KO:K00003, KO:K00004, KO:K00005, K... | [EC:1.-, EC:1.1.-, EC:1.1.1.-, EC:1.1.1.1, EC:... | [KO:K00113, KO:K00114, KO:K00124, KO:K00244, K... | [EC:1.1.2.10, EC:1.1.2.8, EC:1.11.1.24, EC:1.1... | [C00013, C00016, C00041, C00047, C00049, C0005... |
| 29 | nmdc:bsm-13-yjp2dx65 | https://data.microbiomedata.org/data/nmdc:ompr... | https://data.microbiomedata.org/data/nmdc:ompr... | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | gene_id ... | g... | Dat... | Sample name Peak ... | [KO:K00001, KO:K00003, KO:K00004, KO:K00008, K... | [EC:1.-, EC:1.1.-, EC:1.1.1.-, EC:1.1.1.1, EC:... | [KO:K00027, KO:K00031, KO:K00033, KO:K00059, K... | [EC:1.1.1.-, EC:1.1.1.100, EC:1.1.1.291, EC:1.... | [C00009, C00013, C00016, C00020, C00025, C0002... |
| 30 | nmdc:bsm-13-zms1jq91 | https://data.microbiomedata.org/data/nmdc:ompr... | https://data.microbiomedata.org/data/nmdc:ompr... | https://nmdcdemo.emsl.pnnl.gov/metabolomics/re... | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | gene... | g... | Dat... | Sample name Peak In... | [KO:K00001, KO:K00003, KO:K00004, KO:K00005, K... | [EC:1.-, EC:1.1.-, EC:1.1.1.-, EC:1.1.1.1, EC:... | [KO:K00008, KO:K00362, KO:K00368, KO:K00518, K... | [EC:1.1.1.14, EC:1.1.1.47, EC:1.1.2.10, EC:1.1... | [C00009, C00013, C00016, C00025, C00042, C0004... |
31 rows Γ 14 columns
2. Use annotations in file output to link identificationsΒΆ
Metagenomic and metaproteomic results in NMDC already have both KEGG Orthology and Enzyme Commission annotations which can be used to link gene and protein identifications without any additional information from KEGG.
The plot below is a simple example of how to start associating the gene/protein results by counting the number of metagenomic KO annotations that also appear in the metaproteomic results.
# For each sample, check which metagenomic KO results also appear in the metaproteomic KO results
for index, row in results_by_biosample.iterrows():
metag_ko_unique_list = row['metag_ko_unique']
metap_ko_unique_list = row['metap_ko_unique']
results_by_biosample.at[index,'genes_w_protein'] = len([value for value in metag_ko_unique_list if value in metap_ko_unique_list])
results_by_biosample = results_by_biosample.sort_values('genes_w_protein', ascending=False).reset_index(drop=True)
plt.figure(figsize=(12, 6))
plt.bar(results_by_biosample['biosample_id'], results_by_biosample['genes_w_protein'], color='steelblue')
plt.title("Metagenome identified genes with KO annotations in metaproteome results")
plt.xlabel("Biosample ID")
plt.ylabel("Number of genes")
plt.xticks(rotation=60, ha='right')
plt.tight_layout()
plt.show()
The rest of this notebook uses the KEGG API to retrieve additional information for identified biomolecules, which enables more multi-omic analysis and provides more context for their biological relevance. Use of the KEGG API is restricted to academic users or commercial license holders (see Licensing Information).
3. Get IDs from other KEGG databasesΒΆ
Now we will use the KEGGREST package (available on Biopython) to make calls to the KEGG API. Using the annotations provided in the NMDC results files, we can look up their corresponding annotations in KEGG databases to start drawing connections between biomolecule identifications.
For example, in the first use of REST.keggLink() below, we are searching the KEGG ENZYME database for Enzyme Commission numbers that are linked to each KEGG Compound ID in the character list metab_co_unique.
We will also search for linked KEGG modules and pathways. KEGG pathways are manually drawn maps that represent all the reactions in a biological process and are typically composed of a number of more discrete functional units, the KEGG modules.
Notes:
keggLink()calls are paired with a 0.5stime.sleep()command. This is to limit the number of API calls we are making per second to avoid overloading the KEGG REST API and having API calls rejected.- Character lists of IDs to query are broken up into 50-element pieces to avoid overly long HTTP request URLs.
Gather metabolite informationΒΆ
First we will find all of the Enzyme Commission numbers available for each compound identified in the metabolomics results. These EC numbers represent enzymes that catalyze reactions involving the compound of interest.
# Define function to get IDs for a certain KEGG database
def get_ids(compounds,database):
time.sleep(0.5)
dict = {}
return_string = REST.kegg_link(database, compounds).read()
strings = return_string.strip().split('\n')
for string in strings:
if string:
key, value = string.split('\t')
if key in dict:
dict[key].append(value)
else:
dict[key] = [value]
return dict
# get EC ids for each metabolite CO
ec_from_metabolites = {}
for i in range(0, len(metab_co_unique), 50):
chunk = metab_co_unique[i:i+50]
ec_from_metabolites.update(get_ids(chunk,'enzyme'))
# Format into a dataframe
ec_from_metabolites = pd.DataFrame({
'compound_id': list(ec_from_metabolites.keys()),
'ec_id': list(ec_from_metabolites.values())
})
display(ec_from_metabolites)
| compound_id | ec_id | |
|---|---|---|
| 0 | cpd:C01179 | [ec:1.1.1.110, ec:1.1.1.237, ec:1.13.11.27, ec... |
| 1 | cpd:C02287 | [ec:1.1.1.167, ec:2.3.1.106] |
| 2 | cpd:C00037 | [ec:1.13.11.78, ec:1.21.4.2, ec:1.4.1.1, ec:1.... |
| ... | ... | ... |
| 478 | cpd:C00794 | [ec:1.1.1.14, ec:1.1.1.15, ec:1.1.1.21, ec:1.1... |
| 479 | cpd:C03547 | [ec:1.14.14.177, ec:1.14.14.80, ec:1.14.15.3] |
| 480 | cpd:C15587 | [ec:2.4.2.1, ec:3.2.2.1] |
481 rows Γ 2 columns
Then we will do the same to pull all of the modules and pathways that each compound is a part of.
modules_from_metabolites = {}
for i in range(0, len(metab_co_unique), 50):
chunk = metab_co_unique[i:i+50]
modules_from_metabolites.update(get_ids(chunk,'module'))
# Format into a dataframe
modules_from_metabolites = pd.DataFrame({
'compound_id': list(modules_from_metabolites.keys()),
'module_id': list(modules_from_metabolites.values())
})
pathways_from_metabolites = {}
for i in range(0, len(metab_co_unique), 50):
chunk = metab_co_unique[i:i+50]
pathways_from_metabolites.update(get_ids(chunk,'pathway'))
# Format into a dataframe
pathways_from_metabolites = pd.DataFrame({
'compound_id': list(pathways_from_metabolites.keys()),
'pathway_id': list(pathways_from_metabolites.values())
})
Join the results to produce a dataframe containing EC, module, and pathway IDs for each metabolite identified across all samples.
# Create the initial dataframe
metabolite_annotations = pd.DataFrame({
'compound_id': [f'cpd:{id}' for id in metab_co_unique],
'compound_trimmed': metab_co_unique
})
#merge
metabolite_annotations = (metabolite_annotations
.merge(ec_from_metabolites, on='compound_id', how='left')
.merge(modules_from_metabolites, on='compound_id', how='left')
.merge(pathways_from_metabolites, on='compound_id', how='left')
)
display(metabolite_annotations)
| compound_id | compound_trimmed | ec_id | module_id | pathway_id | |
|---|---|---|---|---|---|
| 0 | cpd:C01179 | C01179 | [ec:1.1.1.110, ec:1.1.1.237, ec:1.13.11.27, ec... | [md:M00025, md:M00044] | [path:map00130, path:map00261, path:map00350, ... |
| 1 | cpd:C02043 | C02043 | NaN | NaN | NaN |
| 2 | cpd:C02287 | C02287 | [ec:1.1.1.167, ec:2.3.1.106] | NaN | NaN |
| ... | ... | ... | ... | ... | ... |
| 544 | cpd:C00794 | C00794 | [ec:1.1.1.14, ec:1.1.1.15, ec:1.1.1.21, ec:1.1... | NaN | [path:map00051, path:map00052, path:map01100, ... |
| 545 | cpd:C03547 | C03547 | [ec:1.14.14.177, ec:1.14.14.80, ec:1.14.15.3] | NaN | [path:map00071] |
| 546 | cpd:C15587 | C15587 | [ec:2.4.2.1, ec:3.2.2.1] | NaN | NaN |
547 rows Γ 5 columns
Finally, use a filtering join on the list of metabolite identifications per biosample and the list of all metabolite annotations to create a list of dataframes containing all annotations per biosample.
#initialize new column that will store metabolite annotation info for sample
results_by_biosample['metab_df'] = None
for index, row in results_by_biosample.iterrows():
# For each sample, assemble a dataframe of metabolite information
# Start with the unique list of metabolites for this sample
metab_info = pd.DataFrame({'compound_id':[f"cpd:{cpd}" for cpd in row['metab_co_unique']]})
# Filter join all metabolite annotations down to just the ones in this sample
metab_info = metab_info.merge(
metabolite_annotations,
on='compound_id',
how='left'
)
#save dataframe into new column of df
results_by_biosample.at[index,'metab_df'] = metab_info
# Display one example dataframe
display(results_by_biosample.iloc[0]['metab_df'])
| compound_id | compound_trimmed | ec_id | module_id | pathway_id | |
|---|---|---|---|---|---|
| 0 | cpd:C00009 | C00009 | [ec:1.13.11.78, ec:1.13.11.89, ec:1.13.11.90, ... | [md:M00157, md:M00158, md:M00159, md:M00160] | [path:map00190, path:map00195, path:map02010, ... |
| 1 | cpd:C00016 | C00016 | [ec:1.1.5.3, ec:1.1.5.4, ec:1.1.98.3, ec:1.1.9... | [md:M00125, md:M00911] | [path:map00740, path:map01100, path:map01110, ... |
| 2 | cpd:C00025 | C00025 | [ec:1.14.11.76, ec:1.2.1.88, ec:1.4.1.13, ec:1... | [md:M00015, md:M00027, md:M00028, md:M00045, m... | [path:map00220, path:map00250, path:map00330, ... |
| ... | ... | ... | ... | ... | ... |
| 425 | cpd:C19910 | C19910 | [ec:5.1.3.24] | NaN | NaN |
| 426 | cpd:C19911 | C19911 | [ec:1.14.14.10] | NaN | NaN |
| 427 | cpd:C20450 | C20450 | [ec:1.1.3.47] | NaN | [path:map00365, path:map01120] |
428 rows Γ 5 columns
Gather protein informationΒΆ
Next we will do the same thing for proteins in each sample - starting with the KO and EC IDs from the NMDC annotation workflow, identify the modules and pathways each protein is associated with, and what corresponding metabolites/genes were identified in this sample.
# Use concatenated proteins vector, make all the calls once, then split up by biosample
# Get all module ids for each protein
modules_from_proteins = {}
for i in range(0, len(metap_ko_unique), 50):
chunk = metap_ko_unique[i:i+50]
## API doesn't like the prefix in the annotated results, trim it off
chunk = [item.split(':')[1] for item in chunk if isinstance(item, str)]
modules_from_proteins.update(get_ids(chunk,'module'))
# Format into a dataframe
modules_from_proteins = pd.DataFrame({
'ko_id': list(modules_from_proteins.keys()),
'module_id': list(modules_from_proteins.values())
})
# Convert 'ko_id' to uppercase
modules_from_proteins['ko_id'] = modules_from_proteins['ko_id'].str.upper()
# Get all pathway ids for each protein
pathways_from_proteins = {}
for i in range(0, len(metap_ko_unique), 50):
chunk = metap_ko_unique[i:i+50]
chunk = [item.split(':')[1] for item in chunk if isinstance(item, str)]
pathways_from_proteins.update(get_ids(chunk,'pathway'))
# Format into a dataframe
pathways_from_proteins = pd.DataFrame({
'ko_id': list(pathways_from_proteins.keys()),
'pathway_id': list(pathways_from_proteins.values())
})
# Convert 'ko_id' to uppercase
pathways_from_proteins['ko_id'] = pathways_from_proteins['ko_id'].str.upper()
# Assemble, starting from the dataframe of joined KO/EC annotations from the Protein Report
protein_annotations = (metap_combined_unique
.merge(modules_from_proteins,on='ko_id',how='left')
.merge(pathways_from_proteins,on='ko_id',how='left')
)
#initialize new column that will store protein annotation info for sample
results_by_biosample['prot_info'] = None
# Create column for protein dataframes
for index, row in results_by_biosample.iterrows():
# For each sample, assemble a dataframe of protein information
# Start with the unique list of proteins for this sample
prot_info = (pd.DataFrame({'ko_id':row['metap_ko_unique']})
# Filter join all protein annotations down to just the ones in this sample
.merge(protein_annotations, on='ko_id', how='left'))
#save dataframe into new column of df
results_by_biosample.at[index,'prot_info'] = prot_info
# Display one example dataframe
display(results_by_biosample.iloc[0]['prot_info'])
| ko_id | ec_id | module_id | pathway_id | |
|---|---|---|---|---|
| 0 | KO:K00023 | EC:1.1.1.36 | [md:M00373] | [path:map00630, path:ko00630, path:map00650, p... |
| 1 | KO:K00024 | EC:1.1.1.37 | [md:M00009, md:M00011, md:M00012, md:M00168, m... | [path:map00020, path:ko00020, path:map00270, p... |
| 2 | KO:K00059 | EC:1.1.1.100 | [md:M00083, md:M00572, md:M00874] | [path:map00061, path:ko00061, path:map00333, p... |
| ... | ... | ... | ... | ... |
| 141 | KO:K21395 | NaN | NaN | NaN |
| 142 | KO:K23995 | EC:1.1.2.10 | [md:M00174] | [path:map00680, path:ko00680, path:map01100, p... |
| 143 | KO:K24158 | EC:1.11.1.24 | NaN | NaN |
144 rows Γ 4 columns
Gather gene informationΒΆ
Finally we will find modules and pathways for each gene and create the same list of dataframes for the metagenomes in each sample.
# Use concatenated metag KO vector, make all the calls once, then split up by biosample
# Get all module ids for each gene
modules_from_genes = {}
for i in range(0, len(metag_ko_unique), 50):
chunk = metag_ko_unique[i:i+50]
## API doesn't like the prefix in the annotated results, trim it off
chunk = [item.split(':')[1] for item in chunk if isinstance(item, str)]
modules_from_genes.update(get_ids(chunk,'module'))
# Format into a dataframe
modules_from_genes = pd.DataFrame({
'ko_id': list(modules_from_genes.keys()),
'module_id': list(modules_from_genes.values())
})
# Convert 'ko_id' to uppercase
modules_from_genes['ko_id'] = modules_from_genes['ko_id'].str.upper()
# Get all pathway ids for each gene
pathways_from_genes = {}
for i in range(0, len(metag_ko_unique), 50):
chunk = metag_ko_unique[i:i+50]
chunk = [item.split(':')[1] for item in chunk if isinstance(item, str)]
pathways_from_genes.update(get_ids(chunk,'pathway'))
# Format into a dataframe
pathways_from_genes = pd.DataFrame({
'ko_id': list(pathways_from_genes.keys()),
'pathway_id': list(pathways_from_genes.values())
})
# Convert 'ko_id' to uppercase
pathways_from_genes['ko_id'] = pathways_from_genes['ko_id'].str.upper()
# Assemble
gene_annotations = (pd.DataFrame({'ko_id' : metag_ko_unique})
.merge(modules_from_genes, on='ko_id',how='left')
.merge(pathways_from_genes, on='ko_id',how='left')
)
gene_annotations
| ko_id | module_id | pathway_id | |
|---|---|---|---|
| 0 | KO:K21178 | [md:M00826] | [path:map01059, path:ko01059, path:map01100, p... |
| 1 | KO:K19449 | NaN | NaN |
| 2 | KO:K01938 | [md:M00140, md:M00377] | [path:map00670, path:ko00670, path:map00720, p... |
| ... | ... | ... | ... |
| 7697 | KO:K14667 | NaN | NaN |
| 7698 | KO:K07546 | [md:M00418] | [path:map00623, path:ko00623, path:map01100, p... |
| 7699 | KO:K07450 | NaN | NaN |
7700 rows Γ 3 columns
The metagenome results have KO and EC IDs in different annotation files, as we saw above when reading in results files. Join the EC IDs to the KO IDs here.
There is a filter applied in this step to only retain genes that have EC IDs. This is because there are more gene identifications than there are proteins and metabolites, and for this notebook we want to focus on the genes that could potentially be linked to proteins or metabolites by their EC ID. This affects downstream pathway coverage counts.
#initialize new column that will store gene annotation info for sample
results_by_biosample['gene_info'] = None
# Create column for metagenome dataframes
for index, row in results_by_biosample.iterrows():
# For each sample, assemble a dataframe of combined KO and EC gene information
gene_info = (row['metag_ko_results'][['gene_id','ko_term']]
.merge(row['metag_ec_results'][['gene_id','EC']], on='gene_id', how='outer')
[['ko_term','EC']]
.drop_duplicates()
.reset_index(drop=True)
.rename(columns={'ko_term':'ko_id','EC':'ec_id'}))
# Filter to only KO identifications that also have EC information
gene_info = gene_info.dropna(subset=['ec_id'])
# Now filter join all gene annotations down to just the ones in this sample
gene_info = gene_info.merge(gene_annotations, on='ko_id',how='left')
#save dataframe into new column of df
results_by_biosample.at[index,'gene_info'] = gene_info
# Display one example dataframe
display(results_by_biosample.iloc[0]['gene_info'])
| ko_id | ec_id | module_id | pathway_id | |
|---|---|---|---|---|
| 0 | KO:K02706 | EC:1.10.3.9 | [md:M00161] | [path:map00195, path:ko00195, path:map01100, p... |
| 1 | KO:K03043 | EC:2.7.7.6 | NaN | [path:map03020, path:ko03020] |
| 2 | KO:K03046 | EC:2.7.7.6 | NaN | [path:map03020, path:ko03020] |
| ... | ... | ... | ... | ... |
| 3096 | KO:K18106 | EC:1.1.1.- | [md:M00630] | [path:map00040, path:ko00040, path:map01100, p... |
| 3097 | KO:K13669 | EC:2.4.1.- | NaN | [path:map00571, path:ko00571, path:map01100, p... |
| 3098 | KO:K07679 | EC:2.7.13.3 | NaN | [path:map02020, path:ko02020, path:map05133, p... |
3099 rows Γ 4 columns
Identify linked IDs between biosamplesΒΆ
We can now check which biomolecules have linked annotations in the other two omics types (for example, checking whether each metabolite has a linked annotation in the metagenomic and metaproteomic results).
The code below iterates through samples and adds two columns in each annotation dataframe labelling whether that molecule has a linked annotation among the other two omics types for that sample.
def x_in_y_or_z(data_x, data_y, data_z, column, y_named_output, z_named_output):
for x_index, x_row in data_x.iterrows():
#if data_x's column value is not a float/NAN
if not isinstance(x_row[column], float):
#make the ids being compared a lowercase list to iterate through
x_ids = pd.Series(x_row[column]).dropna().str.lower().tolist()
#make the ids to compare to collapsed lower case lists
all_y_ids = (data_y[column]
.explode()
.dropna().str.lower().tolist())
all_z_ids = (data_z[column]
.explode()
.dropna().str.lower().tolist())
#determine if there are any ids in this row of data_x that match data_y or data_z
In_Y_Annotations = any(element in x_ids for element in all_y_ids)
In_Z_Annotations = any(element in x_ids for element in all_z_ids)
else:
#if data_x's column value is a float/nan then assign it such
In_Y_Annotations = np.nan
In_Z_Annotations = np.nan
#save new columns to metab_df
data_x.at[x_index,y_named_output] = In_Y_Annotations
data_x.at[x_index,z_named_output] = In_Z_Annotations
return data_x
# Add "is this x in y" columns for all biomolecules
for sample_index, sample_row in results_by_biosample.iterrows():
#metabolite, gene and protein info for this biosample
metab_df = sample_row['metab_df']
gene_info = sample_row['gene_info']
prot_info = sample_row['prot_info']
#indicate whether one of metabolite's ec id(s) in this biosample is also in the biosample's metagenome and/or metaproteome results
metab_df = x_in_y_or_z(metab_df, gene_info, prot_info, 'ec_id', 'In_Metag_Annotations', 'In_Metap_Annotations')
results_by_biosample.at[sample_index,'metab_df'] = metab_df
#indicate whether each protein's ec id(s) in this biosample are also in the biosample's metabolome and/or metagenome results
prot_info = x_in_y_or_z(prot_info, metab_df, gene_info, 'ec_id', 'In_Metab_Annotations', 'In_Metag_Annotations')
results_by_biosample.at[sample_index,'prot_info'] = prot_info
#indicate whether each gene's ec id(s) in this biosample are also in the biosample's metabolome and/or metaproteome results
gene_info = x_in_y_or_z(gene_info, metab_df, prot_info, 'ec_id', 'In_Metab_Annotations', 'In_Metap_Annotations')
results_by_biosample.at[sample_index,'gene_info'] = gene_info
display(results_by_biosample.iloc[0]['metab_df'])
display(results_by_biosample.iloc[0]['prot_info'])
display(results_by_biosample.iloc[0]['gene_info'])
/tmp/ipykernel_2226/3724523624.py:29: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'False' has dtype incompatible with float64, please explicitly cast to a compatible dtype first. data_x.at[x_index,y_named_output] = In_Y_Annotations /tmp/ipykernel_2226/3724523624.py:30: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'True' has dtype incompatible with float64, please explicitly cast to a compatible dtype first. data_x.at[x_index,z_named_output] = In_Z_Annotations
/tmp/ipykernel_2226/3724523624.py:29: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'False' has dtype incompatible with float64, please explicitly cast to a compatible dtype first. data_x.at[x_index,y_named_output] = In_Y_Annotations /tmp/ipykernel_2226/3724523624.py:30: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value 'True' has dtype incompatible with float64, please explicitly cast to a compatible dtype first. data_x.at[x_index,z_named_output] = In_Z_Annotations
| compound_id | compound_trimmed | ec_id | module_id | pathway_id | In_Metag_Annotations | In_Metap_Annotations | |
|---|---|---|---|---|---|---|---|
| 0 | cpd:C00009 | C00009 | [ec:1.13.11.78, ec:1.13.11.89, ec:1.13.11.90, ... | [md:M00157, md:M00158, md:M00159, md:M00160] | [path:map00190, path:map00195, path:map02010, ... | True | True |
| 1 | cpd:C00016 | C00016 | [ec:1.1.5.3, ec:1.1.5.4, ec:1.1.98.3, ec:1.1.9... | [md:M00125, md:M00911] | [path:map00740, path:map01100, path:map01110, ... | True | False |
| 2 | cpd:C00025 | C00025 | [ec:1.14.11.76, ec:1.2.1.88, ec:1.4.1.13, ec:1... | [md:M00015, md:M00027, md:M00028, md:M00045, m... | [path:map00220, path:map00250, path:map00330, ... | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 425 | cpd:C19910 | C19910 | [ec:5.1.3.24] | NaN | NaN | False | False |
| 426 | cpd:C19911 | C19911 | [ec:1.14.14.10] | NaN | NaN | False | False |
| 427 | cpd:C20450 | C20450 | [ec:1.1.3.47] | NaN | [path:map00365, path:map01120] | False | False |
428 rows Γ 7 columns
| ko_id | ec_id | module_id | pathway_id | In_Metab_Annotations | In_Metag_Annotations | |
|---|---|---|---|---|---|---|
| 0 | KO:K00023 | EC:1.1.1.36 | [md:M00373] | [path:map00630, path:ko00630, path:map00650, p... | False | True |
| 1 | KO:K00024 | EC:1.1.1.37 | [md:M00009, md:M00011, md:M00012, md:M00168, m... | [path:map00020, path:ko00020, path:map00270, p... | True | True |
| 2 | KO:K00059 | EC:1.1.1.100 | [md:M00083, md:M00572, md:M00874] | [path:map00061, path:ko00061, path:map00333, p... | False | True |
| ... | ... | ... | ... | ... | ... | ... |
| 141 | KO:K21395 | NaN | NaN | NaN | NaN | NaN |
| 142 | KO:K23995 | EC:1.1.2.10 | [md:M00174] | [path:map00680, path:ko00680, path:map01100, p... | False | True |
| 143 | KO:K24158 | EC:1.11.1.24 | NaN | NaN | False | True |
144 rows Γ 6 columns
| ko_id | ec_id | module_id | pathway_id | In_Metab_Annotations | In_Metap_Annotations | |
|---|---|---|---|---|---|---|
| 0 | KO:K02706 | EC:1.10.3.9 | [md:M00161] | [path:map00195, path:ko00195, path:map01100, p... | False | True |
| 1 | KO:K03043 | EC:2.7.7.6 | NaN | [path:map03020, path:ko03020] | False | True |
| 2 | KO:K03046 | EC:2.7.7.6 | NaN | [path:map03020, path:ko03020] | False | True |
| ... | ... | ... | ... | ... | ... | ... |
| 3096 | KO:K18106 | EC:1.1.1.- | [md:M00630] | [path:map00040, path:ko00040, path:map01100, p... | False | True |
| 3097 | KO:K13669 | EC:2.4.1.- | NaN | [path:map00571, path:ko00571, path:map01100, p... | False | False |
| 3098 | KO:K07679 | EC:2.7.13.3 | NaN | [path:map02020, path:ko02020, path:map05133, p... | False | False |
3099 rows Γ 6 columns
4. Visualize shared biomolecule identificationsΒΆ
The pyCirclize package enables construction of circular diagrams with arcs between sectors showing relationships between multiple groups. We will use pyCirclize to create a chord diagram showing where the annotations from the metagenomic, metaproteomic, and metabolomic results link to each other.
First, choose a biosample from the list to visualize. Then use its gathered annotations to calculate the sectors and arcs needed for the plot. There will be three sectors representing genes, proteins, and metabolites, then three arcs connecting the pairs of sectors.
# Select an example biosample
example_genes = results_by_biosample['gene_info'].iloc[0]
example_prot = results_by_biosample['prot_info'].iloc[0]
example_metab = results_by_biosample['metab_df'].iloc[0]
# Scale sectors so that arc widths are proportional rather than absolute for better visualization
largest_sector_xlim = max(len(example_genes['ko_id'].unique()),
len(example_prot['ko_id'].unique()),
len(example_metab['compound_id'].unique()))
# Gene sector
# Calculate size of sector
gene_sector_xlim = len(example_genes['ko_id'].unique())
# Sum the number of genes that show up in the protein annotations, then scale
gene_sector_prot_count = ((example_genes[['ko_id', 'In_Metap_Annotations']].drop_duplicates())['In_Metap_Annotations'].sum()) * (largest_sector_xlim / gene_sector_xlim)
# Sum the number of genes that show up in protein AND metabolite annotations, then scale
gene_sector_both_count = example_genes.groupby('ko_id').apply(lambda x: (x['In_Metab_Annotations'].any() & x['In_Metap_Annotations'].any()), include_groups=False).sum() * (largest_sector_xlim / gene_sector_xlim)
# Sum the number of genes that show up in metabolite annotations, then scale
gene_sector_metab_count = ((example_genes[['ko_id', 'In_Metab_Annotations']].drop_duplicates())['In_Metab_Annotations'].sum()) * (largest_sector_xlim / gene_sector_xlim)
# Prot sector
prot_sector_xlim = len(example_prot['ko_id'].unique())
prot_sector_metab_count = ((example_prot[['ko_id', 'In_Metab_Annotations']].drop_duplicates())['In_Metab_Annotations'].sum()) * (largest_sector_xlim / prot_sector_xlim)
prot_sector_both_count = example_prot.groupby('ko_id').apply(lambda x: (x['In_Metab_Annotations'].any() & x['In_Metag_Annotations'].any()), include_groups=False).sum() * (largest_sector_xlim / prot_sector_xlim)
prot_sector_gene_count = ((example_prot[['ko_id', 'In_Metag_Annotations']].drop_duplicates())['In_Metag_Annotations'].sum()) * (largest_sector_xlim / prot_sector_xlim)
# Metab sector
metab_sector_xlim = len(example_metab['compound_id'].unique())
metab_sector_gene_count = ((example_metab[['compound_id', 'In_Metag_Annotations']].drop_duplicates())['In_Metag_Annotations'].sum()) * (largest_sector_xlim / metab_sector_xlim)
metab_sector_both_count = example_metab.groupby('compound_id').apply(lambda x: (x['In_Metap_Annotations'].any() & x['In_Metag_Annotations'].any()), include_groups=False).sum() * (largest_sector_xlim / metab_sector_xlim)
metab_sector_prot_count = ((example_metab[['compound_id', 'In_Metap_Annotations']].drop_duplicates())['In_Metap_Annotations'].sum()) * (largest_sector_xlim / metab_sector_xlim)
# Reset sector sizes to scaled value
gene_sector_xlim = largest_sector_xlim
prot_sector_xlim = largest_sector_xlim
metab_sector_xlim = largest_sector_xlim
Now use the calculated sectors/arcs to generate the chord diagram.
# Initialize circos sectors
sectors = {"genes":gene_sector_xlim, "proteins":prot_sector_xlim, "metabolites":metab_sector_xlim}
colors = {"genes": "tomato", "proteins": "gold", "metabolites": "skyblue"}
counts = {"genes":len(example_genes), "proteins":len(example_prot), "metabolites":len(example_metab)}
circos = Circos(sectors, space=3)
# add track for sector colors in outer ring
for sector in circos.sectors:
track = sector.add_track((75, 100))
track.axis(fc=colors[sector.name])
track.text(f"{sector.name} ({counts[sector.name]})", size=15)
#track.xticks_by_interval(200)
# add links between sectors
circos.link(
# Create a link that starts at the beginning of the gene sector
# It is as wide as the scaled number of gene/protein shared annotations (see above)
("genes", 0, gene_sector_prot_count),
# The other end of the link is in the protein sector
# It starts at the number of protein/metabolite but NOT gene shared annotations
# It is as wide as the scaled number of gene/protein shared annotations
("proteins", prot_sector_metab_count - prot_sector_both_count,
prot_sector_metab_count - prot_sector_both_count + prot_sector_gene_count),
color="#ff870099")
# Repeat for metab/genes
circos.link(
("metabolites", 0, metab_sector_gene_count),
("genes", gene_sector_prot_count - gene_sector_both_count,
gene_sector_prot_count - gene_sector_both_count + gene_sector_metab_count),
color="#8b24cd99")
# Repeat for proteins/metab
circos.link(
("proteins", 0, prot_sector_metab_count),
("metabolites", metab_sector_gene_count - metab_sector_both_count,
metab_sector_gene_count - metab_sector_both_count + metab_sector_prot_count),
color="#63ff0099")
circos.plotfig()
plt.show()
In the chord diagram, each arc is showing what proportion of identified biomolecules have a linked annotation in the identified biomolecules on the other end of the arc. For example, the purple arc shows that about a third of the genes have a link to a compound ID that was identified in the metabolomics data for this sample.
Areas where the ends of two arcs overlap (e.g., the green and orange arc ends at the bottom of the protein sector) indicate that those molecules have corresponding identified biomolecules in all three data types.
5. Calculate coverage of KEGG pathways across samplesΒΆ
Assemble a list of all of the pathway IDs that any biomolecules in each sample map to (obtained from API calls earlier). The KEGG API returns duplicate IDs for each pathway, one that is blank, and one that has all of the KO elements highlighted. Exclude the highlighted maps from the list for redundancy. Some of the maps are "overview" maps which encompass an enormous number of metabolic reactions, to the point that it is not especially informative if a compound maps to one. Exclude those pathways as well. (See https://www.genome.jp/kegg/pathway.html#global)
#initialize new column that will store all pathways ids for each biosample
results_by_biosample['pathways'] = None
# Create column for metagenome dataframes
for index, row in results_by_biosample.iterrows():
# Get unique list of pathways from all identified biomolecules
genes = row['gene_info']['pathway_id'].explode().drop_duplicates().dropna().unique().tolist()
proteins = row['prot_info']['pathway_id'].explode().drop_duplicates().dropna().unique().tolist()
metabolites = row['metab_df']['pathway_id'].explode().drop_duplicates().dropna().unique().tolist()
sample_pathways = genes + metabolites + proteins
# Look at basic pathway maps, exclude KO highlighted pathway maps
sample_pathways = [item for item in sample_pathways if 'path:ko' not in item]
# Exclude "overview" pathway maps
sample_pathways = [item for item in sample_pathways if 'path:map01' not in item]
results_by_biosample.at[index,'pathways'] = sample_pathways
Now, using the complete list of pathways, make one more set of API calls to obtain the full list of compounds and KO elements found in that pathway. In this case we are only searching for KO and compound IDs, EC IDs in pathways just point to the corresponding KO element.
There is a filter applied in this step that retains pathways with at least one gene and at least one compound. Some KEGG pathways have all KO/EC elements or all compounds and in this notebook we want to examine pathways that have both.
pathway_ids = list(set(results_by_biosample['pathways'].explode().dropna()))
# get compound ids for each pathway id
compounds_from_pathways = {}
for i in range(0, len(pathway_ids), 50):
chunk = pathway_ids[i:i+50]
compounds_from_pathways.update(get_ids(chunk,'compound'))
# Format into a dataframe
compounds_from_pathways = pd.DataFrame({
'pathway_id': list(compounds_from_pathways.keys()),
'compound_id': list(compounds_from_pathways.values())
})
# Get all KO for each pathway
ko_from_pathways = {}
for i in range(0, len(pathway_ids), 50):
chunk = pathway_ids[i:i+50]
ko_from_pathways.update(get_ids(chunk,'ko'))
# Format into a dataframe
ko_from_pathways = pd.DataFrame({
'pathway_id': list(ko_from_pathways.keys()),
'ko_id': list(ko_from_pathways.values())
})
# Combine KO and compound ids for each pathway
all_pathway_elements = compounds_from_pathways.merge(ko_from_pathways, how='outer', on='pathway_id')
# Count number of compound and ko ids for each pathway
all_pathway_elements['compound_count'] = None
all_pathway_elements['ko_count'] = None
for index, row in all_pathway_elements.iterrows():
if isinstance(row['compound_id'],float) and np.isnan(row['compound_id']):
all_pathway_elements.at[index,'compound_count'] = 0
else:
all_pathway_elements.at[index,'compound_count'] = len(row['compound_id'])
if isinstance(row['ko_id'],float) and np.isnan(row['ko_id']):
all_pathway_elements.at[index,'ko_count'] = 0
else:
all_pathway_elements.at[index,'ko_count'] = len(row['ko_id'])
#count total
all_pathway_elements['total_elements'] = None
all_pathway_elements['total_elements'] = all_pathway_elements['compound_count'] + all_pathway_elements['ko_count']
all_pathway_elements
| pathway_id | compound_id | ko_id | compound_count | ko_count | total_elements | |
|---|---|---|---|---|---|---|
| 0 | path:map00010 | [cpd:C00022, cpd:C00024, cpd:C00031, cpd:C0003... | [ko:K00001, ko:K00002, ko:K00016, ko:K00114, k... | 31 | 108 | 139 |
| 1 | path:map00020 | [cpd:C00022, cpd:C00024, cpd:C00026, cpd:C0003... | [ko:K00024, ko:K00025, ko:K00026, ko:K00030, k... | 20 | 70 | 90 |
| 2 | path:map00030 | [cpd:C00022, cpd:C00031, cpd:C00085, cpd:C0011... | [ko:K00032, ko:K00033, ko:K00034, ko:K00036, k... | 37 | 89 | 126 |
| ... | ... | ... | ... | ... | ... | ... |
| 433 | path:map07216 | [cpd:C01108, cpd:C15474, cpd:C15475, cpd:C1547... | NaN | 5 | 0 | 5 |
| 434 | path:map07225 | [cpd:C00410, cpd:C00735, cpd:C00762, cpd:C01780] | NaN | 4 | 0 | 4 |
| 435 | path:map07226 | [cpd:C00410, cpd:C00468, cpd:C00535, cpd:C0095... | NaN | 5 | 0 | 5 |
436 rows Γ 6 columns
Now we will calculate the "coverage" of each KEGG pathway found in each biosample by simply dividing the number of biomolecules that were found in the omics data by the total number of elements in the pathway.
#initialize new column that will store all pathways ids for each biosample
results_by_biosample['pathway_coverage'] = None
# Create column for metagenome dataframes
for sample_index, sample_row in results_by_biosample.iterrows():
#all ko ids in biosample
all_ko_detected = list(set(list(sample_row['gene_info']['ko_id'].dropna().str.upper()) + list(sample_row['prot_info']['ko_id'].dropna().str.upper())))
#information for pathways in this biosample
pathway_coverage = all_pathway_elements[all_pathway_elements['pathway_id'].isin(sample_row['pathways'])].copy()
#biosample's metabolite compound ids
metab_cpids = sample_row['metab_df']['compound_id'].tolist()
pathway_coverage['compound_found'] = int
pathway_coverage['compound_coverage'] = int
pathway_coverage['ko_found'] = int
pathway_coverage['ko_coverage'] = int
for path_index, path_row in pathway_coverage.iterrows():
#pathway's compound ids
path_cpids = path_row['compound_id']
#pathway's ko ids
path_koids = path_row['ko_id']
if isinstance(path_cpids,float) and np.isnan(path_cpids):
pathway_coverage.at[path_index,'compound_found'] = 0
pathway_coverage.at[path_index,'compound_coverage'] = 0
else:
compound_found = len([item for item in path_cpids if item in metab_cpids])
pathway_coverage.at[path_index,'compound_found'] = compound_found
pathway_coverage.at[path_index,'compound_coverage'] = compound_found / path_row['compound_count']
if isinstance(path_koids,float) and np.isnan(path_koids):
pathway_coverage.at[path_index,'ko_found'] = 0
pathway_coverage.at[path_index,'ko_coverage'] = 0
else:
path_koids = [item.upper() for item in path_koids]
ko_found = len([item for item in path_koids if item in all_ko_detected])
pathway_coverage.at[path_index,'ko_found'] = ko_found
pathway_coverage.at[path_index,'ko_coverage'] = ko_found / path_row['ko_count']
results_by_biosample.at[sample_index,'pathway_coverage'] = pathway_coverage
display(results_by_biosample.iloc[0]['pathway_coverage'])
| pathway_id | compound_id | ko_id | compound_count | ko_count | total_elements | compound_found | compound_coverage | ko_found | ko_coverage | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | path:map00010 | [cpd:C00022, cpd:C00024, cpd:C00031, cpd:C0003... | [ko:K00001, ko:K00002, ko:K00016, ko:K00114, k... | 31 | 108 | 139 | 7 | 0.225806 | 62 | 0.574074 |
| 1 | path:map00020 | [cpd:C00022, cpd:C00024, cpd:C00026, cpd:C0003... | [ko:K00024, ko:K00025, ko:K00026, ko:K00030, k... | 20 | 70 | 90 | 5 | 0.25 | 40 | 0.571429 |
| 2 | path:map00030 | [cpd:C00022, cpd:C00031, cpd:C00085, cpd:C0011... | [ko:K00032, ko:K00033, ko:K00034, ko:K00036, k... | 37 | 89 | 126 | 12 | 0.324324 | 49 | 0.550562 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 431 | path:map07110 | [cpd:C00108, cpd:C00156] | NaN | 2 | 0 | 2 | 1 | 0.5 | 0 | 0 |
| 433 | path:map07216 | [cpd:C01108, cpd:C15474, cpd:C15475, cpd:C1547... | NaN | 5 | 0 | 5 | 1 | 0.2 | 0 | 0 |
| 435 | path:map07226 | [cpd:C00410, cpd:C00468, cpd:C00535, cpd:C0095... | NaN | 5 | 0 | 5 | 1 | 0.2 | 0 | 0 |
354 rows Γ 10 columns
#Instantiate instances for nmdc_api_utilities
from nmdc_api_utilities.biosample_search import BiosampleSearch
# since we are querying the biosample collection, we need to create an instance of it
bs_client = BiosampleSearch(env=ENV)
# create a list of lists of ids to query the BiosampleSearch object
chunked_list = split_list(results_by_biosample['biosample_id'].unique().tolist())
biosamples = []
# loop through the chunked list of ids
for chunk in chunked_list:
# create the filter - query the BiosampleSearch object looking for records
filter_list = dp_client._string_mongo_list(chunk)
filter = f'{{"id": {{"$in": {filter_list}}}}}'
# get the results
biosamples += bs_client.get_record_by_filter(filter=filter,fields="id,samp_name,description,depth.has_numeric_value,lat_lon", max_page_size=100, all_pages=True)
#normalize the json columns and merge with results
results_by_biosample = pd.json_normalize(biosamples).merge(results_by_biosample,left_on='id',right_on='biosample_id',how='inner')
results_by_biosample
| id | description | samp_name | depth.has_numeric_value | lat_lon.has_raw_value | lat_lon.latitude | lat_lon.longitude | lat_lon.type | biosample_id | Annotation Enzyme Commission | ... | metag_ec_unique | metap_ko_unique | metap_ec_unique | metab_co_unique | genes_w_protein | metab_df | prot_info | gene_info | pathways | pathway_coverage | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:bsm-13-0jw5n594 | Riverbed sediment samples were collected from ... | GW-RW N1_0_10 | 0.0 | 46.37380143 -119.2723688 | 46.373801 | -119.272369 | nmdc:GeolocationValue | nmdc:bsm-13-0jw5n594 | https://data.microbiomedata.org/data/nmdc:ompr... | ... | [EC:1.-, EC:1.1.-, EC:1.1.1.-, EC:1.1.1.1, EC:... | [KO:K01607, KO:K01992, KO:K01999, KO:K02058, K... | [EC:1.8.3.7, EC:3.4.21.-, EC:4.1.1.44, EC:7.4.... | [C00009, C00013, C00016, C00025, C00026, C0003... | 9.0 | compound_id compound_trimmed \ 0 cpd:C... | ko_id ec_id module_id \ 0 KO... | ko_id ec_id module_id \ ... | [path:map00260, path:map00270, path:map00680, ... | pathway_id ... |
| 1 | nmdc:bsm-13-13145k83 | Riverbed sediment samples were collected from ... | GW-RW N3_20_30 | 0.2 | 46.37379502 -119.2722385 | 46.373795 | -119.272239 | nmdc:GeolocationValue | nmdc:bsm-13-13145k83 | https://data.microbiomedata.org/data/nmdc:ompr... | ... | [EC:1.-, EC:1.1.-, EC:1.1.1.-, EC:1.1.1.1, EC:... | [KO:K00368, KO:K00958, KO:K01996, KO:K02058, K... | [EC:1.20.4.4, EC:1.7.2.1, EC:2.7.7.4, EC:6.5.1.8] | [C00009, C00013, C00026, C00037, C00041, C0004... | 11.0 | compound_id compound_trimmed \ 0 cpd:C... | ko_id ec_id module... | ko_id ec_id ... | [path:map00230, path:map00240, path:map00760, ... | pathway_id ... |
| 2 | nmdc:bsm-13-1p0tct86 | Riverbed sediment samples were collected from ... | GW-RW N1_20_30 | 0.2 | 46.37380143 -119.2723688 | 46.373801 | -119.272369 | nmdc:GeolocationValue | nmdc:bsm-13-1p0tct86 | https://data.microbiomedata.org/data/nmdc:ompr... | ... | [EC:1.-, EC:1.1.-, EC:1.1.1.-, EC:1.1.1.1, EC:... | [KO:K00023, KO:K00077, KO:K00114, KO:K00138, K... | [EC:1.1.1.169, EC:1.1.1.36, EC:1.1.2.10, EC:1.... | [C00009, C00013, C00016, C00020, C00026, C0003... | 100.0 | compound_id compound_trimmed \ 0 cpd:C... | ko_id ec_id ... | ko_id ec_id mod... | [path:map04016, path:map04978, path:map00230, ... | pathway_id ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28 | nmdc:bsm-13-xvf06c81 | Riverbed sediment samples were collected from ... | GW-RW N2_40_50 | 0.4 | 46.37380273 -119.2723038 | 46.373803 | -119.272304 | nmdc:GeolocationValue | nmdc:bsm-13-xvf06c81 | https://data.microbiomedata.org/data/nmdc:ompr... | ... | [EC:1.-, EC:1.1.-, EC:1.1.1.-, EC:1.1.1.1, EC:... | [KO:K00113, KO:K00114, KO:K00124, KO:K00244, K... | [EC:1.1.2.10, EC:1.1.2.8, EC:1.11.1.24, EC:1.1... | [C00013, C00016, C00041, C00047, C00049, C0005... | 91.0 | compound_id compound_trimmed \ 0 cpd:C... | ko_id ec_id module_id \ 0 ... | ko_id ec_id module_id ... | [path:map00790, path:map04122, path:map04016, ... | pathway_id ... |
| 29 | nmdc:bsm-13-yjp2dx65 | Riverbed sediment samples were collected from ... | GW-RW N1_10_20 | 0.1 | 46.37380143 -119.2723688 | 46.373801 | -119.272369 | nmdc:GeolocationValue | nmdc:bsm-13-yjp2dx65 | https://data.microbiomedata.org/data/nmdc:ompr... | ... | [EC:1.-, EC:1.1.-, EC:1.1.1.-, EC:1.1.1.1, EC:... | [KO:K00027, KO:K00031, KO:K00033, KO:K00059, K... | [EC:1.1.1.-, EC:1.1.1.100, EC:1.1.1.291, EC:1.... | [C00009, C00013, C00016, C00020, C00025, C0002... | 137.0 | compound_id compound_trimmed \ 0 cpd:C... | ko_id ec_id ... | ko_id ec_id mo... | [path:map00230, path:map00240, path:map00760, ... | pathway_id ... |
| 30 | nmdc:bsm-13-zms1jq91 | Riverbed sediment samples were collected from ... | GW-RW S3_50_60 | 0.5 | 46.37228508 -119.2716818 | 46.372285 | -119.271682 | nmdc:GeolocationValue | nmdc:bsm-13-zms1jq91 | https://data.microbiomedata.org/data/nmdc:ompr... | ... | [EC:1.-, EC:1.1.-, EC:1.1.1.-, EC:1.1.1.1, EC:... | [KO:K00008, KO:K00362, KO:K00368, KO:K00518, K... | [EC:1.1.1.14, EC:1.1.1.47, EC:1.1.2.10, EC:1.1... | [C00009, C00013, C00016, C00025, C00042, C0004... | 87.0 | compound_id compound_trimmed \ 0 cpd:C... | ko_id ec_id modu... | ko_id ec_id module_id \ ... | [path:map00190, path:map00010, path:map00260, ... | pathway_id ... |
31 rows Γ 28 columns
For this study, the samples were collected in two areas along the Columbia River. Create a dataframe of colors representing which area each sample was taken from to use in heatmaps.
#round latitude for grouping
results_by_biosample['latitude_rounded'] = round(results_by_biosample['lat_lon.latitude'],ndigits=3)
# Set up color list for side grouping column
sample_color=pd.DataFrame({'latitude_rounded':results_by_biosample['latitude_rounded'].unique(),'color':['orange','blue']})
#setup color legend for sample type
type_col_dict = dict(zip(sample_color['latitude_rounded'].unique(), ['orange','blue']))
handles = [Patch(facecolor=type_col_dict[name]) for name in type_col_dict]
#map graph colors based on sample type to Ids
sample_type=results_by_biosample[['latitude_rounded','biosample_id']].drop_duplicates()
sample_type_col=sample_type.merge(sample_color,how='left',on='latitude_rounded').set_index('biosample_id').drop('latitude_rounded',axis=1).rename(columns={'color':'latitude_rounded'})
sample_type_col
| latitude_rounded | |
|---|---|
| biosample_id | |
| nmdc:bsm-13-0jw5n594 | orange |
| nmdc:bsm-13-13145k83 | orange |
| nmdc:bsm-13-1p0tct86 | orange |
| ... | ... |
| nmdc:bsm-13-xvf06c81 | orange |
| nmdc:bsm-13-yjp2dx65 | orange |
| nmdc:bsm-13-zms1jq91 | blue |
31 rows Γ 1 columns
Create pathway coverage heatmapsΒΆ
In section 5 we created a list of dataframes containing the coverage of each pathway in each biosample, broken down by compound, KO, and total coverage.
We can visualize this as a heatmap, where the columns are KEGG pathways, the rows are biosamples, and the color gradient shows pathways with high/low coverage in each sample.
Here we look at the KO term coverage (the proportion of KO terms in the pathway that appear as identifications in our metagenomic and/or metaproteomic data).
# explode pathway coverage information, keeping grouping and id information
exploded_pathway_coverage = []
for sample_index, sample_row in results_by_biosample.iterrows():
pathway_coverage = sample_row['pathway_coverage']
pathway_coverage['biosample_id'] = sample_row['id']
exploded_pathway_coverage.append(pathway_coverage)
exploded_pathway_coverage = pd.concat(exploded_pathway_coverage,ignore_index=True)
exploded_pathway_coverage = pd.DataFrame(exploded_pathway_coverage)
#create KO matrix
ko_matrix=exploded_pathway_coverage[['biosample_id','pathway_id','ko_coverage']].pivot_table('ko_coverage', index='biosample_id', columns='pathway_id').astype(float).fillna(0)
#create compound matrix
compound_matrix=exploded_pathway_coverage[['biosample_id','pathway_id','compound_coverage']].pivot_table('compound_coverage', index='biosample_id', columns='pathway_id').astype(float).fillna(0)
#heatmap
ko_heatmap = sns.clustermap(data=ko_matrix,row_colors=sample_type_col,cmap='mako',xticklabels=False,yticklabels=False)
ko_heatmap.figure.suptitle("KEGG pathway coverage across biosamples (KO terms only)",y=1.05)
ko_heatmap.ax_heatmap.set_xlabel("KEGG Pathways")
ko_heatmap.ax_heatmap.set_ylabel("Biosamples")
#add cmap title
ko_heatmap.ax_cbar.set_title('coverage')
#add sample type legend
plt.legend(handles, type_col_dict, title='',
bbox_to_anchor=(0.22, 0.12), bbox_transform=plt.gcf().transFigure)
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance. warnings.warn(msg) /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance. warnings.warn(msg)
<matplotlib.legend.Legend at 0x7f82cec4b2d0>
The KO coverage heatmap has some features that stand out but does not show much difference in coverage between the sample groups, seen by the lack of clustering on the sidebar.
Now look at only compound coverage values (the proportion of compounds in each pathway that appear as identifications in the metabolomics results).
#heatmap
compound_heatmap = sns.clustermap(data=compound_matrix,row_colors=sample_type_col,cmap='mako',xticklabels=False,yticklabels=False)
compound_heatmap.figure.suptitle("KEGG pathway coverage across biosamples (Compounds only)",y=1.05)
compound_heatmap.ax_heatmap.set_xlabel("KEGG Pathways")
compound_heatmap.ax_heatmap.set_ylabel("Biosamples")
#add cmap title
compound_heatmap.ax_cbar.set_title('coverage')
#add sample type legend
plt.legend(handles, type_col_dict, title='',
bbox_to_anchor=(0.22, 0.12), bbox_transform=plt.gcf().transFigure)
/opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance. warnings.warn(msg) /opt/hostedtoolcache/Python/3.11.14/x64/lib/python3.11/site-packages/seaborn/matrix.py:560: UserWarning: Clustering large matrix with scipy. Installing `fastcluster` may give better performance. warnings.warn(msg)
<matplotlib.legend.Legend at 0x7f82cede3950>
The compound-only coverage dataframe shows the most clustering of samples by the sampling area.
Highest coverage pathwaysΒΆ
Finally, look at trends across all the samples, such as the pathways with the highest average total coverage (calculated from all biomolecules rather than separating KEGG orthologs and compounds as in the heatmaps.) The cell below prints pathways that had over 50% coverage in any sample, ordered by average total coverage.
We can also use the KEGG pathway documentation to find some more human-readable information for each of these pathway IDs. The pathways are grouped into categories. See the table of contents at https://www.genome.jp/kegg/pathway.html.
#calculate total coverage
exploded_pathway_coverage['total_coverage'] = (exploded_pathway_coverage['compound_found']+exploded_pathway_coverage['ko_found'])/exploded_pathway_coverage['total_elements']
#filter to pathways that have 50% coverage in at least one sample
high_coverage_pathways = exploded_pathway_coverage[['biosample_id','pathway_id','total_coverage']].\
groupby('pathway_id').filter(lambda x: (x['total_coverage'] >= 0.5).any()).reset_index(drop=True)
#order by average coverage
high_coverage_pathways['average_coverage'] = high_coverage_pathways.groupby('pathway_id')['total_coverage'].transform('mean').reset_index(drop=True)
high_coverage_pathways = high_coverage_pathways.sort_values(by='average_coverage',ascending=False).reset_index(drop=True)
# Create better labelling for plot
top_pathways = pd.DataFrame({
'pathway_id' : ["path:map00571", "path:map00710", "path:map00020", "path:map00220", "path:map00770",
"path:map00250", "path:map00260", "path:map04122", "path:map00010"],
'pathway_category' : ["1.7 Glycan biosynthesis and metabolism", "1.2 Energy metabolism",
"1.1 Carbohydrate metabolism", "1.5 Amino acid metabolism",
"1.8 Metabolism of cofactors and vitamins", "1.5 Amino acid metabolism",
"1.5 Amino acid metabolism", "2.3 Folding, sorting, and degradation",
"1.1 Carbohydrate metabolism"],
'pathway_name' : ["Lipoarabinomannan (LAM) biosynthesis", "Carbon fixation by Calvin cycle",
"Citrate cycle (TCA cycle)", "Arginine biosynthesis", "Pantothenate and CoA biosynthesis",
"Alanine, aspartate and glutamate metabolism", "Glycine, serine and threonine metabolism",
"Sulfur relay system", "Gluconeogenesis"]})
high_coverage_pathways = high_coverage_pathways.merge(top_pathways,on='pathway_id',how='inner')
high_coverage_pathways
| biosample_id | pathway_id | total_coverage | average_coverage | pathway_category | pathway_name | |
|---|---|---|---|---|---|---|
| 0 | nmdc:bsm-13-ay11wa90 | path:map00571 | 0.5 | 0.572581 | 1.7 Glycan biosynthesis and metabolism | Lipoarabinomannan (LAM) biosynthesis |
| 1 | nmdc:bsm-13-5bfhqy53 | path:map00571 | 0.6875 | 0.572581 | 1.7 Glycan biosynthesis and metabolism | Lipoarabinomannan (LAM) biosynthesis |
| 2 | nmdc:bsm-13-kcdh3w94 | path:map00571 | 0.5 | 0.572581 | 1.7 Glycan biosynthesis and metabolism | Lipoarabinomannan (LAM) biosynthesis |
| ... | ... | ... | ... | ... | ... | ... |
| 276 | nmdc:bsm-13-13145k83 | path:map00010 | 0.467626 | 0.475052 | 1.1 Carbohydrate metabolism | Gluconeogenesis |
| 277 | nmdc:bsm-13-mvpx4z13 | path:map00010 | 0.510791 | 0.475052 | 1.1 Carbohydrate metabolism | Gluconeogenesis |
| 278 | nmdc:bsm-13-c0xjg970 | path:map00010 | 0.431655 | 0.475052 | 1.1 Carbohydrate metabolism | Gluconeogenesis |
279 rows Γ 6 columns
# Create violin plot
violin_plt = sns.violinplot(x='pathway_name', y='total_coverage', hue='pathway_category', data=high_coverage_pathways)
sns.move_legend(violin_plt, "upper left", bbox_to_anchor=(1, 1))
plt.xticks(rotation=90)
plt.show()
Licensing InformationΒΆ
This notebook uses the R packages KEGGREST to interface with the KEGG API. Use of the KEGG API is restricted to academic users. Non-academic users must obtain a commercial license. (See https://www.kegg.jp/kegg/legal.html) The National Microbiome Data Collaborative maintains a paid license to use KEGG (more information).
Tenenbaum D, Maintainer B (2024). KEGGREST: Client-side REST access to the Kyoto Encyclopedia of Genes and Genomes (KEGG). R package version 1.46.0, https://bioconductor.org/packages/KEGGREST.