Proteomic Data Aggregation and Visualization¶
This notebook demonstrates aggregation of proteomic data via the National Microbiome Data Collaborative (NMDC)'s Runtime API. It uses the nmdc_api_utilities package (as of March 2025) to do this exploration. It involves using nmdc_api_utilites objects to make NMDC API requests easier, and using utility function to help us with data processing tasks. More information about the package can be found here. It highlights how the NMDC's schema can be used to overcome some of the numerous challenges associated with this type of aggregation. Please note that this notebook is intended for individuals with experience performing mass spectrometry based proteomic analyses and that various parameter and processing choices were made for this example use case. They are not broadly applicable and should be adjusted as needed.
Notebook Steps:
Assess background information and collect datasets for an example study of riverbed sediment along the Columbia River
Apply a spectral probability filter across the data that optimizes the number of identifications for an FDR of 0.05
Collapse to unique peptides and normalize quantification
Extract functional gene annotations for proteins
Generate annotation and protein mappings for peptides using "Razor" strategy
Perform protein rollup using the "Razor" results and summarize into an aggregated table of relative protein abundance
Define a function to split a list into chunks¶
Since we will need to use a list of ids to query a new collection in the API, we need to limit the number of ids we put in a query. This function splits a list into chunks of 100. Note that the chunk_size has a default of 100, but can be adjusted.
# Define a function to split ids into chunks
def split_list(input_list, chunk_size=100):
result = []
for i in range(0, len(input_list), chunk_size):
result.append(input_list[i:i + chunk_size])
return result
Import libraries and python scripts containing functions necessary to run this notebook. 'aggregation_functions.py' (also in this folder) includes spectral probability filtering and protein mapping functions. nmdc_api_utilities includes functions for API traversal of the collections endpoint.
import requests
from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sys
from IPython.display import Image, display
import nmdc_api_utilities
if 'google.colab' in sys.modules:
#module in this folder with specific protein aggregation functions
!wget https://raw.githubusercontent.com/microbiomedata/nmdc_notebooks/refs/heads/main/proteomic_aggregation/python/aggregation_functions.py
import aggregation_functions as agg_func
#pmartR logo
!wget https://raw.githubusercontent.com/microbiomedata/nmdc_notebooks/refs/heads/main/proteomic_aggregation/pmartR_logo_final.jpg
else:
#module in this folder with specific protein aggregation functions
import aggregation_functions as agg_func
1) Assess background information and collect data for an example study of riverbed sediment along the Columbia River¶
Review the example study on the NMDC data portal. Use the study id embedded in the url (nmdc:sty-11-aygzgv51) to collect all related data objects via nmdc_api_utilities package and reformat the json output into a pandas dataframe. These data objects reference both input files (i.e. raw, gff) and output files (i.e. metaproteomic results) to the NMDC workflows. To get the data_objects, we will create a DataObjectSearch object and use the get_data_objects_for_studies() function.
pd.set_option("display.max_rows", 6)
from nmdc_api_utilities.data_object_search import DataObjectSearch
do_client = DataObjectSearch(env=ENV)
#get all data objects associated with this study id using the NMDC find endpoint
data = do_client.get_data_objects_for_studies(study_id='nmdc:sty-11-aygzgv51')
data = pd.DataFrame(data)
#reformat data into dataframe (keeping biosample id)
data_objects=[]
for index, row in data.iterrows():
bio_id = row['biosample_id']
row_out = pd.json_normalize(row['data_objects'])
row_out['biosample_id'] = bio_id
data_objects.append(row_out)
data_objects = pd.concat(data_objects)[['id','name','file_size_bytes','data_object_type','md5_checksum','url','biosample_id','in_manifest']]
display(data_objects)
del data, index, row, row_out, bio_id
| id | name | file_size_bytes | data_object_type | md5_checksum | url | biosample_id | in_manifest | |
|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:dobj-11-gxq95j86 | nmdc_wfmgan-11-bnzfrb88.1_prodigal.gff | 90723471 | Prodigal Annotation GFF | 47f381a56527a186939f638fa0486d5e | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-13-0jw5n594 | NaN |
| 1 | nmdc:dobj-11-b8aph505 | nmdc_wfmgan-11-bnzfrb88.1_imgap.info | 470 | Annotation Info File | 2854a85740fdd20dd61bbe5c9897cc24 | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-13-0jw5n594 | NaN |
| 2 | nmdc:dobj-11-td9k2w62 | 511406_Froze_Core_2015_N1_0_10_13_Lip_POS_28Ju... | 12668 | Configuration toml | 6c5596d6f9f0446f1a0a628ff25a860a | https://nmdcdemo.emsl.pnnl.gov/lipidomics/steg... | nmdc:bsm-13-0jw5n594 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7 | nmdc:dobj-13-kay9ez18 | Unground_SBR_Spring_2014_FC_N3_20-30_MeOHExt_2... | 74228 | Direct Infusion FT-ICR MS Analysis Results | afc41e08ea7aba78e8fe3d3489db64c0 | https://nmdcdemo.emsl.pnnl.gov/nom/results/Ung... | nmdc:bsm-13-zq681s85 | NaN |
| 8 | nmdc:dobj-13-c6rdvy62 | output: Unground_SBR_Spring_2014_FC_N3_20-30_H... | 30556452 | Direct Infusion FT ICR-MS Raw Data | NaN | NaN | nmdc:bsm-13-zq681s85 | NaN |
| 9 | nmdc:dobj-13-xq8sqj89 | Unground_SBR_Spring_2014_FC_N3_20-30_H2Oext_22... | 10636 | Direct Infusion FT-ICR MS Analysis Results | a47e870bdf10041849e36f66ab34177a | https://nmdcdemo.emsl.pnnl.gov/nom/results/Ung... | nmdc:bsm-13-zq681s85 | NaN |
3515 rows × 8 columns
Subset the data objects to 'Unfiltered Metaproteomic Results'. These files contain the proteomic workflow outputs that will be used for proteomic aggregation.
proteomic_output_df = data_objects[data_objects['data_object_type']=='Unfiltered Metaproteomics Results'].reset_index(drop=True).rename(columns={'id':'processed_DO_id'})
#how many result files exist for each biosample?
display(proteomic_output_df.groupby('biosample_id')['data_object_type'].value_counts().reset_index())
| biosample_id | data_object_type | count | |
|---|---|---|---|
| 0 | nmdc:bsm-13-0jw5n594 | Unfiltered Metaproteomics Results | 2 |
| 1 | nmdc:bsm-13-13145k83 | Unfiltered Metaproteomics Results | 2 |
| 2 | nmdc:bsm-13-1p0tct86 | Unfiltered Metaproteomics Results | 2 |
| ... | ... | ... | ... |
| 28 | nmdc:bsm-13-xvf06c81 | Unfiltered Metaproteomics Results | 2 |
| 29 | nmdc:bsm-13-yjp2dx65 | Unfiltered Metaproteomics Results | 2 |
| 30 | nmdc:bsm-13-zms1jq91 | Unfiltered Metaproteomics Results | 2 |
31 rows × 3 columns
Since there is more than one result file for each biosample, choose a workflow version to use in the analysis.
#Instantiate instances for nmdc_api_utilities
from nmdc_api_utilities.workflow_execution_search import WorkflowExecutionSearch
from nmdc_api_utilities.data_processing import DataProcessing
# since we are querying the workflow collection, we need to create an instance of it
we_client = WorkflowExecutionSearch(env=ENV)
# create an instance of the data processing object
dp_client = DataProcessing()
# create a list of lists of processed proteomic result ids to query the workflow execution objects
chunked_list = split_list(proteomic_output_df['processed_DO_id'].unique().tolist())
version_output = []
# loop through the chunked list of ids
for chunk in chunked_list:
# create the filter - query the WorkflowExecutionSearch object looking for records
filter_list = dp_client._string_mongo_list(chunk)
filter = f'{{"has_output": {{"$in": {filter_list}}}}}'
# get the results
version_output += we_client.get_record_by_filter(filter=filter,fields="has_output,version", max_page_size=100, all_pages=True)
#normalize the json columns and merge
proteomic_output_df = pd.json_normalize(version_output).rename(columns={'id':'workflow_id','version':'workflow_version'}).explode('has_output').drop_duplicates().merge(proteomic_output_df, left_on='has_output',right_on='processed_DO_id').drop('has_output',axis=1)
proteomic_output_df = proteomic_output_df[proteomic_output_df['workflow_version'].str.contains('v1.2.1',na=False)]
proteomic_output_df
| workflow_id | workflow_version | processed_DO_id | name | file_size_bytes | data_object_type | md5_checksum | url | biosample_id | in_manifest | |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | nmdc:wfmp-11-j2tktj20.1 | v1.2.1 | nmdc:dobj-11-2k6s1505 | nmdc_dobj-11-ms76kj12_nmdc_dobj-11-c5av0320_ms... | 5471346 | Unfiltered Metaproteomics Results | 1badc622f20cf869077284966bb2810f | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | nmdc:bsm-13-8e1rjf10 | [nmdc:manif-11-7796sg87] |
| 4 | nmdc:wfmp-11-3z777070.1 | v1.2.1 | nmdc:dobj-11-2szyc560 | nmdc_dobj-11-tezmee57_nmdc_dobj-11-w2c1ee09_ms... | 8117239 | Unfiltered Metaproteomics Results | fd8a212ad33a8162857d0ca164760333 | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | nmdc:bsm-13-vqk1y255 | [nmdc:manif-11-7796sg87] |
| 5 | nmdc:wfmp-11-7h4m6t36.1 | v1.2.1 | nmdc:dobj-11-301yh759 | nmdc_dobj-11-zd25ft61_nmdc_dobj-11-vek6s281_ms... | 10756563 | Unfiltered Metaproteomics Results | c82e8026a2d974cb4f9a75b6eef5c62b | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | nmdc:bsm-13-yjp2dx65 | [nmdc:manif-11-7796sg87] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 59 | nmdc:wfmp-11-k5qcay19.1 | v1.2.1 | nmdc:dobj-11-z8hc0j30 | nmdc_dobj-11-hp2vmj63_nmdc_dobj-11-716e7y35_ms... | 8376109 | Unfiltered Metaproteomics Results | 8785b9538d977a50fd35d11bfcb037ce | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | nmdc:bsm-13-2fw5j754 | [nmdc:manif-11-7796sg87] |
| 60 | nmdc:wfmp-11-f6sfn088.1 | v1.2.1 | nmdc:dobj-11-zhz66007 | nmdc_dobj-11-mzxj8743_nmdc_dobj-11-pxd5ez06_ms... | 9983187 | Unfiltered Metaproteomics Results | d596283fd19d30d8708afab0a8677e67 | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | nmdc:bsm-13-as680w10 | [nmdc:manif-11-7796sg87] |
| 61 | nmdc:wfmp-11-xm389s70.1 | v1.2.1 | nmdc:dobj-11-zn9eyg46 | nmdc_dobj-11-6n49rm30_nmdc_dobj-11-nb9f8z34_ms... | 2321178 | Unfiltered Metaproteomics Results | 8eb29a562832f1a8862a5a643e297389 | https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... | nmdc:bsm-13-sepcxd06 | [nmdc:manif-11-7796sg87] |
31 rows × 10 columns
There are various requirements that enable mass spectrometry runs to be aggregated and analyzed together. For example, runs need to be performed in succession, on the same instrument. The NMDC schema can make it easier to find these proteomic results by linking them via a slot called in_manifest.
Look at the in_manifest id on these proteomic outputs to confirm that all runs are in the same manifest record, and pull that record. If that manifest record's manifest_category value is 'instrument_run', then it confirms that these are LC-MS/MS runs that were performed in succession on the same instrument. Proteomic outputs from different manifest records should not be aggregated.
Using the nmdc_api_utilities package, we create a ManifestSearch object, create the needed filter, and use the get_record_by_filter() function to get records.
from nmdc_api_utilities.manifest_search import ManifestSearch
#extract in_manifest ids for this study (in this case a single one available)
manifest_id = proteomic_output_df.explode('in_manifest')['in_manifest'].unique()
# since we are querying the manifest collection, we need to create an instance of it
m_client = ManifestSearch(env=ENV)
# create a list of lists of ids to query the ManifestSearch object
chunked_list = split_list(proteomic_output_df['in_manifest'].explode().unique().tolist())
manifest = []
# loop through the chunked list of ids
for chunk in chunked_list:
# create the filter - query the ManifestSearch object looking for recordds where the id field is in the chunk of ids from proteomic_output_df
filter_list = dp_client._string_mongo_list(chunk)
filter = f'{{"id": {{"$in": {filter_list}}}}}'
# get the results
manifest += m_client.get_record_by_filter(filter=filter, max_page_size=100, all_pages=True)
display(manifest)
[{'id': 'nmdc:manif-11-7796sg87',
'manifest_category': 'instrument_run',
'type': 'nmdc:Manifest',
'description': 'collection of metaproteomic analyses from the same instrument run nmdc:sty-11-aygzgv51'}]
Look at an example of the information in 'Unfiltered Metaproteomics Results', which contains peptide identification and relative abundance information.
#example unfiltered results
unfilt_results = proteomic_output_df.iloc[0]["url"]
print(unfilt_results)
display(agg_func.tsv_extract(unfilt_results))
del unfilt_results
https://nmdcdemo.emsl.pnnl.gov/proteomics/results/nmdc_dobj-11-ms76kj12_nmdc_dobj-11-c5av0320_msgfplus_syn_PlusSICStats.txt
| ResultID | Scan | FragMethod | SpecIndex | Charge | PrecursorMZ | DelM | DelM_PPM | MH | Peptide | ... | PeakMaxIntensity | PeakSignalToNoiseRatio | FWHMInScans | PeakArea | ParentIonIntensity | ParentIonMZ | StatMomentsArea | PeakScanStart | PeakScanEnd | PeakWidthMinutes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 36947 | HCD | 1 | 3 | 968.47449 | -0.00469 | -1.61740 | 2902.410370 | K.NYSPYYNTIDDLKDQIVDLTVGNNK.T | ... | 6619700.0 | 106.00 | 62 | 503970000.0 | 4468100.0 | 968.47 | 453600000.0 | 36866 | 37232 | 0 |
| 1 | 2 | 30576 | HCD | 2 | 3 | 1017.16516 | -0.01640 | -5.38151 | 3048.494105 | K.VLYDAEISQIHQSVTDTNVILSMDNSR.N | ... | 2595300.0 | 343.30 | 115 | 227780000.0 | 2481100.0 | 1017.17 | 206570000.0 | 30497 | 30787 | 0 |
| 2 | 3 | 21671 | HCD | 3 | 3 | 1050.48071 | 0.02323 | 7.38083 | 3148.400992 | K.NADSTLHTVTSGTAEGGESGTVFDSSYMAAGK.T | ... | 2393100.0 | 165.90 | 14 | 51712000.0 | 1454200.0 | 1050.48 | 53478000.0 | 21598 | 21730 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22014 | 22015 | 10397 | HCD | 19177 | 2 | 535.74500 | -0.00116 | -1.08134 | 1069.480463 | K.DMMELFQR.A | ... | 1673900.0 | 37.75 | 213 | 246200000.0 | 2129400.0 | 535.74 | 230580000.0 | 10205 | 10427 | 0 |
| 22015 | 22016 | 7224 | HCD | 19179 | 3 | 525.91705 | -0.01298 | -8.23982 | 1575.749593 | K.VEETEKGEEQLER.L | ... | 2701400.0 | 20.38 | 104 | 291450000.0 | 2969100.0 | 526.25 | 241570000.0 | 7097 | 7395 | 0 |
| 22016 | 22017 | 32524 | HCD | 19178 | 2 | 1130.05615 | -0.00976 | -4.32084 | 2259.114693 | -.SDIPGQDCKMIFGHLDELLK.I | ... | 736733.0 | 24.86 | 59 | 32040000.0 | 812098.0 | 1130.06 | 27600000.0 | 32481 | 32598 | 0 |
22017 rows × 36 columns
Extract information from all 33 proteomic results via the function iterate_file_extract() in agg_func, and put them into a single dataframe, where each scan in each dataset has the unique identifier SpecID. Clean prefix and suffix off of each peptide sequence. Since this data was processed using a target-decoy approach, determine the type of protein being matched to each peptide: contaminant, reverse (false positive match to the reversed amino acid sequence of a protein), or forward (match to the true, forward amino acid sequence of a protein). The presence of forward and reverse matches enables FDR estimation in the next step.
unfilt_res = agg_func.iterate_file_extract(identifier_col='processed_DO_id',
url_col='url',
extract_cols=['Charge','Scan','Peptide','Protein','MSGFDB_SpecEValue','StatMomentsArea'],
pd_df=proteomic_output_df,
filter_col = None,
filter_values = None,
file_type='tsv'
)
#create identifier for each scan in each dataset
unfilt_res["SpecID"] = unfilt_res.apply(lambda row: str(row["id_col"]) + "_" + str(row["Scan"]), axis=1)
#clean off the prefix and suffix from the sequence but keep any mods
unfilt_res["Peptide Sequence with Mods"] = unfilt_res["Peptide"].apply(agg_func.sequence_noprefsuff)
del unfilt_res['Peptide']
#determine protein type (contaminant, reverse, forward)
unfilt_res["Protein_Type"] = unfilt_res["Protein"].apply(agg_func.findproteinname)
unfilt_res
| Charge | Scan | Protein | MSGFDB_SpecEValue | StatMomentsArea | id_col | SpecID | Peptide Sequence with Mods | Protein_Type | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 36947 | Contaminant_K1C9_HUMAN | 1.454800e-31 | 453600000.0 | nmdc:dobj-11-2k6s1505 | nmdc:dobj-11-2k6s1505_36947 | NYSPYYNTIDDLKDQIVDLTVGNNK | None |
| 1 | 3 | 30576 | Contaminant_K22E_HUMAN | 7.111100e-31 | 206570000.0 | nmdc:dobj-11-2k6s1505 | nmdc:dobj-11-2k6s1505_30576 | VLYDAEISQIHQSVTDTNVILSMDNSR | None |
| 2 | 3 | 21671 | nmdc:wfmgan-11-a71gkg84.1_003113_596_1117 | 1.396200e-28 | 53478000.0 | nmdc:dobj-11-2k6s1505 | nmdc:dobj-11-2k6s1505_21671 | NADSTLHTVTSGTAEGGESGTVFDSSYMAAGK | Forward |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 945188 | 2 | 14621 | XXX_nmdc:wfmgan-11-hyze3t58.1_29324_190_333 | 4.997800e-07 | 336710000.0 | nmdc:dobj-11-zn9eyg46 | nmdc:dobj-11-zn9eyg46_14621 | FSFFYDMR | Reversed |
| 945189 | 2 | 24865 | nmdc:wfmgan-11-hyze3t58.1_08744_41_232 | 4.998600e-07 | 363810000.0 | nmdc:dobj-11-zn9eyg46 | nmdc:dobj-11-zn9eyg46_24865 | RTMTEYGHK | Forward |
| 945190 | 2 | 15161 | nmdc:wfmgan-11-hyze3t58.1_72540_2_109 | 4.999300e-07 | 25782000.0 | nmdc:dobj-11-zn9eyg46 | nmdc:dobj-11-zn9eyg46_15161 | WFFIYK | Forward |
945191 rows × 9 columns
2) Apply a spectral probability filter across the data that optimizes the number of identifications for an FDR of 0.05¶
A challenge associated with aggregating mass spectrometry data is that there are always false identifications, which can be mitigated by imposing a spectral probability filter on the data being analyzed. The same spectral probability filter needs to be applied across datasets when they are being compared. The filter value itself is chosen by weighing the number of 'true' identifications retained with the proximity of the data to a chosen false discovery rate (FDR) (usually 0.05 or 0.01). NMDC's metaproteomic workflow provides 'true' and 'false' identifications for FDR estimation in the 'Unfiltered Metaproteomic Result' files.
Create a dataframe of peptide identifications (ignoring protein mapping). Filter identifications to the peptide sequence with the smallest SpecEValue for each SpecID, so there is a single, highest probability identification for each scan.
edata = unfilt_res[['SpecID','Peptide Sequence with Mods','MSGFDB_SpecEValue','Protein_Type','StatMomentsArea']].drop_duplicates() # important to remove any duplicated rows here!
#for each SpecID, select the peptide spectrum match with the smallest MSGFDB_SpecEValue (.idxmin() takes the first entry when there's multiple matches)
idx = edata.groupby(['SpecID'])['MSGFDB_SpecEValue'].idxmin()
edata = edata.loc[idx].reset_index(drop=True)
del idx
display(edata)
assert len(edata['SpecID'].unique())==edata.shape[0], "still more than one identification per scan"
| SpecID | Peptide Sequence with Mods | MSGFDB_SpecEValue | Protein_Type | StatMomentsArea | |
|---|---|---|---|---|---|
| 0 | nmdc:dobj-11-2k6s1505_10005 | PPDERERSEEAEKRDEERDRVRDELLAGAEEGEPR | 4.994300e-07 | Forward | 1.238500e+09 |
| 1 | nmdc:dobj-11-2k6s1505_10008 | LRAGSEPR | 1.172600e-07 | Forward | 1.234600e+08 |
| 2 | nmdc:dobj-11-2k6s1505_10009 | WAKEIENQK | 2.473900e-07 | Forward | 5.731600e+07 |
| ... | ... | ... | ... | ... | ... |
| 797680 | nmdc:dobj-11-zn9eyg46_9976 | CRDKWYQMAEGK | 4.110300e-07 | Reversed | 6.141800e+06 |
| 797681 | nmdc:dobj-11-zn9eyg46_9979 | WWQQINK | 4.622300e-08 | Reversed | 4.143000e+07 |
| 797682 | nmdc:dobj-11-zn9eyg46_9980 | RFEECERECK | 1.126600e-07 | Forward | 2.284500e+06 |
797683 rows × 5 columns
Create separate dataframes of forward and reverse peptide spectrum matches.
forward_peptides = edata[edata["Protein_Type"] == "Forward"].copy().reset_index(drop=True)
del forward_peptides["Protein_Type"]
display(forward_peptides)
| SpecID | Peptide Sequence with Mods | MSGFDB_SpecEValue | StatMomentsArea | |
|---|---|---|---|---|
| 0 | nmdc:dobj-11-2k6s1505_10005 | PPDERERSEEAEKRDEERDRVRDELLAGAEEGEPR | 4.994300e-07 | 1.238500e+09 |
| 1 | nmdc:dobj-11-2k6s1505_10008 | LRAGSEPR | 1.172600e-07 | 1.234600e+08 |
| 2 | nmdc:dobj-11-2k6s1505_10009 | WAKEIENQK | 2.473900e-07 | 5.731600e+07 |
| ... | ... | ... | ... | ... |
| 401082 | nmdc:dobj-11-zn9eyg46_9959 | DAVSGRHHAQGR | 6.589700e-09 | 1.566600e+07 |
| 401083 | nmdc:dobj-11-zn9eyg46_9967 | FYRFSGR | 1.080500e-07 | 3.418000e+08 |
| 401084 | nmdc:dobj-11-zn9eyg46_9980 | RFEECERECK | 1.126600e-07 | 2.284500e+06 |
401085 rows × 4 columns
reversed_peptides = edata[edata["Protein_Type"] == "Reversed"].copy().reset_index(drop=True)
del reversed_peptides["Protein_Type"]
del edata
display(reversed_peptides)
| SpecID | Peptide Sequence with Mods | MSGFDB_SpecEValue | StatMomentsArea | |
|---|---|---|---|---|
| 0 | nmdc:dobj-11-2k6s1505_10013 | EFVDIISYMENENHSDIEYPLLYKWDSKSTVINR | 1.021500e-07 | 7.371400e+08 |
| 1 | nmdc:dobj-11-2k6s1505_10019 | QWHPNFLR | 3.448500e-07 | 3.220300e+09 |
| 2 | nmdc:dobj-11-2k6s1505_10026 | LAEREGGAR | 1.451200e-08 | 9.604800e+08 |
| ... | ... | ... | ... | ... |
| 383749 | nmdc:dobj-11-zn9eyg46_9969 | TDTATHWHK | 1.879300e-07 | 1.960000e+08 |
| 383750 | nmdc:dobj-11-zn9eyg46_9976 | CRDKWYQMAEGK | 4.110300e-07 | 6.141800e+06 |
| 383751 | nmdc:dobj-11-zn9eyg46_9979 | WWQQINK | 4.622300e-08 | 4.143000e+07 |
383752 rows × 4 columns
Use the function optimize_specFilt() in agg_func to find a log10 spectral probability filter that weighs the number of forward peptides retained with the proximity of the dataset to a 0.05 spectral FDR. Visualize the impact of the spectral probability filter by plotting the number of forward and reverse peptides retained.
The main plot below is a histogram of forward and reverse peptides across all spectral probability values. The inset within this plot depicts a subset of the smallest spectral probabililty values, with the red bar before the dashed line representing the estimated number of false identifications that will be included in this analysis.
#initial guess at a log10 spectral probability filter value
initial_specprob_filter = -15
#perform optimization
optimization = agg_func.optimize_specFilt(initial_specprob_filter,forward_peptides,reversed_peptides)
#visualize filter
fitted_params=optimization.x
fig, ax_main = plt.subplots()
ax_inset = plt.axes([0.45, 0.45, 0.35, 0.35])
# Main plot
# forward peptides
hist, bins = np.histogram(forward_peptides["MSGFDB_SpecEValue"], bins=50)
ax_main.bar(bins[:-1], hist, width=np.diff(bins), align='edge', color='g', label='forward')
# reverse peptides
hist, bins = np.histogram(reversed_peptides["MSGFDB_SpecEValue"], bins=50)
ax_main.bar(bins[:-1], hist, width=np.diff(bins), align='edge',color='r', alpha=0.7, label='reverse')
# filter cutoff
ax_main.axvline(x=10 ** fitted_params[0], color="black", label = 'filter cutoff', linestyle="--")
ax_main.set_ylabel('number of peptide spectrum matches')
ax_main.set_xlabel('MSGFDB_SpecEValue')
ax_main.ticklabel_format(style='plain', axis='x')
ax_main.set_title(f'impact of spectral probability filter {10 ** fitted_params[0]}')
ax_main.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=2)
# Inset plot
# forward peptides
hist, bins = np.histogram(forward_peptides[forward_peptides["MSGFDB_SpecEValue"] < 0.000000002]["MSGFDB_SpecEValue"], bins=18)
ax_inset.bar(bins[:-1], hist, width=np.diff(bins), align='edge', color='g')
# reverse peptides
hist, bins = np.histogram(reversed_peptides[reversed_peptides["MSGFDB_SpecEValue"] < 0.000000002]["MSGFDB_SpecEValue"], bins=18)
ax_inset.bar(bins[:-1], hist, width=np.diff(bins), align='edge',color='r', alpha=0.7)
# filter cutoff
ax_inset.axvline(x=10 ** fitted_params[0], color="black", label = 'filter cutoff', linestyle="--")
ax_inset.set_xlabel('MSGFDB_SpecEValue')
plt.show
<function matplotlib.pyplot.show(close=None, block=None)>
Apply the filter to the dataset and recalculate peptide and spectral FDR.
#Filter the data according to the filter and recalculate FDR
forward_peptides = forward_peptides[
(forward_peptides["MSGFDB_SpecEValue"] < 10 ** optimization.x[0])
].copy()
reversed_peptides = reversed_peptides[
(reversed_peptides["MSGFDB_SpecEValue"] < 10 ** optimization.x[0])
].copy()
# Calculate FDR
f_spec = (forward_peptides["SpecID"].unique().size)
r_spec = reversed_peptides["SpecID"].unique().size
if (f_spec == 0) & (r_spec == 0):
fdr_spec = 1
else:
fdr_spec = (2*r_spec) / (f_spec + r_spec)
f_pep = forward_peptides["Peptide Sequence with Mods"].unique().size
r_pep = reversed_peptides["Peptide Sequence with Mods"].unique().size
if (f_pep == 0) & (r_pep == 0):
fdr_pep = 1
else:
fdr_pep = (r_pep) / (f_pep + r_pep)
del f_spec, r_spec, f_pep, r_pep
print("Spectral FDR:",fdr_spec,"\nPeptide FDR:",fdr_pep)
Spectral FDR: 0.049787398531117126 Peptide FDR: 0.05560704355885079
3) Collapse to unique peptides and normalize their relative abundance¶
At this point in analysis the data has been filtered to only high probability peptide identifications, but more than one scan within a dataset can have the same peptide identification. This can be due to the peptide eluting into the mass spectrometer over the course of multiple scans or a peptide eluting as multiple charge states. Sum the relative abundance for peptide sequences detected more than once in a dataset, leaving a total relative abundance value for each peptide in each dataset.
#extract data set id
forward_peptides['processed_DO_id'] = forward_peptides['SpecID'].str.split('_').str[0]
#drop SpecID and spectral probability columns since no longer relevant
forward_peptides.drop(['SpecID','MSGFDB_SpecEValue'],axis=1, inplace=True)
#for each peptide sequence with mods, sum the abundances for all scans/identifications
forward_peptides = forward_peptides.groupby(['processed_DO_id','Peptide Sequence with Mods'])['StatMomentsArea'].sum().to_frame().reset_index()
display(forward_peptides)
| processed_DO_id | Peptide Sequence with Mods | StatMomentsArea | |
|---|---|---|---|
| 0 | nmdc:dobj-11-2k6s1505 | AALAANSELMNK | 115710000.0 |
| 1 | nmdc:dobj-11-2k6s1505 | AFTVDEMR | 59240000.0 |
| 2 | nmdc:dobj-11-2k6s1505 | AGGSITSSGGVERNDVDYSAVLTK | 5785600.0 |
| ... | ... | ... | ... |
| 9757 | nmdc:dobj-11-zn9eyg46 | VAELEATTAR | 521284000.0 |
| 9758 | nmdc:dobj-11-zn9eyg46 | VAELEATTVR | 20899000.0 |
| 9759 | nmdc:dobj-11-zn9eyg46 | WLWNYYQYYK | 582031.0 |
9760 rows × 3 columns
Visualize the untransformed and un-normalized relative abundances.
#plot untransformed abundance values
untransformed_abundances_fig, ax = plt.subplots(figsize=(8,6))
sns.boxplot(x='processed_DO_id',y='StatMomentsArea',data=forward_peptides)
plt.ticklabel_format(style='plain', axis='y')
plt.xticks([])
plt.xlabel('sample')
plt.ylabel('Relative Peptide Abundance (Un-Normalized)')
Text(0, 0.5, 'Relative Peptide Abundance (Un-Normalized)')
Apply log2 transformation and median normalize peptide abundances.
#log2 tranformation
forward_peptides['StatMomentsAreaLog2']=np.log2(forward_peptides['StatMomentsArea'])
# Calculate group-wise (sample wise) median
group_medians = forward_peptides.groupby('processed_DO_id')['StatMomentsAreaLog2'].median()
# Calculate data wide median
all_data_median=forward_peptides['StatMomentsAreaLog2'].median()
# Subtract sample wise median from each value within its group
forward_peptides['StatMomentsAreaLog+Norm'] = forward_peptides.apply(
lambda row: row['StatMomentsAreaLog2'] - group_medians[row['processed_DO_id']], axis=1
)
#add back in a data wide median value to avoid negative abundances
forward_peptides['StatMomentsAreaLog+Norm'] = forward_peptides['StatMomentsAreaLog+Norm'] + all_data_median
transformed_abundances_fig, ax = plt.subplots(figsize=(8,6))
sns.boxplot(x='processed_DO_id',y='StatMomentsAreaLog+Norm',data=forward_peptides)
plt.xticks([])
plt.xlabel('sample')
plt.ylabel('Relative Peptide Abundance (Normalized)')
del ax, group_medians, all_data_median, forward_peptides['StatMomentsArea'], forward_peptides['StatMomentsAreaLog2'], reversed_peptides
4) Extract functional gene annotations for proteins¶
Collect peptide to protein mapping information for the passing peptide sequences.
peptide_protein_mapping = pd.DataFrame(unfilt_res[unfilt_res['Peptide Sequence with Mods'].isin(forward_peptides['Peptide Sequence with Mods'])][['Peptide Sequence with Mods','Protein']].drop_duplicates()).reset_index(drop=True)
peptide_protein_mapping
| Peptide Sequence with Mods | Protein | |
|---|---|---|
| 0 | NADSTLHTVTSGTAEGGESGTVFDSSYMAAGK | nmdc:wfmgan-11-a71gkg84.1_003113_596_1117 |
| 1 | ETNPTLADIEQDSKVKDAIEK | nmdc:wfmgan-11-a71gkg84.1_046034_35_346 |
| 2 | GETYVNVHTEKNPNGEIR | nmdc:wfmgan-11-a71gkg84.1_000318_745_1245 |
| ... | ... | ... |
| 16091 | EDRTPIVEIMNR | nmdc:wfmgan-11-hyze3t58.1_63946_3_284 |
| 16092 | EARFYSLYDKVWR | nmdc:wfmgan-11-hyze3t58.1_06697_29_526 |
| 16093 | FVRIEDETFNR | nmdc:wfmgan-11-hyze3t58.1_00200_3_428 |
16094 rows × 2 columns
Annotation information for these proteins can be found in 'Functional Annotation GFF' files.
Since the data_objects dataframe contains all objects associated with our study id, it also contains the relevant 'Functional Annotation GFF' files. Subset this dataframe to GFF files associated with the 33 biosample ids that have a proteomic output in proteomic_output_df.
annotation_input_df = data_objects[(data_objects['data_object_type']=='Functional Annotation GFF') & (data_objects['biosample_id'].isin(proteomic_output_df['biosample_id'].unique().tolist()))].reset_index(drop=True)
annotation_input_df = annotation_input_df[['id','file_size_bytes','data_object_type','md5_checksum','url','biosample_id']]
display(annotation_input_df)
| id | file_size_bytes | data_object_type | md5_checksum | url | biosample_id | |
|---|---|---|---|---|---|---|
| 0 | nmdc:dobj-11-fdngms90 | 77411808 | Functional Annotation GFF | e4344ebdca641bd24709f89b8ff83dd3 | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-13-0jw5n594 |
| 1 | nmdc:dobj-11-xxffzc65 | 66314886 | Functional Annotation GFF | 9051bc023e5b14373dc016e6a3a0943b | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-13-13145k83 |
| 2 | nmdc:dobj-11-c24s6186 | 114196066 | Functional Annotation GFF | 3049a60a48baca9133461a7cde827b6f | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-13-1p0tct86 |
| ... | ... | ... | ... | ... | ... | ... |
| 28 | nmdc:dobj-11-tv82b372 | 60997438 | Functional Annotation GFF | cfd4e4c3c17d4d2020f364555836c655 | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-13-xvf06c81 |
| 29 | nmdc:dobj-11-92qp1w39 | 76608329 | Functional Annotation GFF | cda53e77cb70328498abc29685baf61c | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-13-yjp2dx65 |
| 30 | nmdc:dobj-11-2f3gzn94 | 80194278 | Functional Annotation GFF | 8e69768342ba2c92cd7feac23a8d933f | https://data.microbiomedata.org/data/nmdc:ompr... | nmdc:bsm-13-zms1jq91 |
31 rows × 6 columns
Preview the 'Functional Annotation GFF' files and determine a subset of gene annotation information that should be pulled from all 33 files.
#example annotation file
annotation_file=annotation_input_df.iloc[0]["url"]
print(annotation_file)
display(agg_func.gff_extract_features(annotation_file))
del annotation_file
https://data.microbiomedata.org/data/nmdc:omprc-13-e4rebn56/nmdc:wfmgan-11-bnzfrb88.1/nmdc_wfmgan-11-bnzfrb88.1_functional_annotation.gff
| ID | translation_table | partial | start_type | product | product_source | cath_funfam | cog | pfam | superfamily | ... | bound_moiety | ncRNA_class | intron_start | intron_end | number_of_repeats | average_repeat_length | median_repeat_length | average_spacer_length | median_spacer_length | Parent | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | nmdc:wfmgan-11-bnzfrb88.1_000001_3_2729 | 11 | 5' | Edge | diaminopimelate decarboxylase/amino acid trans... | COG0019/COG0531 | 2.40.37.10,3.20.20.10 | COG0019,COG0531 | PF00278,PF02784,PF13520 | 51419 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | nmdc:wfmgan-11-bnzfrb88.1_000001_2750_2869 | 11 | NaN | ATG | hypothetical protein | Hypo-rule applied | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | nmdc:wfmgan-11-bnzfrb88.1_000001_2910_3746 | 11 | NaN | ATG | biotin carboxyl carrier protein | COG0511 | 2.70.70.10 | COG0511 | PF01551 | 51230 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 272943 | nmdc:wfmgan-11-bnzfrb88.1_208965_2_199 | 11 | 5',3' | Edge | F-type H+/Na+-transporting ATPase subunit alpha | KO:K02111 | 1.20.150.20 | COG0056 | PF00306 | 47917 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 272944 | nmdc:wfmgan-11-bnzfrb88.1_208966_2_199 | 11 | 5',3' | Edge | S-adenosyl-L-methionine hydrolase (adenosine-f... | KO:K22205 | 3.40.50.10790 | COG1912 | PF01887 | 102522 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 272945 | nmdc:wfmgan-11-bnzfrb88.1_208967_69_200 | 11 | 3' | ATG | DNA-binding NarL/FixJ family response regulator | COG2197 | 3.40.50.2300 | COG2197 | NaN | 52172 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
272946 rows × 33 columns
Extract information from all 33 annotation files (this takes a while to run) and merge with peptide_protein_mapping so there is a final table of peptide-protein-annotation mapping (annotation_mapping).
#get gene annotation information for proteins
genemapping = annotation_input_df[["id", "url"]].drop_duplicates().reset_index(drop=True)
genemapping = agg_func.iterate_file_extract(
pd_df=genemapping,\
identifier_col='id',\
url_col='url',\
extract_cols=['ID','product','product_source'],\
filter_col = 'ID',
filter_values = peptide_protein_mapping['Protein'].unique().tolist(),
file_type='gff'
)
#merge with protein mapping information. drop columns ID (which is equivalent to Protein) and id_col (which is the dataset id, unnecessary here since peptide to protein information isn't dataset specific)
annotation_mapping = genemapping.merge(peptide_protein_mapping,left_on='ID',right_on='Protein').drop(['ID','id_col'],axis=1)
annotation_mapping
| product | product_source | Peptide Sequence with Mods | Protein | |
|---|---|---|---|---|
| 0 | DNA-directed RNA polymerase subunit beta | KO:K03043 | ALMGSNMQR | nmdc:wfmgan-11-bnzfrb88.1_000010_2256_6224 |
| 1 | heme-degrading monooxygenase HmoA | COG2329 | ENADAYSRGAYPDVLK | nmdc:wfmgan-11-bnzfrb88.1_000125_3919_4230 |
| 2 | uncharacterized protein | KO:K09927 | RGTWWDWAPAK | nmdc:wfmgan-11-bnzfrb88.1_000220_2203_3417 |
| ... | ... | ... | ... | ... |
| 16091 | hypothetical protein | Hypo-rule applied | EAPITALDHK | nmdc:wfmgan-11-21ms6426.1_215294_58_276 |
| 16092 | hypothetical protein | Hypo-rule applied | LWAVDLATGSK | nmdc:wfmgan-11-21ms6426.1_216224_1_207 |
| 16093 | hypothetical protein | Hypo-rule applied | TLFLDVGSR | nmdc:wfmgan-11-21ms6426.1_216224_1_207 |
16094 rows × 4 columns
5) Generate annotation and protein mappings for peptides using "Razor" strategy¶
Identify the razor protein, which is a method of limiting the assignment of degenerate peptides (i.e., peptides that map to more than one forward protein) to a most likely matched protein. 'Razor' references the principle Occam's razor, also known as the law of parsimony.
The rules are as follows:
- If a peptide is unique to a protein, then that protein is the razor
- Else, if a peptide belongs to more than one protein, but one of those proteins has a unique peptide, then that protein is the razor
- Else, if a peptide belongs to more than one protein and one of those proteins has the maximal number of peptides, then that protein is the razor
- Else, if a peptide belongs to more than one protein and more than one of those proteins has the maximal number of peptides, then collapse the proteins and gene annotations into single strings
- Else, if a peptide belongs to more than one protein and more than one of those proteins has a unique peptide, then the peptide is removed from analysis because its mapping is inconclusive
Use annotation_mapping as the input to the function razorprotein() from the agg_func script. This will return protein and gene annotation information for each peptide, according to the above rules.
razormapping = agg_func.razorprotein(annotation_mapping)
razormapping
| product | product_source | Peptide Sequence with Mods | razor | |
|---|---|---|---|---|
| 0 | dodecin | KO:K09165 | AAAAAIETASR | nmdc:wfmgan-11-p3dkkz54.1_002072_639_854 |
| 1 | dodecin, dodecin, dodecin, dodecin, dodecin | KO:K09165, KO:K09165, KO:K09165, KO:K09165, KO... | AAAAAVETASR | nmdc:wfmgan-11-6pq4bd13.1_044073_3_326, nmdc:w... |
| 2 | subtilisin family serine protease | COG1404 | AAAEPFHTAR | nmdc:wfmgan-11-21ms6426.1_002328_1282_2322 |
| ... | ... | ... | ... | ... |
| 4828 | hypothetical protein | Hypo-rule applied | YYADISAWLVDVR | nmdc:wfmgan-11-21ms6426.1_128887_2_331 |
| 4829 | hypothetical protein | Hypo-rule applied | YYGFNWGR | nmdc:wfmgan-11-dvvh0237.1_120316_3_251 |
| 4830 | uncharacterized protein | KO:K06996 | YYSELFGWQVHEVMPTYGLVHTEAGGK | nmdc:wfmgan-11-v53k8v91.1_045759_189_458 |
4831 rows × 4 columns
6) Perform protein rollup and summarize into a final aggregated table of relative protein abundance¶
Combine razor information with relative abundance values.
#merge needs to be 'right' because some peptides are removed in mapping functions if they have indeterminante mappings
forward_peptides = forward_peptides.merge(razormapping,how='right',on=['Peptide Sequence with Mods'])
del annotation_mapping, peptide_protein_mapping, genemapping
forward_peptides
| processed_DO_id | Peptide Sequence with Mods | StatMomentsAreaLog+Norm | product | product_source | razor | |
|---|---|---|---|---|---|---|
| 0 | nmdc:dobj-11-vjmd5h28 | AAAAAIETASR | 30.820153 | dodecin | KO:K09165 | nmdc:wfmgan-11-p3dkkz54.1_002072_639_854 |
| 1 | nmdc:dobj-11-tk9hc081 | AAAAAVETASR | 29.374877 | dodecin, dodecin, dodecin, dodecin, dodecin | KO:K09165, KO:K09165, KO:K09165, KO:K09165, KO... | nmdc:wfmgan-11-6pq4bd13.1_044073_3_326, nmdc:w... |
| 2 | nmdc:dobj-11-sx7cyr58 | AAAEPFHTAR | 28.081336 | subtilisin family serine protease | COG1404 | nmdc:wfmgan-11-21ms6426.1_002328_1282_2322 |
| ... | ... | ... | ... | ... | ... | ... |
| 8599 | nmdc:dobj-11-sx7cyr58 | YYADISAWLVDVR | 27.222319 | hypothetical protein | Hypo-rule applied | nmdc:wfmgan-11-21ms6426.1_128887_2_331 |
| 8600 | nmdc:dobj-11-xm1yjv87 | YYGFNWGR | 23.220936 | hypothetical protein | Hypo-rule applied | nmdc:wfmgan-11-dvvh0237.1_120316_3_251 |
| 8601 | nmdc:dobj-11-7psagy79 | YYSELFGWQVHEVMPTYGLVHTEAGGK | 27.015979 | uncharacterized protein | KO:K06996 | nmdc:wfmgan-11-v53k8v91.1_045759_189_458 |
8602 rows × 6 columns
De-log the peptide abundances, sum the abundances for each razor protein and log transform the rolled up protein abundances.
#de-log abundance
forward_peptides['StatMomentsAreaNorm']=2**forward_peptides['StatMomentsAreaLog+Norm']
forward_peptides.drop('StatMomentsAreaLog+Norm',axis=1,inplace=True)
#sum abundance for each sorted_list protein in each dataset
prot_info = forward_peptides.columns[~forward_peptides.columns.isin(['StatMomentsAreaNorm','Peptide Sequence with Mods'])].tolist()
protein_abundances = forward_peptides.groupby(prot_info)['StatMomentsAreaNorm'].sum().reset_index()
#re-log2 tranform abundance
protein_abundances['StatMomentsAreaLog+Norm']=np.log2(protein_abundances['StatMomentsAreaNorm'])
protein_abundances.drop('StatMomentsAreaNorm',axis=1,inplace=True)
del prot_info, forward_peptides
protein_abundances
| processed_DO_id | product | product_source | razor | StatMomentsAreaLog+Norm | |
|---|---|---|---|---|---|
| 0 | nmdc:dobj-11-2k6s1505 | DNA-binding protein HU-beta, DNA-binding prote... | KO:K03530, KO:K03530 | nmdc:wfmgan-11-8zxjs882.1_040719_104_331, nmdc... | 25.187616 |
| 1 | nmdc:dobj-11-2k6s1505 | DNA-binding transcriptional LysR family regula... | COG0583, COG0583, COG0583, COG0583 | nmdc:wfmgan-11-21ms6426.1_010178_1_894, nmdc:w... | 26.332428 |
| 2 | nmdc:dobj-11-2k6s1505 | DNA-directed RNA polymerase subunit beta' | KO:K03046 | nmdc:wfmgan-11-yjeh0m30.1_003176_3_1979 | 27.042324 |
| ... | ... | ... | ... | ... | ... |
| 6450 | nmdc:dobj-11-zn9eyg46 | peptidoglycan/xylan/chitin deacetylase (PgdA/C... | COG0726, COG0726, COG0726 | nmdc:wfmgan-11-6pq4bd13.1_001294_1_411, nmdc:w... | 25.123978 |
| 6451 | nmdc:dobj-11-zn9eyg46 | photosystem II CP43 chlorophyll apoprotein, ph... | KO:K02705, KO:K02705, KO:K02705 | nmdc:wfmgan-11-hyze3t58.1_07201_3_464, nmdc:wf... | 28.599485 |
| 6452 | nmdc:dobj-11-zn9eyg46 | predicted porin | COG3203 | nmdc:wfmgan-11-hyze3t58.1_02099_286_783 | 29.142025 |
6453 rows × 5 columns
Final aggregated table of relative protein abundance¶
Reformat these results into a proteomic table, where each row indicates a protein and each column indicates a sample/dataset. The values within are log transformed, median normalized relative abundance values. This table or the longform version above can be used in further proteomic analyses.
#pivot to wide crosstab
aggregated_proteomic_output = protein_abundances.pivot(index='razor',columns='processed_DO_id',values='StatMomentsAreaLog+Norm')
aggregated_proteomic_output.columns.name = None
aggregated_proteomic_output
| nmdc:dobj-11-2k6s1505 | nmdc:dobj-11-2szyc560 | nmdc:dobj-11-301yh759 | nmdc:dobj-11-3378mg86 | nmdc:dobj-11-3ak2bc31 | nmdc:dobj-11-4bwzhs42 | nmdc:dobj-11-7psagy79 | nmdc:dobj-11-7r5yea49 | nmdc:dobj-11-7x21d450 | nmdc:dobj-11-bxyvx506 | ... | nmdc:dobj-11-tk9hc081 | nmdc:dobj-11-vjmd5h28 | nmdc:dobj-11-wcn69x37 | nmdc:dobj-11-xm1yjv87 | nmdc:dobj-11-y2aefn18 | nmdc:dobj-11-yrbs4v96 | nmdc:dobj-11-z47wnf07 | nmdc:dobj-11-z8hc0j30 | nmdc:dobj-11-zhz66007 | nmdc:dobj-11-zn9eyg46 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| razor | |||||||||||||||||||||
| nmdc:wfmgan-11-21ms6426.1_000036_702_1004 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| nmdc:wfmgan-11-21ms6426.1_000053_3_563, nmdc:wfmgan-11-p3dkkz54.1_008013_251_1072 | NaN | NaN | NaN | NaN | NaN | NaN | 30.499817 | NaN | NaN | NaN | ... | 29.083049 | 32.948118 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| nmdc:wfmgan-11-21ms6426.1_000057_4146_5369, nmdc:wfmgan-11-p3dkkz54.1_002120_389_1612, nmdc:wfmgan-11-z0raab34.1_009908_3_737 | NaN | 30.067875 | 28.391404 | 28.935731 | NaN | NaN | 30.458277 | NaN | NaN | 28.7232 | ... | NaN | 30.467756 | 26.462791 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| nmdc:wfmgan-11-z0raab34.1_148395_2_283 | NaN | 31.373664 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| nmdc:wfmgan-11-z0raab34.1_160266_3_278 | NaN | 27.750197 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| nmdc:wfmgan-11-z0raab34.1_167280_3_278 | NaN | 24.717223 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3442 rows × 31 columns
The generated protein table can be used as input to the software pmartR, which performs statistical analyses such as ANOVA and independence of missing data (IMD) tests. In this case, the aggregated proteomics table (aggregated_proteomic_output) would be equivalent to pmartR's e_data and the peptide to protein to gene mappings (razor_mapping) would be equivalent to pmartR's e_meta.

pmartR requires sample metadata to parameterize analyses and interpret the data. For this example dataset, an API call will capture the biosample metadata (sample_metadata) that would be equivalent to pmartR's f_data.
Gather biosample metadata via the nmdc_api_utilites BiosampleSeach object, using the function get_record_by_filter() in biosample_ids associated with each output in proteomic_output_df.
from nmdc_api_utilities.biosample_search import BiosampleSearch
# since we are querying the biosample collection, we need to create an instance of it
bs_client = BiosampleSearch(env=ENV)
# create a list of lists of ids to query the BiosampleSearch object
chunked_list = split_list(proteomic_output_df['biosample_id'].unique().tolist())
biosamples = []
# loop through the chunked list of ids
for chunk in chunked_list:
# create the filter - query the BiosampleSearch object looking for records where the id field is in the chunk of ids from proteomic_output_df
filter_list = dp_client._string_mongo_list(chunk)
filter = f'{{"id": {{"$in": {filter_list}}}}}'
# get the results
biosamples += bs_client.get_record_by_filter(filter=filter,fields="id,depth.has_numeric_value", max_page_size=100, all_pages=True)
#normalize the json columns
biosamples = pd.json_normalize(biosamples).drop_duplicates().rename(columns={'id':'biosample_id'})
#merge biosample and processed data object information
sample_metadata = biosamples.merge(proteomic_output_df[['processed_DO_id','biosample_id']],on='biosample_id')
del biosamples
sample_metadata
| biosample_id | depth.has_numeric_value | processed_DO_id | |
|---|---|---|---|
| 0 | nmdc:bsm-13-0jw5n594 | 0.0 | nmdc:dobj-11-pxnhdj84 |
| 1 | nmdc:bsm-13-13145k83 | 0.2 | nmdc:dobj-11-xm1yjv87 |
| 2 | nmdc:bsm-13-1p0tct86 | 0.2 | nmdc:dobj-11-bxyvx506 |
| ... | ... | ... | ... |
| 28 | nmdc:bsm-13-xvf06c81 | 0.4 | nmdc:dobj-11-m808mj05 |
| 29 | nmdc:bsm-13-yjp2dx65 | 0.1 | nmdc:dobj-11-301yh759 |
| 30 | nmdc:bsm-13-zms1jq91 | 0.5 | nmdc:dobj-11-sx7cyr58 |
31 rows × 3 columns