Proteomic Data Aggregation and Visualization¶

This notebook demonstrates aggregation of proteomic data via the National Microbiome Data Collaborative (NMDC)'s Runtime API. It uses the nmdc_api_utilities package (as of March 2025) to do this exploration, relying on nmdc_api_utilities objects to simplify NMDC API requests and on utility functions to help with data processing tasks. More information about the package can be found here. It highlights how the NMDC's schema can be used to overcome some of the numerous challenges associated with this type of aggregation. Please note that this notebook is intended for individuals with experience performing mass spectrometry-based proteomic analyses and that various parameter and processing choices were made for this example use case. They are not broadly applicable and should be adjusted as needed.

Notebook Steps:

  1. Assess background information and collect datasets for an example study of riverbed sediment along the Columbia River

  2. Apply a spectral probability filter across the data that optimizes the number of identifications for an FDR of 0.05

  3. Collapse to unique peptides and normalize quantification

  4. Extract functional gene annotations for proteins

  5. Generate annotation and protein mappings for peptides using "Razor" strategy

  6. Perform protein rollup using the "Razor" results and summarize into an aggregated table of relative protein abundance

Define a function to split a list into chunks¶

Since we will need to use a list of ids to query a new collection in the API, we need to limit the number of ids we put in each query. This function splits a list into chunks of 100 by default; the chunk_size can be adjusted as needed.

In [3]:
# Define a function to split ids into chunks
def split_list(input_list, chunk_size=100):
    result = []
    
    for i in range(0, len(input_list), chunk_size):
        result.append(input_list[i:i + chunk_size])
        
    return result
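For example, splitting a list of 250 ids produces two full chunks and one partial chunk:

```python
# Usage example for split_list (repeated here so the snippet is self-contained)
def split_list(input_list, chunk_size=100):
    result = []
    for i in range(0, len(input_list), chunk_size):
        result.append(input_list[i:i + chunk_size])
    return result

ids = [f"nmdc:dobj-{n}" for n in range(250)]
chunks = split_list(ids)
print([len(c) for c in chunks])  # [100, 100, 50]
```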

Import libraries and Python scripts containing the functions necessary to run this notebook. 'aggregation_functions.py' (also in this folder) includes spectral probability filtering and protein mapping functions. nmdc_api_utilities includes functions for API traversal of the collections endpoint.

In [5]:
import requests
from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sys
from IPython.display import Image, display
import nmdc_api_utilities

if 'google.colab' in sys.modules:
    # module in this folder with specific protein aggregation functions
    !wget https://raw.githubusercontent.com/microbiomedata/nmdc_notebooks/refs/heads/main/proteomic_aggregation/python/aggregation_functions.py
    import aggregation_functions as agg_func

    # pmartR logo
    !wget https://raw.githubusercontent.com/microbiomedata/nmdc_notebooks/refs/heads/main/proteomic_aggregation/pmartR_logo_final.jpg
else:
    # module in this folder with specific protein aggregation functions
    import aggregation_functions as agg_func

1) Assess background information and collect data for an example study of riverbed sediment along the Columbia River¶

Review the example study on the NMDC data portal. Use the study id embedded in the url (nmdc:sty-11-aygzgv51) to collect all related data objects via the nmdc_api_utilities package and reformat the JSON output into a pandas dataframe. These data objects reference both input files (e.g. raw, gff) and output files (e.g. metaproteomic results) to the NMDC workflows. To get the data objects, we will create a DataObjectSearch object and use the get_data_objects_for_studies() function.

In [6]:
pd.set_option("display.max_rows", 6)
from nmdc_api_utilities.data_object_search import DataObjectSearch
do_client = DataObjectSearch(env=ENV)
#get all data objects associated with this study id using the NMDC find endpoint
data = do_client.get_data_objects_for_studies(study_id='nmdc:sty-11-aygzgv51')
data = pd.DataFrame(data)

#reformat data into dataframe (keeping biosample id)
data_objects=[]
for index, row in data.iterrows():
    bio_id = row['biosample_id']
    row_out = pd.json_normalize(row['data_objects'])
    row_out['biosample_id'] = bio_id
    data_objects.append(row_out)

data_objects = pd.concat(data_objects)[['id','name','file_size_bytes','data_object_type','md5_checksum','url','biosample_id','in_manifest','was_generated_by']]
display(data_objects)

del data, index, row, row_out, bio_id
id name file_size_bytes data_object_type md5_checksum url biosample_id in_manifest was_generated_by
0 nmdc:dobj-11-ct23bs36 nmdc_dobj-11-4kwfhk83_nmdc_dobj-11-haabxj14_ms... 5.949491e+06 Unfiltered Metaproteomics Results 97b010e957351b258d07759f56d48456 https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... nmdc:bsm-13-0jw5n594 NaN nmdc:wfmp-11-xdxeh614.2
1 nmdc:dobj-13-k2d53r96 10533.1.165310.TCCGAGT-AACTCGG.fastq.gz 2.168673e+09 Metagenome Raw Reads NaN NaN nmdc:bsm-13-0jw5n594 NaN NaN
2 nmdc:dobj-11-wr2qfy21 nmdc_dobj-11-4kwfhk83_nmdc_dobj-11-haabxj14_Pr... 1.450400e+04 Protein Report f85ee1a8ed1816d6633c7e7571cb873e https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... nmdc:bsm-13-0jw5n594 [nmdc:manif-11-7796sg87] NaN
... ... ... ... ... ... ... ... ... ...
48 nmdc:dobj-11-b78dj613 nmdc_wfmgan-11-pt5b7p34.1_structural_annotatio... 2.500742e+07 Structural Annotation GFF f22df4d08b05ac00b159ed7796a0f25f https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-w2cwcx50 NaN NaN
49 nmdc:dobj-11-7ysdr517 nmdc_wfmgan-11-pt5b7p34.1_crt.crisprs 1.291200e+04 Crispr Terms 98886c59028bdf77320d9ddddc27562f https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-w2cwcx50 NaN NaN
50 nmdc:dobj-11-b5enyc41 nmdc_wfmag-11-8gtd3323.1_gtdbtk.bac122.summary... 5.000000e+01 GTDBTK Bacterial Summary 3e9330de1faaeb1dec47b5c8663c54fb https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-w2cwcx50 NaN nmdc:wfmag-11-8gtd3323.1

4330 rows × 9 columns

Subset the data objects to 'Unfiltered Metaproteomics Results'. These files contain the proteomic workflow outputs that will be used for proteomic aggregation.

In [7]:
proteomic_output_df = data_objects[data_objects['data_object_type']=='Unfiltered Metaproteomics Results'].reset_index(drop=True).rename(columns={'id':'processed_DO_id'})
#how many result files exist for each biosample?
display(proteomic_output_df.groupby('biosample_id')['data_object_type'].value_counts().reset_index())
biosample_id data_object_type count
0 nmdc:bsm-13-0jw5n594 Unfiltered Metaproteomics Results 3
1 nmdc:bsm-13-13145k83 Unfiltered Metaproteomics Results 3
2 nmdc:bsm-13-1p0tct86 Unfiltered Metaproteomics Results 3
... ... ... ...
30 nmdc:bsm-13-xvf06c81 Unfiltered Metaproteomics Results 3
31 nmdc:bsm-13-yjp2dx65 Unfiltered Metaproteomics Results 3
32 nmdc:bsm-13-zms1jq91 Unfiltered Metaproteomics Results 3

33 rows × 3 columns

Since there is more than one result file for each biosample, choose a workflow version to use in the analysis.

In [8]:
#Instantiate instances for nmdc_api_utilities
from nmdc_api_utilities.workflow_execution_search import WorkflowExecutionSearch
from nmdc_api_utilities.data_processing import DataProcessing
# since we are querying the workflow collection, we need to create an instance of it
we_client = WorkflowExecutionSearch(env=ENV)
# create an instance of the data processing object
dp_client = DataProcessing()

# create a list of lists of processed proteomic result ids to query the workflow execution objects
chunked_list = split_list(proteomic_output_df['processed_DO_id'].unique().tolist())
version_output = []
# loop through the chunked list of ids
for chunk in chunked_list:
    # create the filter - query the WorkflowExecutionSearch object looking for records
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"has_output": {{"$in": {filter_list}}}}}'
    # get the results
    version_output += we_client.get_record_by_filter(filter=filter,fields="has_output,version,has_input", max_page_size=100, all_pages=True)

version_output = pd.json_normalize(version_output)

version_output = version_output[version_output['version'].str.contains('v1.2.1',na=False)]

version_output
Out[8]:
id has_input has_output version
7 nmdc:wfmp-11-j2tktj20.1 [nmdc:dobj-11-ms76kj12, nmdc:dobj-11-dyq78w56,... [nmdc:dobj-11-2k6s1505, nmdc:dobj-11-m6bks270,... v1.2.1
8 nmdc:wfmp-11-3z777070.1 [nmdc:dobj-11-tezmee57, nmdc:dobj-11-2yvsr009,... [nmdc:dobj-11-2szyc560, nmdc:dobj-11-d6yc9e02,... v1.2.1
9 nmdc:wfmp-11-7h4m6t36.1 [nmdc:dobj-11-zd25ft61, nmdc:dobj-11-92qp1w39,... [nmdc:dobj-11-301yh759, nmdc:dobj-11-v0780021,... v1.2.1
... ... ... ... ...
92 nmdc:wfmp-11-k5qcay19.1 [nmdc:dobj-11-hp2vmj63, nmdc:dobj-11-p55wx268,... [nmdc:dobj-11-z8hc0j30, nmdc:dobj-11-55vqpg72,... v1.2.1
93 nmdc:wfmp-11-f6sfn088.1 [nmdc:dobj-11-mzxj8743, nmdc:dobj-11-nba9wh55,... [nmdc:dobj-11-zhz66007, nmdc:dobj-11-ppng2124,... v1.2.1
95 nmdc:wfmp-11-xm389s70.1 [nmdc:dobj-11-6n49rm30, nmdc:dobj-11-ysn0es28,... [nmdc:dobj-11-zn9eyg46, nmdc:dobj-11-pp5wnn85,... v1.2.1

31 rows × 4 columns

We will use the workflow's has_input later to get the Functional Annotation files, but here we only need has_output to get the correct version of our "Unfiltered Metaproteomics Results".

In [9]:
# filter proteomic_output_df results to those from workflow version v1.2.1
workflow_outputs = version_output.drop(columns='has_input').rename(columns={'id':'workflow_id','version':'workflow_version'}).explode('has_output').drop_duplicates()
proteomic_output_df = proteomic_output_df.merge(workflow_outputs, right_on = 'has_output',left_on='processed_DO_id').drop('has_output',axis=1)

proteomic_output_df
Out[9]:
processed_DO_id name file_size_bytes data_object_type md5_checksum url biosample_id in_manifest was_generated_by workflow_id workflow_version
0 nmdc:dobj-11-pxnhdj84 nmdc_dobj-11-4kwfhk83_nmdc_dobj-11-haabxj14_ms... 5949491.0 Unfiltered Metaproteomics Results 0b72c4481cd4664e43be04aa5fd9095c https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... nmdc:bsm-13-0jw5n594 [nmdc:manif-11-7796sg87] NaN nmdc:wfmp-11-xdxeh614.1 v1.2.1
1 nmdc:dobj-11-z8hc0j30 nmdc_dobj-11-hp2vmj63_nmdc_dobj-11-716e7y35_ms... 8376109.0 Unfiltered Metaproteomics Results 8785b9538d977a50fd35d11bfcb037ce https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... nmdc:bsm-13-2fw5j754 [nmdc:manif-11-7796sg87] NaN nmdc:wfmp-11-k5qcay19.1 v1.2.1
2 nmdc:dobj-11-k3ebb695 nmdc_dobj-11-mz0d0076_nmdc_dobj-11-sf7dpr43_ms... 4816449.0 Unfiltered Metaproteomics Results 5b5bc1250b70d0db72708b1747c5a9ec https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... nmdc:bsm-13-5bfhqy53 [nmdc:manif-11-7796sg87] NaN nmdc:wfmp-11-v5vs4w20.1 v1.2.1
... ... ... ... ... ... ... ... ... ... ... ...
28 nmdc:dobj-11-7r5yea49 nmdc_dobj-11-qtxceg35_nmdc_dobj-11-feg1eb37_ms... 8003555.0 Unfiltered Metaproteomics Results b00a218d2b2ac4aa9467f9d51c5899de https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... nmdc:bsm-13-peafgc08 [nmdc:manif-11-7796sg87] NaN nmdc:wfmp-11-cx1srs63.1 v1.2.1
29 nmdc:dobj-11-zn9eyg46 nmdc_dobj-11-6n49rm30_nmdc_dobj-11-nb9f8z34_ms... 2321178.0 Unfiltered Metaproteomics Results 8eb29a562832f1a8862a5a643e297389 https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... nmdc:bsm-13-sepcxd06 [nmdc:manif-11-7796sg87] NaN nmdc:wfmp-11-xm389s70.1 v1.2.1
30 nmdc:dobj-11-301yh759 nmdc_dobj-11-zd25ft61_nmdc_dobj-11-vek6s281_ms... 10756563.0 Unfiltered Metaproteomics Results c82e8026a2d974cb4f9a75b6eef5c62b https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... nmdc:bsm-13-yjp2dx65 [nmdc:manif-11-7796sg87] NaN nmdc:wfmp-11-7h4m6t36.1 v1.2.1

31 rows × 11 columns

There are various requirements that enable mass spectrometry runs to be aggregated and analyzed together. For example, runs need to be performed in succession, on the same instrument. The NMDC schema can make it easier to find these proteomic results by linking them via a slot called in_manifest.

Look at the in_manifest id on these proteomic outputs to confirm that all runs are in the same manifest record, and pull that record. If that manifest record's manifest_category value is 'instrument_run', then it confirms that these are LC-MS/MS runs that were performed in succession on the same instrument. Proteomic outputs from different manifest records should not be aggregated.

Using the nmdc_api_utilities package, we create a ManifestSearch object, create the needed filter, and use the get_record_by_filter() function to get records.

In [10]:
from nmdc_api_utilities.manifest_search import ManifestSearch
#extract in_manifest ids for this study (in this case a single one available)
manifest_id = proteomic_output_df.explode('in_manifest')['in_manifest'].unique()
# since we are querying the manifest collection, we need to create an instance of it
m_client = ManifestSearch(env=ENV)
# create a list of lists of ids to query the ManifestSearch object
chunked_list = split_list(proteomic_output_df['in_manifest'].explode().unique().tolist())
manifest = []
# loop through the chunked list of ids
for chunk in chunked_list:
    # create the filter - query the ManifestSearch object looking for records where the id field is in the chunk of ids from proteomic_output_df
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"id": {{"$in": {filter_list}}}}}'
    # get the results
    manifest += m_client.get_record_by_filter(filter=filter, max_page_size=100, all_pages=True)

display(manifest)
[{'id': 'nmdc:manif-11-7796sg87',
  'manifest_category': 'instrument_run',
  'type': 'nmdc:Manifest',
  'description': 'collection of metaproteomic analyses from the same instrument run nmdc:sty-11-aygzgv51'}]

Look at an example of the information in 'Unfiltered Metaproteomics Results', which contains peptide identification and relative abundance information.

In [11]:
#example unfiltered results 
unfilt_results = proteomic_output_df.iloc[0]["url"]
print(unfilt_results)
display(agg_func.tsv_extract(unfilt_results))

del unfilt_results
https://nmdcdemo.emsl.pnnl.gov/proteomics/results/nmdc_dobj-11-4kwfhk83_nmdc_dobj-11-haabxj14_msgfplus_syn_PlusSICStats.txt
ResultID Scan FragMethod SpecIndex Charge PrecursorMZ DelM DelM_PPM MH Peptide ... PeakMaxIntensity PeakSignalToNoiseRatio FWHMInScans PeakArea ParentIonIntensity ParentIonMZ StatMomentsArea PeakScanStart PeakScanEnd PeakWidthMinutes
0 1 40580 HCD 1 2 871.37396 0.03180 18.27841 1740.705490 R.GSSSGGGYSSGSSSYGSGGR.Q ... 554184.0 40.130 26 1.600100e+07 544336.0 871.37 1.703900e+07 40523 40650 0
1 2 40920 HCD 2 3 721.68713 -0.00951 -4.39777 2163.056312 R.LGEDNINVVEGNEQFISASK.S ... 3663900.0 15.850 175 4.756700e+08 2424600.0 721.69 4.050300e+08 40672 41092 0
2 3 41460 HCD 3 2 1082.02820 -0.00732 -3.38447 2163.056312 R.LGEDNINVVEGNEQFISASK.S ... 9434500.0 97.190 117 1.130600e+09 1766900.0 1082.03 1.024900e+09 41359 41933 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25312 25313 33683 HCD 19762 2 768.35883 0.00785 5.11714 1535.702530 R.RQWWEDYEPAR.A ... 648678.0 10.270 53 2.078300e+07 187764.0 768.36 4.061700e+06 33613 33815 0
25313 25314 17825 HCD 19763 2 555.74408 0.00061 0.54733 1111.483631 -.NGQQECFTGK.W ... 1287000.0 28.720 39 6.254500e+07 909122.0 555.74 6.147000e+07 17776 17844 0
25314 25315 28560 HCD 19764 2 712.34711 -0.02725 -19.13751 1424.717465 R.RTRRDYAECVR.E ... 738629.0 7.464 612 2.873400e+08 224858.0 712.35 1.981300e+08 27804 28782 0

25315 rows × 36 columns

Extract information from all 31 proteomic results via the function iterate_file_extract() in agg_func and combine them into a single dataframe, where each scan in each dataset has the unique identifier SpecID. Clean the prefix and suffix off of each peptide sequence. Since this data was processed using a target-decoy approach, determine the type of protein being matched to each peptide: contaminant, reverse (a false-positive match to the reversed amino acid sequence of a protein), or forward (a match to the true, forward amino acid sequence of a protein). The presence of forward and reverse matches enables FDR estimation in the next step.

In [12]:
unfilt_res = agg_func.iterate_file_extract(identifier_col='processed_DO_id',
                                url_col='url',
                                extract_cols=['Charge','Scan','Peptide','Protein','MSGFDB_SpecEValue','StatMomentsArea'],
                                pd_df=proteomic_output_df,
                                filter_col = None,
                                filter_values = None,
                                file_type='tsv'
                                )


#create identifier for each scan in each dataset
unfilt_res["SpecID"] = unfilt_res.apply(lambda row: str(row["id_col"]) + "_" + str(row["Scan"]), axis=1)

#clean off the prefix and suffix from the sequence but keep any mods
unfilt_res["Peptide Sequence with Mods"] = unfilt_res["Peptide"].apply(agg_func.sequence_noprefsuff)
del unfilt_res['Peptide']

#determine protein type (contaminant, reverse, forward)
unfilt_res["Protein_Type"] = unfilt_res["Protein"].apply(agg_func.findproteinname)

unfilt_res
Out[12]:
Charge Scan Protein MSGFDB_SpecEValue StatMomentsArea id_col SpecID Peptide Sequence with Mods Protein_Type
0 2 40580 Contaminant_K22E_HUMAN 1.585800e-23 1.703900e+07 nmdc:dobj-11-pxnhdj84 nmdc:dobj-11-pxnhdj84_40580 GSSSGGGYSSGSSSYGSGGR None
1 3 40920 Contaminant_TRYP_BOVIN 5.821000e-23 4.050300e+08 nmdc:dobj-11-pxnhdj84 nmdc:dobj-11-pxnhdj84_40920 LGEDNINVVEGNEQFISASK None
2 2 41460 Contaminant_TRYP_BOVIN 6.047300e-23 1.024900e+09 nmdc:dobj-11-pxnhdj84 nmdc:dobj-11-pxnhdj84_41460 LGEDNINVVEGNEQFISASK None
... ... ... ... ... ... ... ... ... ...
945188 2 21829 nmdc:wfmgan-11-yjeh0m30.1_017408_523_909 4.998200e-07 4.386500e+07 nmdc:dobj-11-301yh759 nmdc:dobj-11-301yh759_21829 EALEAASGQDLR Forward
945189 3 22940 XXX_nmdc:wfmgan-11-yjeh0m30.1_034034_1_159 4.998200e-07 1.295000e+08 nmdc:dobj-11-301yh759 nmdc:dobj-11-301yh759_22940 GYVREVTEVSREPGFCEEAR Reversed
945190 3 29989 XXX_nmdc:wfmgan-11-yjeh0m30.1_019451_1_858 4.999800e-07 1.587500e+08 nmdc:dobj-11-301yh759 nmdc:dobj-11-301yh759_29989 GEGHMPIATSPRVWGYLQELEGR Reversed

945191 rows × 9 columns
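The helpers sequence_noprefsuff and findproteinname live in aggregation_functions.py. Based on the inputs and outputs shown above, minimal hypothetical sketches of their behavior might look like this (the real implementations may differ):

```python
# Hypothetical sketches of the agg_func helpers used above;
# the real implementations live in aggregation_functions.py.

def sequence_noprefsuff(peptide):
    """Strip flanking residues, e.g. 'R.GSSSGGGYSSGSSSYGSGGR.Q' ->
    'GSSSGGGYSSGSSSYGSGGR', while keeping any modification text (which
    may itself contain dots) inside the sequence."""
    parts = peptide.split(".")
    return ".".join(parts[1:-1]) if len(parts) >= 3 else peptide

def findproteinname(protein):
    """Classify a protein match, mirroring the Protein_Type output above."""
    if protein.startswith("Contaminant_"):
        return None          # contaminants are left untyped in the output
    if protein.startswith("XXX_"):
        return "Reversed"    # decoy (reversed-sequence) match
    return "Forward"         # match to a true, forward protein sequence

print(sequence_noprefsuff("R.GSSSGGGYSSGSSSYGSGGR.Q"))  # GSSSGGGYSSGSSSYGSGGR
print(findproteinname("XXX_nmdc:wfmgan-11-yjeh0m30.1_034034_1_159"))  # Reversed
```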

2) Apply a spectral probability filter across the data that optimizes the number of identifications for an FDR of 0.05¶

A challenge associated with aggregating mass spectrometry data is that there are always false identifications, which can be mitigated by imposing a spectral probability filter on the data being analyzed. The same spectral probability filter needs to be applied across datasets when they are being compared. The filter value itself is chosen by weighing the number of 'true' identifications retained against the proximity of the data to a chosen false discovery rate (FDR) (usually 0.05 or 0.01). NMDC's metaproteomic workflow provides 'true' and 'false' identifications for FDR estimation in the 'Unfiltered Metaproteomics Results' files.
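The target-decoy FDR formulas used later in this notebook (spectral FDR = 2R/(F+R), peptide FDR = R/(F+R), where F and R are forward and reverse counts) can be expressed as small standalone helpers:

```python
def spectral_fdr(n_forward, n_reverse):
    # decoys approximate half of all false matches, so the number of
    # false forward matches is estimated by the number of reverse matches
    if n_forward == 0 and n_reverse == 0:
        return 1
    return (2 * n_reverse) / (n_forward + n_reverse)

def peptide_fdr(n_forward, n_reverse):
    if n_forward == 0 and n_reverse == 0:
        return 1
    return n_reverse / (n_forward + n_reverse)

print(spectral_fdr(1000, 25))  # ~0.0488, just under a 0.05 FDR target
```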

Create a dataframe of peptide identifications (ignoring protein mapping). Filter identifications to the peptide sequence with the smallest SpecEValue for each SpecID, so there is a single, highest probability identification for each scan.

In [13]:
edata = unfilt_res[['SpecID','Peptide Sequence with Mods','MSGFDB_SpecEValue','Protein_Type','StatMomentsArea']].drop_duplicates()  # important to remove any duplicated rows here!

#for each SpecID, select the peptide spectrum match with the smallest MSGFDB_SpecEValue (.idxmin() takes the first entry when there's multiple matches)
idx = edata.groupby(['SpecID'])['MSGFDB_SpecEValue'].idxmin()
edata = edata.loc[idx].reset_index(drop=True)
del idx

display(edata)

assert len(edata['SpecID'].unique())==edata.shape[0], "still more than one identification per scan"
SpecID Peptide Sequence with Mods MSGFDB_SpecEValue Protein_Type StatMomentsArea
0 nmdc:dobj-11-2k6s1505_10005 PPDERERSEEAEKRDEERDRVRDELLAGAEEGEPR 4.994300e-07 Forward 1.238500e+09
1 nmdc:dobj-11-2k6s1505_10008 LRAGSEPR 1.172600e-07 Forward 1.234600e+08
2 nmdc:dobj-11-2k6s1505_10009 WAKEIENQK 2.473900e-07 Forward 5.731600e+07
... ... ... ... ... ...
797680 nmdc:dobj-11-zn9eyg46_9976 CRDKWYQMAEGK 4.110300e-07 Reversed 6.141800e+06
797681 nmdc:dobj-11-zn9eyg46_9979 WWQQINK 4.622300e-08 Reversed 4.143000e+07
797682 nmdc:dobj-11-zn9eyg46_9980 RFEECERECK 1.126600e-07 Forward 2.284500e+06

797683 rows × 5 columns

Create separate dataframes of forward and reverse peptide spectrum matches.

In [14]:
forward_peptides = edata[edata["Protein_Type"] == "Forward"].copy().reset_index(drop=True)
del forward_peptides["Protein_Type"]
display(forward_peptides)
SpecID Peptide Sequence with Mods MSGFDB_SpecEValue StatMomentsArea
0 nmdc:dobj-11-2k6s1505_10005 PPDERERSEEAEKRDEERDRVRDELLAGAEEGEPR 4.994300e-07 1.238500e+09
1 nmdc:dobj-11-2k6s1505_10008 LRAGSEPR 1.172600e-07 1.234600e+08
2 nmdc:dobj-11-2k6s1505_10009 WAKEIENQK 2.473900e-07 5.731600e+07
... ... ... ... ...
401082 nmdc:dobj-11-zn9eyg46_9959 DAVSGRHHAQGR 6.589700e-09 1.566600e+07
401083 nmdc:dobj-11-zn9eyg46_9967 FYRFSGR 1.080500e-07 3.418000e+08
401084 nmdc:dobj-11-zn9eyg46_9980 RFEECERECK 1.126600e-07 2.284500e+06

401085 rows × 4 columns

In [15]:
reversed_peptides = edata[edata["Protein_Type"] == "Reversed"].copy().reset_index(drop=True)
del reversed_peptides["Protein_Type"]
del edata
display(reversed_peptides)
SpecID Peptide Sequence with Mods MSGFDB_SpecEValue StatMomentsArea
0 nmdc:dobj-11-2k6s1505_10013 EFVDIISYMENENHSDIEYPLLYKWDSKSTVINR 1.021500e-07 7.371400e+08
1 nmdc:dobj-11-2k6s1505_10019 QWHPNFLR 3.448500e-07 3.220300e+09
2 nmdc:dobj-11-2k6s1505_10026 LAEREGGAR 1.451200e-08 9.604800e+08
... ... ... ... ...
383749 nmdc:dobj-11-zn9eyg46_9969 TDTATHWHK 1.879300e-07 1.960000e+08
383750 nmdc:dobj-11-zn9eyg46_9976 CRDKWYQMAEGK 4.110300e-07 6.141800e+06
383751 nmdc:dobj-11-zn9eyg46_9979 WWQQINK 4.622300e-08 4.143000e+07

383752 rows × 4 columns

Use the function optimize_specFilt() in agg_func to find a log10 spectral probability filter that weighs the number of forward peptides retained with the proximity of the dataset to a 0.05 spectral FDR. Visualize the impact of the spectral probability filter by plotting the number of forward and reverse peptides retained.

The main plot below is a histogram of forward and reverse peptides across all spectral probability values. The inset within this plot depicts a subset of the smallest spectral probability values, with the red bar before the dashed line representing the estimated number of false identifications that will be included in this analysis.
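optimize_specFilt is defined in aggregation_functions.py. As a hypothetical sketch of the underlying idea (not the actual implementation), one could scan candidate log10 cutoffs and keep the one that retains the most forward peptide spectrum matches while holding the estimated spectral FDR at or below the target:

```python
# Hypothetical sketch of the cutoff search performed by
# agg_func.optimize_specFilt; the real objective may differ.
import numpy as np

def best_cutoff(fwd_evals, rev_evals, target_fdr=0.05):
    """Scan candidate log10 cutoffs; keep the one retaining the most
    forward PSMs with estimated spectral FDR <= target_fdr."""
    fwd, rev = np.asarray(fwd_evals), np.asarray(rev_evals)
    best_cut, best_f = None, -1
    for log10_cut in np.linspace(-20, -6, 200):
        cut = 10.0 ** log10_cut
        f = int((fwd < cut).sum())
        r = int((rev < cut).sum())
        if f + r == 0:
            continue
        fdr = 2 * r / (f + r)
        if fdr <= target_fdr and f > best_f:
            best_cut, best_f = log10_cut, f
    return best_cut

# synthetic data: 100 confident forward PSMs, plus 50 forward and
# 50 reverse PSMs at a weaker score
cut = best_cutoff([1e-15] * 100 + [1e-8] * 50, [1e-8] * 50)
print(cut)  # a log10 cutoff between -15 and -8
```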

In [16]:
#initial guess at a log10 spectral probability filter value
initial_specprob_filter = -15

#perform optimization
optimization = agg_func.optimize_specFilt(initial_specprob_filter,forward_peptides,reversed_peptides)

#visualize filter
fitted_params=optimization.x
fig, ax_main = plt.subplots()
ax_inset = plt.axes([0.45, 0.45, 0.35, 0.35])


# Main plot
# forward peptides
hist, bins = np.histogram(forward_peptides["MSGFDB_SpecEValue"], bins=50)
ax_main.bar(bins[:-1], hist, width=np.diff(bins), align='edge', color='g', label='forward')

# reverse peptides
hist, bins = np.histogram(reversed_peptides["MSGFDB_SpecEValue"], bins=50)
ax_main.bar(bins[:-1], hist, width=np.diff(bins), align='edge',color='r', alpha=0.7, label='reverse')

# filter cutoff
ax_main.axvline(x=10 ** fitted_params[0], color="black", label = 'filter cutoff', linestyle="--")
ax_main.set_ylabel('number of peptide spectrum matches')
ax_main.set_xlabel('MSGFDB_SpecEValue')
ax_main.ticklabel_format(style='plain', axis='x')
ax_main.set_title(f'impact of spectral probability filter {10 ** fitted_params[0]}')
ax_main.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=2)


# Inset plot
# forward peptides
hist, bins = np.histogram(forward_peptides[forward_peptides["MSGFDB_SpecEValue"] < 0.000000002]["MSGFDB_SpecEValue"], bins=18)
ax_inset.bar(bins[:-1], hist, width=np.diff(bins), align='edge', color='g')

# reverse peptides
hist, bins = np.histogram(reversed_peptides[reversed_peptides["MSGFDB_SpecEValue"] < 0.000000002]["MSGFDB_SpecEValue"], bins=18)
ax_inset.bar(bins[:-1], hist, width=np.diff(bins), align='edge',color='r', alpha=0.7)

# filter cutoff
ax_inset.axvline(x=10 ** fitted_params[0], color="black", label = 'filter cutoff', linestyle="--")
ax_inset.set_xlabel('MSGFDB_SpecEValue')

plt.show()

[figure: histograms of forward (green) and reverse (red) peptide spectrum matches, with the spectral probability filter cutoff marked by a dashed line; the inset zooms in on the smallest MSGFDB_SpecEValue values]

Apply the filter to the dataset and recalculate peptide and spectral FDR.

In [17]:
#Filter the data according to the filter and recalculate FDR
forward_peptides = forward_peptides[
    (forward_peptides["MSGFDB_SpecEValue"] < 10 ** optimization.x[0])
].copy()

reversed_peptides = reversed_peptides[
    (reversed_peptides["MSGFDB_SpecEValue"] < 10 ** optimization.x[0])
].copy()

# Calculate FDR
f_spec = (forward_peptides["SpecID"].unique().size)
r_spec = reversed_peptides["SpecID"].unique().size
if (f_spec == 0) & (r_spec == 0):
    fdr_spec = 1
else:
    fdr_spec = (2*r_spec) / (f_spec + r_spec)

f_pep = forward_peptides["Peptide Sequence with Mods"].unique().size
r_pep = reversed_peptides["Peptide Sequence with Mods"].unique().size
if (f_pep == 0) & (r_pep == 0):
    fdr_pep = 1
else:
    fdr_pep = (r_pep) / (f_pep + r_pep)

del f_spec, r_spec, f_pep, r_pep

print("Spectral FDR:",fdr_spec,"\nPeptide FDR:",fdr_pep)
Spectral FDR: 0.049787398531117126 
Peptide FDR: 0.05560704355885079

3) Collapse to unique peptides and normalize their relative abundance¶

At this point in the analysis the data has been filtered to only high probability peptide identifications, but more than one scan within a dataset can have the same peptide identification. This can be due to the peptide eluting into the mass spectrometer over the course of multiple scans or a peptide eluting as multiple charge states. Sum the relative abundance for peptide sequences detected more than once in a dataset, leaving a total relative abundance value for each peptide in each dataset.

In [18]:
#extract data set id
forward_peptides['processed_DO_id'] = forward_peptides['SpecID'].str.split('_').str[0]

#drop SpecID and spectral probability columns since no longer relevant
forward_peptides.drop(['SpecID','MSGFDB_SpecEValue'],axis=1, inplace=True)

#for each peptide sequence with mods, sum the abundances for all scans/identifications
forward_peptides = forward_peptides.groupby(['processed_DO_id','Peptide Sequence with Mods'])['StatMomentsArea'].sum().to_frame().reset_index()

display(forward_peptides)
processed_DO_id Peptide Sequence with Mods StatMomentsArea
0 nmdc:dobj-11-2k6s1505 AALAANSELMNK 115710000.0
1 nmdc:dobj-11-2k6s1505 AFTVDEMR 59240000.0
2 nmdc:dobj-11-2k6s1505 AGGSITSSGGVERNDVDYSAVLTK 5785600.0
... ... ... ...
9757 nmdc:dobj-11-zn9eyg46 VAELEATTAR 521284000.0
9758 nmdc:dobj-11-zn9eyg46 VAELEATTVR 20899000.0
9759 nmdc:dobj-11-zn9eyg46 WLWNYYQYYK 582031.0

9760 rows × 3 columns

Visualize the untransformed and un-normalized relative abundances.

In [19]:
#plot untransformed abundance values
untransformed_abundances_fig, ax = plt.subplots(figsize=(8,6))
sns.boxplot(x='processed_DO_id',y='StatMomentsArea',data=forward_peptides)
plt.ticklabel_format(style='plain', axis='y')
plt.xticks([])
plt.xlabel('sample')
plt.ylabel('Relative Peptide Abundance (Un-Normalized)')
Out[19]:
Text(0, 0.5, 'Relative Peptide Abundance (Un-Normalized)')
[figure: boxplot of un-normalized relative peptide abundances per sample]

Apply log2 transformation and median normalize peptide abundances.

In [20]:
#log2 transformation
forward_peptides['StatMomentsAreaLog2']=np.log2(forward_peptides['StatMomentsArea'])

# Calculate group-wise (sample wise) median
group_medians = forward_peptides.groupby('processed_DO_id')['StatMomentsAreaLog2'].median()

# Calculate data wide median
all_data_median=forward_peptides['StatMomentsAreaLog2'].median()

# Subtract sample wise median from each value within its group
forward_peptides['StatMomentsAreaLog+Norm'] = forward_peptides.apply(
    lambda row: row['StatMomentsAreaLog2'] - group_medians[row['processed_DO_id']], axis=1
)

#add back in a data wide median value to avoid negative abundances
forward_peptides['StatMomentsAreaLog+Norm'] = forward_peptides['StatMomentsAreaLog+Norm'] + all_data_median

transformed_abundances_fig, ax = plt.subplots(figsize=(8,6))
sns.boxplot(x='processed_DO_id',y='StatMomentsAreaLog+Norm',data=forward_peptides)
plt.xticks([])
plt.xlabel('sample')
plt.ylabel('Relative Peptide Abundance (Normalized)')

del ax, group_medians, all_data_median, forward_peptides['StatMomentsArea'], forward_peptides['StatMomentsAreaLog2'], reversed_peptides
[figure: boxplot of normalized relative peptide abundances per sample]
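As a sanity check on the median-centering logic above, a toy dataframe (hypothetical values) shows that after normalization every sample's median equals the data-wide median. The `.map`-based centering used here is a vectorized alternative to the row-wise apply above:

```python
import pandas as pd

# Toy frame mimicking forward_peptides after the log2 step
df = pd.DataFrame({
    "processed_DO_id": ["a", "a", "a", "b", "b", "b"],
    "StatMomentsAreaLog2": [20.0, 22.0, 24.0, 25.0, 27.0, 29.0],
})

group_medians = df.groupby("processed_DO_id")["StatMomentsAreaLog2"].median()
all_median = df["StatMomentsAreaLog2"].median()

# subtract each sample's median, then add back the data-wide median
df["norm"] = (df["StatMomentsAreaLog2"]
              - df["processed_DO_id"].map(group_medians)
              + all_median)

# after centering, every sample's median equals the data-wide median
print(df.groupby("processed_DO_id")["norm"].median().tolist())  # [24.5, 24.5]
```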

4) Extract functional gene annotations for proteins¶

Collect peptide to protein mapping information for the passing peptide sequences.

In [21]:
peptide_protein_mapping = pd.DataFrame(unfilt_res[unfilt_res['Peptide Sequence with Mods'].isin(forward_peptides['Peptide Sequence with Mods'])][['Peptide Sequence with Mods','Protein']].drop_duplicates()).reset_index(drop=True)
peptide_protein_mapping
Out[21]:
Peptide Sequence with Mods Protein
0 TDAGVNLGGTAVDGGGIR nmdc:wfmgan-11-bnzfrb88.1_150866_1_309
1 ETNPTLADIEQDSK nmdc:wfmgan-11-bnzfrb88.1_028542_140_466
2 DYGTGNVYEMR nmdc:wfmgan-11-bnzfrb88.1_129219_3_329
... ... ...
16091 ATGYSAADEVAK nmdc:wfmgan-11-yjeh0m30.1_040299_186_584
16092 FAGGAAYLAK nmdc:wfmgan-11-yjeh0m30.1_143527_1_312
16093 LGADYNPYASEGNR nmdc:wfmgan-11-yjeh0m30.1_052869_1_507

16094 rows × 2 columns

Annotation information for these proteins can be found in 'Functional Annotation GFF' files.

Since the version_output dataframe contains all workflow v1.2.1 objects associated with our study id, it also contains the relevant 'Functional Annotation GFF' files. Subset this dataframe to GFF files associated with the 31 biosample ids that have a proteomic output in proteomic_output_df.

In [22]:
# Get the GFF inputs for these workflow version results
workflow_inputs = version_output.drop(columns=["has_output","version"]).explode("has_input")['has_input']
gffs = data_objects[data_objects['data_object_type']=="Functional Annotation GFF"]
workflow_gffs = gffs[gffs['id'].isin(workflow_inputs)]
workflow_gffs
Out[22]:
id name file_size_bytes data_object_type md5_checksum url biosample_id in_manifest was_generated_by
55 nmdc:dobj-11-fdngms90 nmdc_wfmgan-11-bnzfrb88.1_functional_annotatio... 77411808.0 Functional Annotation GFF e4344ebdca641bd24709f89b8ff83dd3 https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-0jw5n594 NaN NaN
37 nmdc:dobj-11-p55wx268 nmdc_wfmgan-11-vcrp9p47.1_functional_annotatio... 36135504.0 Functional Annotation GFF 56ca7bb11c7a299af52bfb23d2bfefa0 https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-2fw5j754 NaN NaN
56 nmdc:dobj-11-wawgpk38 nmdc_wfmgan-11-awd7ke72.1_functional_annotatio... 30929811.0 Functional Annotation GFF e6c19dc9aec8d9f5d50e261a576b6e03 https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-5bfhqy53 NaN NaN
... ... ... ... ... ... ... ... ... ...
65 nmdc:dobj-11-tvxks221 nmdc_wfmgan-11-ddkkwt71.1_functional_annotatio... 45389988.0 Functional Annotation GFF ec48bde738b0b15a6503e78111905785 https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-peafgc08 NaN NaN
37 nmdc:dobj-11-ysn0es28 nmdc_wfmgan-11-hyze3t58.1_functional_annotatio... 25430869.0 Functional Annotation GFF 58f9110cb647026e81738c6d1e9e435a https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-sepcxd06 NaN NaN
46 nmdc:dobj-11-92qp1w39 nmdc_wfmgan-11-yjeh0m30.1_functional_annotatio... 76608329.0 Functional Annotation GFF cda53e77cb70328498abc29685baf61c https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-yjp2dx65 NaN NaN

31 rows × 9 columns

Preview the 'Functional Annotation GFF' files and determine a subset of gene annotation information that should be pulled from all 31 files.

In [23]:
#example annotation file
annotation_file=workflow_gffs.iloc[0]["url"]
print(annotation_file)
display(agg_func.gff_extract_features(annotation_file))

del annotation_file
https://data.microbiomedata.org/data/nmdc:omprc-13-e4rebn56/nmdc:wfmgan-11-bnzfrb88.1/nmdc_wfmgan-11-bnzfrb88.1_functional_annotation.gff
ID translation_table partial start_type product product_source cath_funfam cog pfam superfamily ... bound_moiety ncRNA_class intron_start intron_end number_of_repeats average_repeat_length median_repeat_length average_spacer_length median_spacer_length Parent
0 nmdc:wfmgan-11-bnzfrb88.1_000001_3_2729 11 5' Edge diaminopimelate decarboxylase/amino acid trans... COG0019/COG0531 2.40.37.10,3.20.20.10 COG0019,COG0531 PF00278,PF02784,PF13520 51419 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 nmdc:wfmgan-11-bnzfrb88.1_000001_2750_2869 11 NaN ATG hypothetical protein Hypo-rule applied NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 nmdc:wfmgan-11-bnzfrb88.1_000001_2910_3746 11 NaN ATG biotin carboxyl carrier protein COG0511 2.70.70.10 COG0511 PF01551 51230 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
272943 nmdc:wfmgan-11-bnzfrb88.1_208965_2_199 11 5',3' Edge F-type H+/Na+-transporting ATPase subunit alpha KO:K02111 1.20.150.20 COG0056 PF00306 47917 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
272944 nmdc:wfmgan-11-bnzfrb88.1_208966_2_199 11 5',3' Edge S-adenosyl-L-methionine hydrolase (adenosine-f... KO:K22205 3.40.50.10790 COG1912 PF01887 102522 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
272945 nmdc:wfmgan-11-bnzfrb88.1_208967_69_200 11 3' ATG DNA-binding NarL/FixJ family response regulator COG2197 3.40.50.2300 COG2197 NaN 52172 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

272946 rows × 33 columns

Extract information from all 31 annotation files (this takes a while to run) and merge with peptide_protein_mapping so there is a final table of peptide-protein-annotation mapping (annotation_mapping).

In [24]:
#get gene annotation information for proteins
genemapping = workflow_gffs[["id", "url"]].drop_duplicates().reset_index(drop=True)

genemapping = agg_func.iterate_file_extract(
    pd_df=genemapping,
    identifier_col='id',
    url_col='url',
    extract_cols=['ID','product','product_source'],
    filter_col='ID',
    filter_values=peptide_protein_mapping['Protein'].unique().tolist(),
    file_type='gff'
)

#merge with protein mapping information, then drop ID (equivalent to Protein)
#and id_col (the dataset id; peptide-to-protein mapping isn't dataset specific)
annotation_mapping = genemapping.merge(peptide_protein_mapping,left_on='ID',right_on='Protein').drop(['ID','id_col'],axis=1)

annotation_mapping
Out[24]:
product product_source Peptide Sequence with Mods Protein
0 DNA-directed RNA polymerase subunit beta KO:K03043 ALMGSNMQR nmdc:wfmgan-11-bnzfrb88.1_000010_2256_6224
1 heme-degrading monooxygenase HmoA COG2329 ENADAYSRGAYPDVLK nmdc:wfmgan-11-bnzfrb88.1_000125_3919_4230
2 uncharacterized protein KO:K09927 RGTWWDWAPAK nmdc:wfmgan-11-bnzfrb88.1_000220_2203_3417
... ... ... ... ...
16091 serine protease Do KO:K04771 IGEHVIAIGNPFGLSGSMTEGIVSGLGR nmdc:wfmgan-11-yjeh0m30.1_206366_1_240
16092 DNA-binding beta-propeller fold protein YncE COG3391 NLIYVIDSIQNK nmdc:wfmgan-11-yjeh0m30.1_206552_2_217
16093 hypothetical protein Hypo-rule applied DIQLDPESHLLYAGAR nmdc:wfmgan-11-yjeh0m30.1_206566_2_214

16094 rows × 4 columns

5) Generate annotation and protein mappings for peptides using "Razor" strategy¶

Identify the razor protein for each peptide. The razor strategy limits the assignment of degenerate peptides (i.e., peptides that map to more than one forward protein) to the single most likely matched protein. 'Razor' references the principle of Occam's razor, also known as the law of parsimony.

The rules are as follows:

  • If a peptide is unique to a protein, then that protein is the razor
  • Else, if a peptide belongs to more than one protein, but one of those proteins has a unique peptide, then that protein is the razor
  • Else, if a peptide belongs to more than one protein and one of those proteins has the maximal number of peptides, then that protein is the razor
  • Else, if a peptide belongs to more than one protein and more than one of those proteins has the maximal number of peptides, then collapse the proteins and gene annotations into single strings
  • Else, if a peptide belongs to more than one protein and more than one of those proteins has a unique peptide, then the peptide is removed from analysis because its mapping is inconclusive

Use annotation_mapping as the input to the function razorprotein() from the agg_func script. This will return protein and gene annotation information for each peptide, according to the above rules.

In [25]:
razormapping = agg_func.razorprotein(annotation_mapping)
razormapping
Out[25]:
product product_source Peptide Sequence with Mods razor
0 dodecin KO:K09165 AAAAAIETASR nmdc:wfmgan-11-p3dkkz54.1_002072_639_854
1 dodecin, dodecin, dodecin, dodecin, dodecin KO:K09165, KO:K09165, KO:K09165, KO:K09165, KO... AAAAAVETASR nmdc:wfmgan-11-6pq4bd13.1_044073_3_326, nmdc:w...
2 subtilisin family serine protease COG1404 AAAEPFHTAR nmdc:wfmgan-11-21ms6426.1_002328_1282_2322
... ... ... ... ...
4828 hypothetical protein Hypo-rule applied YYADISAWLVDVR nmdc:wfmgan-11-21ms6426.1_128887_2_331
4829 hypothetical protein Hypo-rule applied YYGFNWGR nmdc:wfmgan-11-dvvh0237.1_120316_3_251
4830 uncharacterized protein KO:K06996 YYSELFGWQVHEVMPTYGLVHTEAGGK nmdc:wfmgan-11-v53k8v91.1_045759_189_458

4831 rows × 4 columns

6) Perform protein rollup and summarize into a final aggregated table of relative protein abundance¶

Combine razor information with relative abundance values.

In [26]:
#merge needs to be 'right' because peptides with indeterminate mappings are removed by the mapping functions
forward_peptides = forward_peptides.merge(razormapping,how='right',on=['Peptide Sequence with Mods'])
del annotation_mapping, peptide_protein_mapping, genemapping

forward_peptides
Out[26]:
processed_DO_id Peptide Sequence with Mods StatMomentsAreaLog+Norm product product_source razor
0 nmdc:dobj-11-vjmd5h28 AAAAAIETASR 30.820153 dodecin KO:K09165 nmdc:wfmgan-11-p3dkkz54.1_002072_639_854
1 nmdc:dobj-11-tk9hc081 AAAAAVETASR 29.374877 dodecin, dodecin, dodecin, dodecin, dodecin KO:K09165, KO:K09165, KO:K09165, KO:K09165, KO... nmdc:wfmgan-11-6pq4bd13.1_044073_3_326, nmdc:w...
2 nmdc:dobj-11-sx7cyr58 AAAEPFHTAR 28.081336 subtilisin family serine protease COG1404 nmdc:wfmgan-11-21ms6426.1_002328_1282_2322
... ... ... ... ... ... ...
8599 nmdc:dobj-11-sx7cyr58 YYADISAWLVDVR 27.222319 hypothetical protein Hypo-rule applied nmdc:wfmgan-11-21ms6426.1_128887_2_331
8600 nmdc:dobj-11-xm1yjv87 YYGFNWGR 23.220936 hypothetical protein Hypo-rule applied nmdc:wfmgan-11-dvvh0237.1_120316_3_251
8601 nmdc:dobj-11-7psagy79 YYSELFGWQVHEVMPTYGLVHTEAGGK 27.015979 uncharacterized protein KO:K06996 nmdc:wfmgan-11-v53k8v91.1_045759_189_458

8602 rows × 6 columns

De-log the peptide abundances, sum the abundances for each razor protein, and log2 transform the rolled-up protein abundances.
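The order of operations matters here: summing log2 values is not the same as taking log2 of the summed intensities, so abundances must be de-logged before rollup. A quick check with two hypothetical peptide intensities:

```python
import numpy as np

# Two peptide intensities on the log2 scale (illustrative values)
log_peps = np.array([20.0, 22.0])

# Correct rollup: de-log, sum, then re-log
protein_log = np.log2(np.sum(2.0 ** log_peps))

# Incorrect: summing log2 values grossly overstates the protein abundance
wrong = np.sum(log_peps)

print(round(protein_log, 3))  # 22.322, i.e. log2(2**20 + 2**22)
print(wrong)                  # 42.0
```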

In [27]:
#de-log abundance
forward_peptides['StatMomentsAreaNorm']=2**forward_peptides['StatMomentsAreaLog+Norm']
forward_peptides.drop('StatMomentsAreaLog+Norm',axis=1,inplace=True)

#sum abundance for each razor protein in each dataset
prot_info = forward_peptides.columns[~forward_peptides.columns.isin(['StatMomentsAreaNorm','Peptide Sequence with Mods'])].tolist()
protein_abundances = forward_peptides.groupby(prot_info)['StatMomentsAreaNorm'].sum().reset_index()

#re-log2 transform abundance
protein_abundances['StatMomentsAreaLog+Norm']=np.log2(protein_abundances['StatMomentsAreaNorm'])
protein_abundances.drop('StatMomentsAreaNorm',axis=1,inplace=True)

del prot_info, forward_peptides

protein_abundances
Out[27]:
processed_DO_id product product_source razor StatMomentsAreaLog+Norm
0 nmdc:dobj-11-2k6s1505 DNA-binding protein HU-beta, DNA-binding prote... KO:K03530, KO:K03530 nmdc:wfmgan-11-8zxjs882.1_040719_104_331, nmdc... 25.187616
1 nmdc:dobj-11-2k6s1505 DNA-binding transcriptional LysR family regula... COG0583, COG0583, COG0583, COG0583 nmdc:wfmgan-11-21ms6426.1_010178_1_894, nmdc:w... 26.332428
2 nmdc:dobj-11-2k6s1505 DNA-directed RNA polymerase subunit beta' KO:K03046 nmdc:wfmgan-11-yjeh0m30.1_003176_3_1979 27.042324
... ... ... ... ... ...
6450 nmdc:dobj-11-zn9eyg46 peptidoglycan/xylan/chitin deacetylase (PgdA/C... COG0726, COG0726, COG0726 nmdc:wfmgan-11-6pq4bd13.1_001294_1_411, nmdc:w... 25.123978
6451 nmdc:dobj-11-zn9eyg46 photosystem II CP43 chlorophyll apoprotein, ph... KO:K02705, KO:K02705, KO:K02705 nmdc:wfmgan-11-hyze3t58.1_07201_3_464, nmdc:wf... 28.599485
6452 nmdc:dobj-11-zn9eyg46 predicted porin COG3203 nmdc:wfmgan-11-hyze3t58.1_02099_286_783 29.142025

6453 rows × 5 columns

Final aggregated table of relative protein abundance¶

Reformat these results into a proteomic table, where each row indicates a protein and each column indicates a sample/dataset. The values within are log2-transformed, median-normalized relative abundance values. This table, or the longform version above, can be used in further proteomic analyses.

In [28]:
#pivot to wide crosstab
aggregated_proteomic_output = protein_abundances.pivot(index='razor',columns='processed_DO_id',values='StatMomentsAreaLog+Norm')
aggregated_proteomic_output.columns.name = None

aggregated_proteomic_output
Out[28]:
nmdc:dobj-11-2k6s1505 nmdc:dobj-11-2szyc560 nmdc:dobj-11-301yh759 nmdc:dobj-11-3378mg86 nmdc:dobj-11-3ak2bc31 nmdc:dobj-11-4bwzhs42 nmdc:dobj-11-7psagy79 nmdc:dobj-11-7r5yea49 nmdc:dobj-11-7x21d450 nmdc:dobj-11-bxyvx506 ... nmdc:dobj-11-tk9hc081 nmdc:dobj-11-vjmd5h28 nmdc:dobj-11-wcn69x37 nmdc:dobj-11-xm1yjv87 nmdc:dobj-11-y2aefn18 nmdc:dobj-11-yrbs4v96 nmdc:dobj-11-z47wnf07 nmdc:dobj-11-z8hc0j30 nmdc:dobj-11-zhz66007 nmdc:dobj-11-zn9eyg46
razor
nmdc:wfmgan-11-21ms6426.1_000036_702_1004 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
nmdc:wfmgan-11-21ms6426.1_000053_3_563, nmdc:wfmgan-11-p3dkkz54.1_008013_251_1072 NaN NaN NaN NaN NaN NaN 30.499817 NaN NaN NaN ... 29.083049 32.948118 NaN NaN NaN NaN NaN NaN NaN NaN
nmdc:wfmgan-11-21ms6426.1_000057_4146_5369, nmdc:wfmgan-11-p3dkkz54.1_002120_389_1612, nmdc:wfmgan-11-z0raab34.1_009908_3_737 NaN 30.067875 28.391404 28.935731 NaN NaN 30.458277 NaN NaN 28.7232 ... NaN 30.467756 26.462791 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
nmdc:wfmgan-11-z0raab34.1_148395_2_283 NaN 31.373664 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
nmdc:wfmgan-11-z0raab34.1_160266_3_278 NaN 27.750197 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
nmdc:wfmgan-11-z0raab34.1_167280_3_278 NaN 24.717223 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3442 rows × 31 columns

The generated protein table can be used as input to the software pmartR, which performs statistical analyses such as ANOVA and independence of missing data (IMD) tests. In this case, the aggregated proteomics table (aggregated_proteomic_output) would be equivalent to pmartR's e_data and the peptide to protein to gene mappings (razormapping) would be equivalent to pmartR's e_meta.


pmartR requires sample metadata to parameterize analyses and interpret the data. For this example dataset, an API call will capture the biosample metadata (sample_metadata) that would be equivalent to pmartR's f_data.

Gather biosample metadata via the nmdc_api_utilities BiosampleSearch object, using its get_record_by_filter() function to filter on the biosample_ids associated with each output in proteomic_output_df.

In [29]:
from nmdc_api_utilities.biosample_search import BiosampleSearch
# since we are querying the biosample collection, we need to create an instance of it
bs_client = BiosampleSearch(env=ENV)

# create a list of lists of ids to query the BiosampleSearch object
chunked_list = split_list(proteomic_output_df['biosample_id'].unique().tolist())
biosamples = []
# loop through the chunked list of ids
for chunk in chunked_list:
    # create the filter - query the BiosampleSearch object looking for records where the id field is in the chunk of ids from proteomic_output_df
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"id": {{"$in": {filter_list}}}}}'
    # get the results
    biosamples += bs_client.get_record_by_filter(filter=filter,fields="id,depth.has_numeric_value", max_page_size=100, all_pages=True)

#normalize the json columns
biosamples = pd.json_normalize(biosamples).drop_duplicates().rename(columns={'id':'biosample_id'})

#merge biosample and processed data object information
sample_metadata = biosamples.merge(proteomic_output_df[['processed_DO_id','biosample_id']],on='biosample_id')
del biosamples

sample_metadata
Out[29]:
biosample_id depth.has_numeric_value processed_DO_id
0 nmdc:bsm-13-0jw5n594 0.0 nmdc:dobj-11-pxnhdj84
1 nmdc:bsm-13-13145k83 0.2 nmdc:dobj-11-xm1yjv87
2 nmdc:bsm-13-1p0tct86 0.2 nmdc:dobj-11-bxyvx506
... ... ... ...
28 nmdc:bsm-13-xvf06c81 0.4 nmdc:dobj-11-m808mj05
29 nmdc:bsm-13-yjp2dx65 0.1 nmdc:dobj-11-301yh759
30 nmdc:bsm-13-zms1jq91 0.5 nmdc:dobj-11-sx7cyr58

31 rows × 3 columns