Proteomic Data Aggregation and Visualization¶

This notebook demonstrates aggregation of proteomic data via the National Microbiome Data Collaborative (NMDC)'s Runtime API. It uses the nmdc_api_utilities package (as of March 2025) for this exploration: nmdc_api_utilities objects make NMDC API requests easier, and its utility functions help with data processing tasks. More information about the package can be found here. The notebook highlights how the NMDC's schema can be used to overcome some of the numerous challenges associated with this type of aggregation. Please note that this notebook is intended for individuals with experience performing mass spectrometry-based proteomic analyses, and that various parameter and processing choices were made for this example use case; they are not broadly applicable and should be adjusted as needed.

Notebook Steps:

  1. Assess background information and collect datasets for an example study of riverbed sediment along the Columbia River

  2. Apply a spectral probability filter across the data that optimizes the number of identifications for an FDR of 0.05

  3. Collapse to unique peptides and normalize quantification

  4. Extract functional gene annotations for proteins

  5. Generate annotation and protein mappings for peptides using the "Razor" strategy

  6. Perform protein rollup using the "Razor" results and summarize into an aggregated table of relative protein abundance

Define a function to split a list into chunks¶

Since we will need to use lists of ids to query collections in the API, we need to limit the number of ids placed in a single query. This function splits a list into chunks; chunk_size defaults to 100 but can be adjusted.

In [3]:
# Define a function to split ids into chunks
def split_list(input_list, chunk_size=100):
    result = []
    
    for i in range(0, len(input_list), chunk_size):
        result.append(input_list[i:i + chunk_size])
        
    return result
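For instance, with a lowered chunk_size the behavior is easy to see (the function is redefined here so the snippet stands alone):

```python
# Same chunking logic as above, shown on a small list with chunk_size=2
def split_list(input_list, chunk_size=100):
    result = []
    for i in range(0, len(input_list), chunk_size):
        result.append(input_list[i:i + chunk_size])
    return result

print(split_list(['id1', 'id2', 'id3', 'id4', 'id5'], chunk_size=2))
# [['id1', 'id2'], ['id3', 'id4'], ['id5']]
```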

Import libraries and Python scripts containing the functions necessary to run this notebook. 'aggregation_functions.py' (also in this folder) includes spectral probability filtering and protein mapping functions. nmdc_api_utilities includes functions for API traversal of the collections endpoint.

In [5]:
import requests
from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import sys
from IPython.display import Image, display
import nmdc_api_utilities

if 'google.colab' in sys.modules:
    #module in this folder with specific protein aggregation functions
    !wget https://raw.githubusercontent.com/microbiomedata/nmdc_notebooks/refs/heads/main/proteomic_aggregation/python/aggregation_functions.py
    import aggregation_functions as agg_func

    #pmartR logo
    !wget https://raw.githubusercontent.com/microbiomedata/nmdc_notebooks/refs/heads/main/proteomic_aggregation/pmartR_logo_final.jpg

else:
    #module in this folder with specific protein aggregation functions
    import aggregation_functions as agg_func

1) Assess background information and collect data for an example study of riverbed sediment along the Columbia River¶

Review the example study on the NMDC data portal. Use the study id embedded in the url (nmdc:sty-11-aygzgv51) to collect all related data objects via the nmdc_api_utilities package and reformat the JSON output into a pandas dataframe. These data objects reference both input files (e.g. raw, gff) and output files (e.g. metaproteomic results) to the NMDC workflows. To get the data objects, we will create a DataObjectSearch object and use the get_data_objects_for_studies() function.

In [6]:
pd.set_option("display.max_rows", 6)
from nmdc_api_utilities.data_object_search import DataObjectSearch
ENV = "prod"  # NMDC API environment ("prod" = production); assumed, since ENV is not defined elsewhere in this notebook
do_client = DataObjectSearch(env=ENV)
#get all data objects associated with this study id using the NMDC find endpoint
data = do_client.get_data_objects_for_studies(study_id='nmdc:sty-11-aygzgv51')
data = pd.DataFrame(data)

#reformat data into dataframe (keeping biosample id)
data_objects=[]
for index, row in data.iterrows():
    bio_id = row['biosample_id']
    row_out = pd.json_normalize(row['data_objects'])
    row_out['biosample_id'] = bio_id
    data_objects.append(row_out)

data_objects = pd.concat(data_objects)[['id','name','file_size_bytes','data_object_type','md5_checksum','url','biosample_id','in_manifest']]
display(data_objects)

del data, index, row, row_out, bio_id
id name file_size_bytes data_object_type md5_checksum url biosample_id in_manifest
0 nmdc:dobj-11-gxq95j86 nmdc_wfmgan-11-bnzfrb88.1_prodigal.gff 90723471 Prodigal Annotation GFF 47f381a56527a186939f638fa0486d5e https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-0jw5n594 NaN
1 nmdc:dobj-11-b8aph505 nmdc_wfmgan-11-bnzfrb88.1_imgap.info 470 Annotation Info File 2854a85740fdd20dd61bbe5c9897cc24 https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-0jw5n594 NaN
2 nmdc:dobj-11-td9k2w62 511406_Froze_Core_2015_N1_0_10_13_Lip_POS_28Ju... 12668 Configuration toml 6c5596d6f9f0446f1a0a628ff25a860a https://nmdcdemo.emsl.pnnl.gov/lipidomics/steg... nmdc:bsm-13-0jw5n594 NaN
... ... ... ... ... ... ... ... ...
7 nmdc:dobj-13-kay9ez18 Unground_SBR_Spring_2014_FC_N3_20-30_MeOHExt_2... 74228 Direct Infusion FT-ICR MS Analysis Results afc41e08ea7aba78e8fe3d3489db64c0 https://nmdcdemo.emsl.pnnl.gov/nom/results/Ung... nmdc:bsm-13-zq681s85 NaN
8 nmdc:dobj-13-c6rdvy62 output: Unground_SBR_Spring_2014_FC_N3_20-30_H... 30556452 Direct Infusion FT ICR-MS Raw Data NaN NaN nmdc:bsm-13-zq681s85 NaN
9 nmdc:dobj-13-xq8sqj89 Unground_SBR_Spring_2014_FC_N3_20-30_H2Oext_22... 10636 Direct Infusion FT-ICR MS Analysis Results a47e870bdf10041849e36f66ab34177a https://nmdcdemo.emsl.pnnl.gov/nom/results/Ung... nmdc:bsm-13-zq681s85 NaN

3515 rows × 8 columns

Subset the data objects to 'Unfiltered Metaproteomics Results'. These files contain the proteomic workflow outputs that will be used for proteomic aggregation.

In [7]:
proteomic_output_df = data_objects[data_objects['data_object_type']=='Unfiltered Metaproteomics Results'].reset_index(drop=True).rename(columns={'id':'processed_DO_id'})
#how many result files exist for each biosample?
display(proteomic_output_df.groupby('biosample_id')['data_object_type'].value_counts().reset_index())
biosample_id data_object_type count
0 nmdc:bsm-13-0jw5n594 Unfiltered Metaproteomics Results 2
1 nmdc:bsm-13-13145k83 Unfiltered Metaproteomics Results 2
2 nmdc:bsm-13-1p0tct86 Unfiltered Metaproteomics Results 2
... ... ... ...
28 nmdc:bsm-13-xvf06c81 Unfiltered Metaproteomics Results 2
29 nmdc:bsm-13-yjp2dx65 Unfiltered Metaproteomics Results 2
30 nmdc:bsm-13-zms1jq91 Unfiltered Metaproteomics Results 2

31 rows × 3 columns

Since there is more than one result file for each biosample, choose a workflow version to use in the analysis.

In [8]:
#Instantiate instances for nmdc_api_utilities
from nmdc_api_utilities.workflow_execution_search import WorkflowExecutionSearch
from nmdc_api_utilities.data_processing import DataProcessing
# since we are querying the workflow collection, we need to create an instance of it
we_client = WorkflowExecutionSearch(env=ENV)
# create an instance of the data processing object
dp_client = DataProcessing()

# create a list of lists of processed proteomic result ids to query the workflow execution objects
chunked_list = split_list(proteomic_output_df['processed_DO_id'].unique().tolist())
version_output = []
# loop through the chunked list of ids
for chunk in chunked_list:
    # create the filter - query the WorkflowExecutionSearch object looking for records
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"has_output": {{"$in": {filter_list}}}}}'
    # get the results
    version_output += we_client.get_record_by_filter(filter=filter,fields="has_output,version", max_page_size=100, all_pages=True)


#normalize the json columns and merge
proteomic_output_df = (
    pd.json_normalize(version_output)
    .rename(columns={'id': 'workflow_id', 'version': 'workflow_version'})
    .explode('has_output')
    .drop_duplicates()
    .merge(proteomic_output_df, left_on='has_output', right_on='processed_DO_id')
    .drop('has_output', axis=1)
)

proteomic_output_df = proteomic_output_df[proteomic_output_df['workflow_version'].str.contains('v1.2.1',na=False)]

proteomic_output_df
Out[8]:
workflow_id workflow_version processed_DO_id name file_size_bytes data_object_type md5_checksum url biosample_id in_manifest
3 nmdc:wfmp-11-j2tktj20.1 v1.2.1 nmdc:dobj-11-2k6s1505 nmdc_dobj-11-ms76kj12_nmdc_dobj-11-c5av0320_ms... 5471346 Unfiltered Metaproteomics Results 1badc622f20cf869077284966bb2810f https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... nmdc:bsm-13-8e1rjf10 [nmdc:manif-11-7796sg87]
4 nmdc:wfmp-11-3z777070.1 v1.2.1 nmdc:dobj-11-2szyc560 nmdc_dobj-11-tezmee57_nmdc_dobj-11-w2c1ee09_ms... 8117239 Unfiltered Metaproteomics Results fd8a212ad33a8162857d0ca164760333 https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... nmdc:bsm-13-vqk1y255 [nmdc:manif-11-7796sg87]
5 nmdc:wfmp-11-7h4m6t36.1 v1.2.1 nmdc:dobj-11-301yh759 nmdc_dobj-11-zd25ft61_nmdc_dobj-11-vek6s281_ms... 10756563 Unfiltered Metaproteomics Results c82e8026a2d974cb4f9a75b6eef5c62b https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... nmdc:bsm-13-yjp2dx65 [nmdc:manif-11-7796sg87]
... ... ... ... ... ... ... ... ... ... ...
59 nmdc:wfmp-11-k5qcay19.1 v1.2.1 nmdc:dobj-11-z8hc0j30 nmdc_dobj-11-hp2vmj63_nmdc_dobj-11-716e7y35_ms... 8376109 Unfiltered Metaproteomics Results 8785b9538d977a50fd35d11bfcb037ce https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... nmdc:bsm-13-2fw5j754 [nmdc:manif-11-7796sg87]
60 nmdc:wfmp-11-f6sfn088.1 v1.2.1 nmdc:dobj-11-zhz66007 nmdc_dobj-11-mzxj8743_nmdc_dobj-11-pxd5ez06_ms... 9983187 Unfiltered Metaproteomics Results d596283fd19d30d8708afab0a8677e67 https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... nmdc:bsm-13-as680w10 [nmdc:manif-11-7796sg87]
61 nmdc:wfmp-11-xm389s70.1 v1.2.1 nmdc:dobj-11-zn9eyg46 nmdc_dobj-11-6n49rm30_nmdc_dobj-11-nb9f8z34_ms... 2321178 Unfiltered Metaproteomics Results 8eb29a562832f1a8862a5a643e297389 https://nmdcdemo.emsl.pnnl.gov/proteomics/resu... nmdc:bsm-13-sepcxd06 [nmdc:manif-11-7796sg87]

31 rows × 10 columns

There are various requirements that enable mass spectrometry runs to be aggregated and analyzed together. For example, runs need to be performed in succession, on the same instrument. The NMDC schema can make it easier to find these proteomic results by linking them via a slot called in_manifest.

Look at the in_manifest id on these proteomic outputs to confirm that all runs are in the same manifest record, and pull that record. If that manifest record's manifest_category value is 'instrument_run', then it confirms that these are LC-MS/MS runs that were performed in succession on the same instrument. Proteomic outputs from different manifest records should not be aggregated.

Using the nmdc_api_utilities package, we create a ManifestSearch object, create the needed filter, and use the get_record_by_filter() function to get records.
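As an aside, the f-string filters used in these cells can also be built with json.dumps, which handles quoting automatically; a sketch of that alternative (the id below is this study's manifest id):

```python
import json

# Build the same MongoDB-style "$in" filter string without manual formatting
chunk = ["nmdc:manif-11-7796sg87"]
filter_str = json.dumps({"id": {"$in": chunk}})
print(filter_str)  # {"id": {"$in": ["nmdc:manif-11-7796sg87"]}}
```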

In [9]:
from nmdc_api_utilities.manifest_search import ManifestSearch
#extract in_manifest ids for this study (in this case a single one available)
manifest_id = proteomic_output_df.explode('in_manifest')['in_manifest'].unique()
# since we are querying the manifest collection, we need to create an instance of it
m_client = ManifestSearch(env=ENV)
# create a list of lists of ids to query the ManifestSearch object
chunked_list = split_list(proteomic_output_df['in_manifest'].explode().unique().tolist())
manifest = []
# loop through the chunked list of ids
for chunk in chunked_list:
    # create the filter - query the ManifestSearch object looking for records where the id field is in the chunk of ids from proteomic_output_df
    filter_list = dp_client._string_mongo_list(chunk)
    filter = f'{{"id": {{"$in": {filter_list}}}}}'
    # get the results
    manifest += m_client.get_record_by_filter(filter=filter, max_page_size=100, all_pages=True)

display(manifest)
[{'id': 'nmdc:manif-11-7796sg87',
  'manifest_category': 'instrument_run',
  'type': 'nmdc:Manifest',
  'description': 'collection of metaproteomic analyses from the same instrument run nmdc:sty-11-aygzgv51'}]

Look at an example of the information in 'Unfiltered Metaproteomics Results', which contains peptide identification and relative abundance information.

In [10]:
#example unfiltered results 
unfilt_results = proteomic_output_df.iloc[0]["url"]
print(unfilt_results)
display(agg_func.tsv_extract(unfilt_results))

del unfilt_results
https://nmdcdemo.emsl.pnnl.gov/proteomics/results/nmdc_dobj-11-ms76kj12_nmdc_dobj-11-c5av0320_msgfplus_syn_PlusSICStats.txt
ResultID Scan FragMethod SpecIndex Charge PrecursorMZ DelM DelM_PPM MH Peptide ... PeakMaxIntensity PeakSignalToNoiseRatio FWHMInScans PeakArea ParentIonIntensity ParentIonMZ StatMomentsArea PeakScanStart PeakScanEnd PeakWidthMinutes
0 1 36947 HCD 1 3 968.47449 -0.00469 -1.61740 2902.410370 K.NYSPYYNTIDDLKDQIVDLTVGNNK.T ... 6619700.0 106.00 62 503970000.0 4468100.0 968.47 453600000.0 36866 37232 0
1 2 30576 HCD 2 3 1017.16516 -0.01640 -5.38151 3048.494105 K.VLYDAEISQIHQSVTDTNVILSMDNSR.N ... 2595300.0 343.30 115 227780000.0 2481100.0 1017.17 206570000.0 30497 30787 0
2 3 21671 HCD 3 3 1050.48071 0.02323 7.38083 3148.400992 K.NADSTLHTVTSGTAEGGESGTVFDSSYMAAGK.T ... 2393100.0 165.90 14 51712000.0 1454200.0 1050.48 53478000.0 21598 21730 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22014 22015 10397 HCD 19177 2 535.74500 -0.00116 -1.08134 1069.480463 K.DMMELFQR.A ... 1673900.0 37.75 213 246200000.0 2129400.0 535.74 230580000.0 10205 10427 0
22015 22016 7224 HCD 19179 3 525.91705 -0.01298 -8.23982 1575.749593 K.VEETEKGEEQLER.L ... 2701400.0 20.38 104 291450000.0 2969100.0 526.25 241570000.0 7097 7395 0
22016 22017 32524 HCD 19178 2 1130.05615 -0.00976 -4.32084 2259.114693 -.SDIPGQDCKMIFGHLDELLK.I ... 736733.0 24.86 59 32040000.0 812098.0 1130.06 27600000.0 32481 32598 0

22017 rows × 36 columns

Extract information from all 31 proteomic results via the function iterate_file_extract() in agg_func, and put them into a single dataframe, where each scan in each dataset has the unique identifier SpecID. Clean the prefix and suffix off of each peptide sequence. Since these data were processed using a target-decoy approach, determine the type of protein being matched to each peptide: contaminant, reverse (false positive match to the reversed amino acid sequence of a protein), or forward (match to the true, forward amino acid sequence of a protein). The presence of forward and reverse matches enables FDR estimation in the next step.
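The actual helpers live in aggregation_functions.py; the logic they implement can be sketched minimally as below, with the 'Contaminant_' and 'XXX_' prefixes inferred from the result tables rather than taken from the source:

```python
def sequence_noprefsuff(peptide):
    """Strip flanking residues, e.g. 'K.NYSPYY...NNK.T' -> 'NYSPYY...NNK'.
    Rejoining the middle parts preserves any '.' inside modification masses."""
    parts = peptide.split(".")
    return ".".join(parts[1:-1]) if len(parts) >= 3 else peptide

def findproteinname(protein):
    """Classify a protein match as contaminant, reversed (decoy), or forward."""
    if protein.startswith("Contaminant_"):
        return None          # contaminants are excluded from FDR estimation
    if protein.startswith("XXX_"):
        return "Reversed"    # decoy: reversed amino acid sequence
    return "Forward"

print(findproteinname("XXX_nmdc:wfmgan-11-hyze3t58.1_29324_190_333"))  # Reversed
```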

In [11]:
unfilt_res = agg_func.iterate_file_extract(identifier_col='processed_DO_id',
                                url_col='url',
                                extract_cols=['Charge','Scan','Peptide','Protein','MSGFDB_SpecEValue','StatMomentsArea'],
                                pd_df=proteomic_output_df,
                                filter_col = None,
                                filter_values = None,
                                file_type='tsv'
                                )


#create identifier for each scan in each dataset
unfilt_res["SpecID"] = unfilt_res.apply(lambda row: str(row["id_col"]) + "_" + str(row["Scan"]), axis=1)

#clean off the prefix and suffix from the sequence but keep any mods
unfilt_res["Peptide Sequence with Mods"] = unfilt_res["Peptide"].apply(agg_func.sequence_noprefsuff)
del unfilt_res['Peptide']

#determine protein type (contaminant, reverse, forward)
unfilt_res["Protein_Type"] = unfilt_res["Protein"].apply(agg_func.findproteinname)

unfilt_res
Out[11]:
Charge Scan Protein MSGFDB_SpecEValue StatMomentsArea id_col SpecID Peptide Sequence with Mods Protein_Type
0 3 36947 Contaminant_K1C9_HUMAN 1.454800e-31 453600000.0 nmdc:dobj-11-2k6s1505 nmdc:dobj-11-2k6s1505_36947 NYSPYYNTIDDLKDQIVDLTVGNNK None
1 3 30576 Contaminant_K22E_HUMAN 7.111100e-31 206570000.0 nmdc:dobj-11-2k6s1505 nmdc:dobj-11-2k6s1505_30576 VLYDAEISQIHQSVTDTNVILSMDNSR None
2 3 21671 nmdc:wfmgan-11-a71gkg84.1_003113_596_1117 1.396200e-28 53478000.0 nmdc:dobj-11-2k6s1505 nmdc:dobj-11-2k6s1505_21671 NADSTLHTVTSGTAEGGESGTVFDSSYMAAGK Forward
... ... ... ... ... ... ... ... ... ...
945188 2 14621 XXX_nmdc:wfmgan-11-hyze3t58.1_29324_190_333 4.997800e-07 336710000.0 nmdc:dobj-11-zn9eyg46 nmdc:dobj-11-zn9eyg46_14621 FSFFYDMR Reversed
945189 2 24865 nmdc:wfmgan-11-hyze3t58.1_08744_41_232 4.998600e-07 363810000.0 nmdc:dobj-11-zn9eyg46 nmdc:dobj-11-zn9eyg46_24865 RTMTEYGHK Forward
945190 2 15161 nmdc:wfmgan-11-hyze3t58.1_72540_2_109 4.999300e-07 25782000.0 nmdc:dobj-11-zn9eyg46 nmdc:dobj-11-zn9eyg46_15161 WFFIYK Forward

945191 rows × 9 columns

2) Apply a spectral probability filter across the data that optimizes the number of identifications for an FDR of 0.05¶

A challenge associated with aggregating mass spectrometry data is that there are always false identifications, which can be mitigated by imposing a spectral probability filter on the data being analyzed. The same spectral probability filter needs to be applied across datasets when they are being compared. The filter value itself is chosen by weighing the number of 'true' identifications retained against the proximity of the data to a chosen false discovery rate (FDR) (usually 0.05 or 0.01). NMDC's metaproteomic workflow provides 'true' and 'false' identifications for FDR estimation in the 'Unfiltered Metaproteomics Results' files.
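With a target-decoy search, the spectral FDR at a candidate filter is estimated from the counts of forward (F) and reverse (R) matches passing it, as 2R/(F+R); a small sketch of that calculation (the counts are made up):

```python
def spectral_fdr(n_forward, n_reverse):
    """Estimate spectral FDR from forward/reverse PSM counts as 2R / (F + R)."""
    if n_forward == 0 and n_reverse == 0:
        return 1.0  # nothing passes the filter; treat as fully uncertain
    return (2 * n_reverse) / (n_forward + n_reverse)

print(spectral_fdr(9500, 250))  # ~0.0513, just above a 0.05 target
```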

Create a dataframe of peptide identifications (ignoring protein mapping). Filter identifications to the peptide sequence with the smallest SpecEValue for each SpecID, so there is a single, highest probability identification for each scan.

In [12]:
edata = unfilt_res[['SpecID','Peptide Sequence with Mods','MSGFDB_SpecEValue','Protein_Type','StatMomentsArea']].drop_duplicates()  # important to remove any duplicated rows here!

#for each SpecID, select the peptide spectrum match with the smallest MSGFDB_SpecEValue (.idxmin() takes the first entry when there's multiple matches)
idx = edata.groupby(['SpecID'])['MSGFDB_SpecEValue'].idxmin()
edata = edata.loc[idx].reset_index(drop=True)
del idx

display(edata)

assert len(edata['SpecID'].unique())==edata.shape[0], "still more than one identification per scan"
SpecID Peptide Sequence with Mods MSGFDB_SpecEValue Protein_Type StatMomentsArea
0 nmdc:dobj-11-2k6s1505_10005 PPDERERSEEAEKRDEERDRVRDELLAGAEEGEPR 4.994300e-07 Forward 1.238500e+09
1 nmdc:dobj-11-2k6s1505_10008 LRAGSEPR 1.172600e-07 Forward 1.234600e+08
2 nmdc:dobj-11-2k6s1505_10009 WAKEIENQK 2.473900e-07 Forward 5.731600e+07
... ... ... ... ... ...
797680 nmdc:dobj-11-zn9eyg46_9976 CRDKWYQMAEGK 4.110300e-07 Reversed 6.141800e+06
797681 nmdc:dobj-11-zn9eyg46_9979 WWQQINK 4.622300e-08 Reversed 4.143000e+07
797682 nmdc:dobj-11-zn9eyg46_9980 RFEECERECK 1.126600e-07 Forward 2.284500e+06

797683 rows × 5 columns

Create separate dataframes of forward and reverse peptide spectrum matches.

In [13]:
forward_peptides = edata[edata["Protein_Type"] == "Forward"].copy().reset_index(drop=True)
del forward_peptides["Protein_Type"]
display(forward_peptides)
SpecID Peptide Sequence with Mods MSGFDB_SpecEValue StatMomentsArea
0 nmdc:dobj-11-2k6s1505_10005 PPDERERSEEAEKRDEERDRVRDELLAGAEEGEPR 4.994300e-07 1.238500e+09
1 nmdc:dobj-11-2k6s1505_10008 LRAGSEPR 1.172600e-07 1.234600e+08
2 nmdc:dobj-11-2k6s1505_10009 WAKEIENQK 2.473900e-07 5.731600e+07
... ... ... ... ...
401082 nmdc:dobj-11-zn9eyg46_9959 DAVSGRHHAQGR 6.589700e-09 1.566600e+07
401083 nmdc:dobj-11-zn9eyg46_9967 FYRFSGR 1.080500e-07 3.418000e+08
401084 nmdc:dobj-11-zn9eyg46_9980 RFEECERECK 1.126600e-07 2.284500e+06

401085 rows × 4 columns

In [14]:
reversed_peptides = edata[edata["Protein_Type"] == "Reversed"].copy().reset_index(drop=True)
del reversed_peptides["Protein_Type"]
del edata
display(reversed_peptides)
SpecID Peptide Sequence with Mods MSGFDB_SpecEValue StatMomentsArea
0 nmdc:dobj-11-2k6s1505_10013 EFVDIISYMENENHSDIEYPLLYKWDSKSTVINR 1.021500e-07 7.371400e+08
1 nmdc:dobj-11-2k6s1505_10019 QWHPNFLR 3.448500e-07 3.220300e+09
2 nmdc:dobj-11-2k6s1505_10026 LAEREGGAR 1.451200e-08 9.604800e+08
... ... ... ... ...
383749 nmdc:dobj-11-zn9eyg46_9969 TDTATHWHK 1.879300e-07 1.960000e+08
383750 nmdc:dobj-11-zn9eyg46_9976 CRDKWYQMAEGK 4.110300e-07 6.141800e+06
383751 nmdc:dobj-11-zn9eyg46_9979 WWQQINK 4.622300e-08 4.143000e+07

383752 rows × 4 columns

Use the function optimize_specFilt() in agg_func to find a log10 spectral probability filter that weighs the number of forward peptides retained against the proximity of the dataset to a 0.05 spectral FDR. Visualize the impact of the spectral probability filter by plotting the number of forward and reverse peptides retained.

The main plot below is a histogram of forward and reverse peptides across all spectral probability values. The inset within this plot depicts a subset of the smallest spectral probability values, with the red bar before the dashed line representing the estimated number of false identifications that will be included in this analysis.
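optimize_specFilt is defined in aggregation_functions.py; the idea behind it can be sketched as choosing the log10 cutoff that keeps the most forward peptide spectrum matches while the estimated FDR stays at or below the target, here via a simple grid search rather than the function's actual optimizer:

```python
import numpy as np

def fdr_and_count(fwd_evalues, rev_evalues, log10_cut):
    """Spectral FDR (2R/(F+R)) and forward PSM count below a log10 cutoff."""
    f = int(np.sum(fwd_evalues < 10 ** log10_cut))
    r = int(np.sum(rev_evalues < 10 ** log10_cut))
    fdr = 1.0 if (f + r) == 0 else (2 * r) / (f + r)
    return fdr, f

def grid_search_cutoff(fwd_evalues, rev_evalues, target_fdr=0.05):
    """Return (log10 cutoff, forward count, FDR) keeping the most forward
    PSMs while the estimated FDR stays at or below target_fdr."""
    best = None
    for cut in np.linspace(-15, -5, 201):
        fdr, f = fdr_and_count(fwd_evalues, rev_evalues, cut)
        if fdr <= target_fdr and (best is None or f > best[1]):
            best = (cut, f, fdr)
    return best
```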

In [15]:
#initial guess at a log10 spectral probability filter value
initial_specprob_filter = -15

#perform optimization
optimization = agg_func.optimize_specFilt(initial_specprob_filter,forward_peptides,reversed_peptides)

#visualize filter
fitted_params=optimization.x
fig, ax_main = plt.subplots()
ax_inset = plt.axes([0.45, 0.45, 0.35, 0.35])


# Main plot
# forward peptides
hist, bins = np.histogram(forward_peptides["MSGFDB_SpecEValue"], bins=50)
ax_main.bar(bins[:-1], hist, width=np.diff(bins), align='edge', color='g', label='forward')

# reverse peptides
hist, bins = np.histogram(reversed_peptides["MSGFDB_SpecEValue"], bins=50)
ax_main.bar(bins[:-1], hist, width=np.diff(bins), align='edge',color='r', alpha=0.7, label='reverse')

# filter cutoff
ax_main.axvline(x=10 ** fitted_params[0], color="black", label = 'filter cutoff', linestyle="--")
ax_main.set_ylabel('number of peptide spectrum matches')
ax_main.set_xlabel('MSGFDB_SpecEValue')
ax_main.ticklabel_format(style='plain', axis='x')
ax_main.set_title(f'impact of spectral probability filter {10 ** fitted_params[0]}')
ax_main.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=2)


# Inset plot
# forward peptides
hist, bins = np.histogram(forward_peptides[forward_peptides["MSGFDB_SpecEValue"] < 0.000000002]["MSGFDB_SpecEValue"], bins=18)
ax_inset.bar(bins[:-1], hist, width=np.diff(bins), align='edge', color='g')

# reverse peptides
hist, bins = np.histogram(reversed_peptides[reversed_peptides["MSGFDB_SpecEValue"] < 0.000000002]["MSGFDB_SpecEValue"], bins=18)
ax_inset.bar(bins[:-1], hist, width=np.diff(bins), align='edge',color='r', alpha=0.7)

# filter cutoff
ax_inset.axvline(x=10 ** fitted_params[0], color="black", label = 'filter cutoff', linestyle="--")
ax_inset.set_xlabel('MSGFDB_SpecEValue')

plt.show()

Apply the filter to the dataset and recalculate peptide and spectral FDR.

In [16]:
#Filter the data according to the filter and recalculate FDR
forward_peptides = forward_peptides[
    (forward_peptides["MSGFDB_SpecEValue"] < 10 ** optimization.x[0])
].copy()

reversed_peptides = reversed_peptides[
    (reversed_peptides["MSGFDB_SpecEValue"] < 10 ** optimization.x[0])
].copy()

# Calculate FDR
f_spec = (forward_peptides["SpecID"].unique().size)
r_spec = reversed_peptides["SpecID"].unique().size
if (f_spec == 0) & (r_spec == 0):
    fdr_spec = 1
else:
    fdr_spec = (2*r_spec) / (f_spec + r_spec)

f_pep = forward_peptides["Peptide Sequence with Mods"].unique().size
r_pep = reversed_peptides["Peptide Sequence with Mods"].unique().size
if (f_pep == 0) & (r_pep == 0):
    fdr_pep = 1
else:
    fdr_pep = (r_pep) / (f_pep + r_pep)

del f_spec, r_spec, f_pep, r_pep

print("Spectral FDR:",fdr_spec,"\nPeptide FDR:",fdr_pep)
Spectral FDR: 0.049787398531117126 
Peptide FDR: 0.05560704355885079

3) Collapse to unique peptides and normalize their relative abundance¶

At this point in the analysis, the data has been filtered to only high probability peptide identifications, but more than one scan within a dataset can have the same peptide identification. This can be due to a peptide eluting into the mass spectrometer over the course of multiple scans or eluting in multiple charge states. Sum the relative abundance for peptide sequences detected more than once in a dataset, leaving a total relative abundance value for each peptide in each dataset.

In [17]:
#extract data set id
forward_peptides['processed_DO_id'] = forward_peptides['SpecID'].str.split('_').str[0]

#drop SpecID and spectral probability columns since no longer relevant
forward_peptides.drop(['SpecID','MSGFDB_SpecEValue'],axis=1, inplace=True)

#for each peptide sequence with mods, sum the abundances for all scans/identifications
forward_peptides = forward_peptides.groupby(['processed_DO_id','Peptide Sequence with Mods'])['StatMomentsArea'].sum().to_frame().reset_index()

display(forward_peptides)
processed_DO_id Peptide Sequence with Mods StatMomentsArea
0 nmdc:dobj-11-2k6s1505 AALAANSELMNK 115710000.0
1 nmdc:dobj-11-2k6s1505 AFTVDEMR 59240000.0
2 nmdc:dobj-11-2k6s1505 AGGSITSSGGVERNDVDYSAVLTK 5785600.0
... ... ... ...
9757 nmdc:dobj-11-zn9eyg46 VAELEATTAR 521284000.0
9758 nmdc:dobj-11-zn9eyg46 VAELEATTVR 20899000.0
9759 nmdc:dobj-11-zn9eyg46 WLWNYYQYYK 582031.0

9760 rows × 3 columns

Visualize the untransformed and un-normalized relative abundances.

In [18]:
#plot untransformed abundance values
untransformed_abundances_fig, ax = plt.subplots(figsize=(8,6))
sns.boxplot(x='processed_DO_id',y='StatMomentsArea',data=forward_peptides)
plt.ticklabel_format(style='plain', axis='y')
plt.xticks([])
plt.xlabel('sample')
plt.ylabel('Relative Peptide Abundance (Un-Normalized)')
Out[18]:
Text(0, 0.5, 'Relative Peptide Abundance (Un-Normalized)')

Apply log2 transformation and median normalize peptide abundances.

In [19]:
#log2 transformation
forward_peptides['StatMomentsAreaLog2'] = np.log2(forward_peptides['StatMomentsArea'])

# Calculate group-wise (sample-wise) medians
group_medians = forward_peptides.groupby('processed_DO_id')['StatMomentsAreaLog2'].median()

# Calculate data-wide median
all_data_median = forward_peptides['StatMomentsAreaLog2'].median()

# Subtract each sample's median from its values (vectorized via map)
forward_peptides['StatMomentsAreaLog+Norm'] = (
    forward_peptides['StatMomentsAreaLog2']
    - forward_peptides['processed_DO_id'].map(group_medians)
)

#add back in the data-wide median value to avoid negative abundances
forward_peptides['StatMomentsAreaLog+Norm'] = forward_peptides['StatMomentsAreaLog+Norm'] + all_data_median

transformed_abundances_fig, ax = plt.subplots(figsize=(8,6))
sns.boxplot(x='processed_DO_id',y='StatMomentsAreaLog+Norm',data=forward_peptides)
plt.xticks([])
plt.xlabel('sample')
plt.ylabel('Relative Peptide Abundance (Normalized)')

del ax, group_medians, all_data_median, forward_peptides['StatMomentsArea'], forward_peptides['StatMomentsAreaLog2'], reversed_peptides

4) Extract functional gene annotations for proteins¶

Collect peptide to protein mapping information for the passing peptide sequences.

In [20]:
peptide_protein_mapping = (
    unfilt_res[unfilt_res['Peptide Sequence with Mods'].isin(forward_peptides['Peptide Sequence with Mods'])]
    [['Peptide Sequence with Mods', 'Protein']]
    .drop_duplicates()
    .reset_index(drop=True)
)
peptide_protein_mapping
Out[20]:
Peptide Sequence with Mods Protein
0 NADSTLHTVTSGTAEGGESGTVFDSSYMAAGK nmdc:wfmgan-11-a71gkg84.1_003113_596_1117
1 ETNPTLADIEQDSKVKDAIEK nmdc:wfmgan-11-a71gkg84.1_046034_35_346
2 GETYVNVHTEKNPNGEIR nmdc:wfmgan-11-a71gkg84.1_000318_745_1245
... ... ...
16091 EDRTPIVEIMNR nmdc:wfmgan-11-hyze3t58.1_63946_3_284
16092 EARFYSLYDKVWR nmdc:wfmgan-11-hyze3t58.1_06697_29_526
16093 FVRIEDETFNR nmdc:wfmgan-11-hyze3t58.1_00200_3_428

16094 rows × 2 columns

Annotation information for these proteins can be found in 'Functional Annotation GFF' files.

Since the data_objects dataframe contains all objects associated with our study id, it also contains the relevant 'Functional Annotation GFF' files. Subset this dataframe to GFF files associated with the 31 biosample ids that have a proteomic output in proteomic_output_df.
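For reference, a 'Functional Annotation GFF' row stores its annotations in a semicolon-delimited ninth column of key=value pairs, which agg_func.gff_extract_features spreads into dataframe columns; a minimal sketch of that parsing step (the attribute string below is illustrative):

```python
def parse_gff_attributes(attr_field):
    """Split a GFF column-9 attribute string into a dict of key=value pairs."""
    pairs = (item.split("=", 1) for item in attr_field.strip(";").split(";"))
    return {key: value for key, value in pairs}

attrs = parse_gff_attributes(
    "ID=nmdc:wfmgan-11-bnzfrb88.1_000001_2910_3746;"
    "product=biotin carboxyl carrier protein;product_source=COG0511"
)
print(attrs["product"])  # biotin carboxyl carrier protein
```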

In [21]:
annotation_input_df = data_objects[
    (data_objects['data_object_type'] == 'Functional Annotation GFF')
    & (data_objects['biosample_id'].isin(proteomic_output_df['biosample_id'].unique().tolist()))
].reset_index(drop=True)
annotation_input_df = annotation_input_df[['id','file_size_bytes','data_object_type','md5_checksum','url','biosample_id']]
display(annotation_input_df)
id file_size_bytes data_object_type md5_checksum url biosample_id
0 nmdc:dobj-11-fdngms90 77411808 Functional Annotation GFF e4344ebdca641bd24709f89b8ff83dd3 https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-0jw5n594
1 nmdc:dobj-11-xxffzc65 66314886 Functional Annotation GFF 9051bc023e5b14373dc016e6a3a0943b https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-13145k83
2 nmdc:dobj-11-c24s6186 114196066 Functional Annotation GFF 3049a60a48baca9133461a7cde827b6f https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-1p0tct86
... ... ... ... ... ... ...
28 nmdc:dobj-11-tv82b372 60997438 Functional Annotation GFF cfd4e4c3c17d4d2020f364555836c655 https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-xvf06c81
29 nmdc:dobj-11-92qp1w39 76608329 Functional Annotation GFF cda53e77cb70328498abc29685baf61c https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-yjp2dx65
30 nmdc:dobj-11-2f3gzn94 80194278 Functional Annotation GFF 8e69768342ba2c92cd7feac23a8d933f https://data.microbiomedata.org/data/nmdc:ompr... nmdc:bsm-13-zms1jq91

31 rows × 6 columns

Preview the 'Functional Annotation GFF' files and determine a subset of gene annotation information that should be pulled from all 31 files.

In [22]:
#example annotation file
annotation_file=annotation_input_df.iloc[0]["url"]
print(annotation_file)
display(agg_func.gff_extract_features(annotation_file))

del annotation_file
https://data.microbiomedata.org/data/nmdc:omprc-13-e4rebn56/nmdc:wfmgan-11-bnzfrb88.1/nmdc_wfmgan-11-bnzfrb88.1_functional_annotation.gff
ID translation_table partial start_type product product_source cath_funfam cog pfam superfamily ... bound_moiety ncRNA_class intron_start intron_end number_of_repeats average_repeat_length median_repeat_length average_spacer_length median_spacer_length Parent
0 nmdc:wfmgan-11-bnzfrb88.1_000001_3_2729 11 5' Edge diaminopimelate decarboxylase/amino acid trans... COG0019/COG0531 2.40.37.10,3.20.20.10 COG0019,COG0531 PF00278,PF02784,PF13520 51419 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 nmdc:wfmgan-11-bnzfrb88.1_000001_2750_2869 11 NaN ATG hypothetical protein Hypo-rule applied NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 nmdc:wfmgan-11-bnzfrb88.1_000001_2910_3746 11 NaN ATG biotin carboxyl carrier protein COG0511 2.70.70.10 COG0511 PF01551 51230 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
272943 nmdc:wfmgan-11-bnzfrb88.1_208965_2_199 11 5',3' Edge F-type H+/Na+-transporting ATPase subunit alpha KO:K02111 1.20.150.20 COG0056 PF00306 47917 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
272944 nmdc:wfmgan-11-bnzfrb88.1_208966_2_199 11 5',3' Edge S-adenosyl-L-methionine hydrolase (adenosine-f... KO:K22205 3.40.50.10790 COG1912 PF01887 102522 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
272945 nmdc:wfmgan-11-bnzfrb88.1_208967_69_200 11 3' ATG DNA-binding NarL/FixJ family response regulator COG2197 3.40.50.2300 COG2197 NaN 52172 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

272946 rows × 33 columns

Extract information from all 31 annotation files (this takes a while to run) and merge with peptide_protein_mapping so there is a final table of peptide-protein-annotation mapping (annotation_mapping).

In [23]:
# Get gene annotation information for proteins
genemapping = annotation_input_df[["id", "url"]].drop_duplicates().reset_index(drop=True)

genemapping = agg_func.iterate_file_extract(
    pd_df=genemapping,
    identifier_col='id',
    url_col='url',
    extract_cols=['ID', 'product', 'product_source'],
    filter_col='ID',
    filter_values=peptide_protein_mapping['Protein'].unique().tolist(),
    file_type='gff'
)

# Merge with protein mapping information. Drop ID (equivalent to Protein) and
# id_col (the dataset id, unnecessary here since peptide-to-protein mapping isn't dataset specific)
annotation_mapping = genemapping.merge(peptide_protein_mapping, left_on='ID', right_on='Protein').drop(['ID', 'id_col'], axis=1)

annotation_mapping
Out[23]:
product product_source Peptide Sequence with Mods Protein
0 DNA-directed RNA polymerase subunit beta KO:K03043 ALMGSNMQR nmdc:wfmgan-11-bnzfrb88.1_000010_2256_6224
1 heme-degrading monooxygenase HmoA COG2329 ENADAYSRGAYPDVLK nmdc:wfmgan-11-bnzfrb88.1_000125_3919_4230
2 uncharacterized protein KO:K09927 RGTWWDWAPAK nmdc:wfmgan-11-bnzfrb88.1_000220_2203_3417
... ... ... ... ...
16091 hypothetical protein Hypo-rule applied EAPITALDHK nmdc:wfmgan-11-21ms6426.1_215294_58_276
16092 hypothetical protein Hypo-rule applied LWAVDLATGSK nmdc:wfmgan-11-21ms6426.1_216224_1_207
16093 hypothetical protein Hypo-rule applied TLFLDVGSR nmdc:wfmgan-11-21ms6426.1_216224_1_207

16094 rows × 4 columns

5) Generate annotation and protein mappings for peptides using "Razor" strategy¶

Identify the razor protein. The "razor" approach limits the assignment of degenerate peptides (i.e., peptides that map to more than one forward protein) to a single most likely protein. The name references the principle of Occam's razor, also known as the law of parsimony.

The rules are as follows:

  • If a peptide is unique to a protein, then that protein is the razor
  • Else, if a peptide belongs to more than one protein, but one of those proteins has a unique peptide, then that protein is the razor
  • Else, if a peptide belongs to more than one protein and one of those proteins has the maximal number of peptides, then that protein is the razor
  • Else, if a peptide belongs to more than one protein and more than one of those proteins has the maximal number of peptides, then collapse the proteins and gene annotations into single strings
  • Else, if a peptide belongs to more than one protein and more than one of those proteins has a unique peptide, then the peptide is removed from analysis because its mapping is inconclusive
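A much simplified sketch of this assignment logic is shown below; it covers only the unique-peptide, maximal-peptide-count, and tie-collapsing rules, and the helper name and toy mapping are hypothetical (the real logic lives in agg_func.razorprotein).

```python
# Simplified sketch of razor assignment (hypothetical helper).
# Covers three of the rules: unique peptides, maximal peptide count,
# and collapsing ties into a single comma-joined string.
from collections import Counter

def assign_razor(peptide_to_proteins: dict) -> dict:
    # Count how many peptides each protein matched across the whole mapping
    protein_counts = Counter(p for prots in peptide_to_proteins.values() for p in prots)
    razor = {}
    for peptide, proteins in peptide_to_proteins.items():
        if len(proteins) == 1:
            razor[peptide] = proteins[0]  # unique peptide: that protein is the razor
        else:
            best = max(protein_counts[p] for p in proteins)
            winners = sorted(p for p in proteins if protein_counts[p] == best)
            # single protein with the most peptides wins; ties are collapsed
            razor[peptide] = ", ".join(winners)
    return razor

mapping = {
    "PEPTIDEA": ["prot1"],
    "PEPTIDEB": ["prot1", "prot2"],  # prot1 has more peptides overall
}
print(assign_razor(mapping))
# {'PEPTIDEA': 'prot1', 'PEPTIDEB': 'prot1'}
```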

Use annotation_mapping as the input to the function razorprotein() from the agg_func script. This will return protein and gene annotation information for each peptide, according to the above rules.

In [24]:
razormapping = agg_func.razorprotein(annotation_mapping)
razormapping
Out[24]:
product product_source Peptide Sequence with Mods razor
0 dodecin KO:K09165 AAAAAIETASR nmdc:wfmgan-11-p3dkkz54.1_002072_639_854
1 dodecin, dodecin, dodecin, dodecin, dodecin KO:K09165, KO:K09165, KO:K09165, KO:K09165, KO... AAAAAVETASR nmdc:wfmgan-11-6pq4bd13.1_044073_3_326, nmdc:w...
2 subtilisin family serine protease COG1404 AAAEPFHTAR nmdc:wfmgan-11-21ms6426.1_002328_1282_2322
... ... ... ... ...
4828 hypothetical protein Hypo-rule applied YYADISAWLVDVR nmdc:wfmgan-11-21ms6426.1_128887_2_331
4829 hypothetical protein Hypo-rule applied YYGFNWGR nmdc:wfmgan-11-dvvh0237.1_120316_3_251
4830 uncharacterized protein KO:K06996 YYSELFGWQVHEVMPTYGLVHTEAGGK nmdc:wfmgan-11-v53k8v91.1_045759_189_458

4831 rows × 4 columns

6) Perform protein rollup and summarize into a final aggregated table of relative protein abundance¶

Combine razor information with relative abundance values.

In [25]:
# Merge must be 'right' because some peptides with indeterminate mappings are removed by the mapping functions
forward_peptides = forward_peptides.merge(razormapping, how='right', on=['Peptide Sequence with Mods'])
del annotation_mapping, peptide_protein_mapping, genemapping

forward_peptides
Out[25]:
processed_DO_id Peptide Sequence with Mods StatMomentsAreaLog+Norm product product_source razor
0 nmdc:dobj-11-vjmd5h28 AAAAAIETASR 30.820153 dodecin KO:K09165 nmdc:wfmgan-11-p3dkkz54.1_002072_639_854
1 nmdc:dobj-11-tk9hc081 AAAAAVETASR 29.374877 dodecin, dodecin, dodecin, dodecin, dodecin KO:K09165, KO:K09165, KO:K09165, KO:K09165, KO... nmdc:wfmgan-11-6pq4bd13.1_044073_3_326, nmdc:w...
2 nmdc:dobj-11-sx7cyr58 AAAEPFHTAR 28.081336 subtilisin family serine protease COG1404 nmdc:wfmgan-11-21ms6426.1_002328_1282_2322
... ... ... ... ... ... ...
8599 nmdc:dobj-11-sx7cyr58 YYADISAWLVDVR 27.222319 hypothetical protein Hypo-rule applied nmdc:wfmgan-11-21ms6426.1_128887_2_331
8600 nmdc:dobj-11-xm1yjv87 YYGFNWGR 23.220936 hypothetical protein Hypo-rule applied nmdc:wfmgan-11-dvvh0237.1_120316_3_251
8601 nmdc:dobj-11-7psagy79 YYSELFGWQVHEVMPTYGLVHTEAGGK 27.015979 uncharacterized protein KO:K06996 nmdc:wfmgan-11-v53k8v91.1_045759_189_458

8602 rows × 6 columns

De-log the peptide abundances, sum the abundances for each razor protein, and log2-transform the rolled-up protein abundances.

In [26]:
# De-log abundance
forward_peptides['StatMomentsAreaNorm'] = 2**forward_peptides['StatMomentsAreaLog+Norm']
forward_peptides.drop('StatMomentsAreaLog+Norm', axis=1, inplace=True)

# Sum abundance for each razor protein in each dataset
prot_info = forward_peptides.columns[~forward_peptides.columns.isin(['StatMomentsAreaNorm', 'Peptide Sequence with Mods'])].tolist()
protein_abundances = forward_peptides.groupby(prot_info)['StatMomentsAreaNorm'].sum().reset_index()

# Re-log2-transform abundance
protein_abundances['StatMomentsAreaLog+Norm'] = np.log2(protein_abundances['StatMomentsAreaNorm'])
protein_abundances.drop('StatMomentsAreaNorm', axis=1, inplace=True)

del prot_info, forward_peptides

protein_abundances
Out[26]:
processed_DO_id product product_source razor StatMomentsAreaLog+Norm
0 nmdc:dobj-11-2k6s1505 DNA-binding protein HU-beta, DNA-binding prote... KO:K03530, KO:K03530 nmdc:wfmgan-11-8zxjs882.1_040719_104_331, nmdc... 25.187616
1 nmdc:dobj-11-2k6s1505 DNA-binding transcriptional LysR family regula... COG0583, COG0583, COG0583, COG0583 nmdc:wfmgan-11-21ms6426.1_010178_1_894, nmdc:w... 26.332428
2 nmdc:dobj-11-2k6s1505 DNA-directed RNA polymerase subunit beta' KO:K03046 nmdc:wfmgan-11-yjeh0m30.1_003176_3_1979 27.042324
... ... ... ... ... ...
6450 nmdc:dobj-11-zn9eyg46 peptidoglycan/xylan/chitin deacetylase (PgdA/C... COG0726, COG0726, COG0726 nmdc:wfmgan-11-6pq4bd13.1_001294_1_411, nmdc:w... 25.123978
6451 nmdc:dobj-11-zn9eyg46 photosystem II CP43 chlorophyll apoprotein, ph... KO:K02705, KO:K02705, KO:K02705 nmdc:wfmgan-11-hyze3t58.1_07201_3_464, nmdc:wf... 28.599485
6452 nmdc:dobj-11-zn9eyg46 predicted porin COG3203 nmdc:wfmgan-11-hyze3t58.1_02099_286_783 29.142025

6453 rows × 5 columns
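As a toy check of the rollup arithmetic above (column names and values are hypothetical): two peptides at log2 abundances 3 and 4 sum to 2³ + 2⁴ = 24 in linear space, so the rolled-up protein abundance is log2(24) ≈ 4.585.

```python
import numpy as np
import pandas as pd

# Toy check of the rollup: de-log, sum per razor protein, re-log (values hypothetical)
peptides = pd.DataFrame({
    "razor": ["protA", "protA", "protB"],
    "log2_abund": [3.0, 4.0, 5.0],
})
peptides["linear"] = 2 ** peptides["log2_abund"]    # 8, 16, 32
rolled = peptides.groupby("razor")["linear"].sum()  # protA: 24, protB: 32
rolled_log2 = np.log2(rolled)
print(rolled_log2.round(3))
# protA ≈ 4.585, protB = 5.0
```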

Final aggregated table of relative protein abundance¶

Reformat these results into a proteomic table, where each row indicates a protein and each column indicates a sample/dataset. The values within are log2-transformed, median-normalized relative abundance values. This table, or the long-form version above, can be used in further proteomic analyses.

In [27]:
# Pivot to a wide crosstab: one row per razor protein, one column per dataset
aggregated_proteomic_output = protein_abundances.pivot(index='razor', columns='processed_DO_id', values='StatMomentsAreaLog+Norm')
aggregated_proteomic_output.columns.name = None

aggregated_proteomic_output
Out[27]:
nmdc:dobj-11-2k6s1505 nmdc:dobj-11-2szyc560 nmdc:dobj-11-301yh759 nmdc:dobj-11-3378mg86 nmdc:dobj-11-3ak2bc31 nmdc:dobj-11-4bwzhs42 nmdc:dobj-11-7psagy79 nmdc:dobj-11-7r5yea49 nmdc:dobj-11-7x21d450 nmdc:dobj-11-bxyvx506 ... nmdc:dobj-11-tk9hc081 nmdc:dobj-11-vjmd5h28 nmdc:dobj-11-wcn69x37 nmdc:dobj-11-xm1yjv87 nmdc:dobj-11-y2aefn18 nmdc:dobj-11-yrbs4v96 nmdc:dobj-11-z47wnf07 nmdc:dobj-11-z8hc0j30 nmdc:dobj-11-zhz66007 nmdc:dobj-11-zn9eyg46
razor
nmdc:wfmgan-11-21ms6426.1_000036_702_1004 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
nmdc:wfmgan-11-21ms6426.1_000053_3_563, nmdc:wfmgan-11-p3dkkz54.1_008013_251_1072 NaN NaN NaN NaN NaN NaN 30.499817 NaN NaN NaN ... 29.083049 32.948118 NaN NaN NaN NaN NaN NaN NaN NaN
nmdc:wfmgan-11-21ms6426.1_000057_4146_5369, nmdc:wfmgan-11-p3dkkz54.1_002120_389_1612, nmdc:wfmgan-11-z0raab34.1_009908_3_737 NaN 30.067875 28.391404 28.935731 NaN NaN 30.458277 NaN NaN 28.7232 ... NaN 30.467756 26.462791 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
nmdc:wfmgan-11-z0raab34.1_148395_2_283 NaN 31.373664 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
nmdc:wfmgan-11-z0raab34.1_160266_3_278 NaN 27.750197 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
nmdc:wfmgan-11-z0raab34.1_167280_3_278 NaN 24.717223 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3442 rows × 31 columns

The generated protein table can be used as input to the software pmartR, which performs statistical analyses such as ANOVA and independence of missing data (IMD) tests. In this case, the aggregated proteomics table (aggregated_proteomic_output) would be equivalent to pmartR's e_data and the peptide-to-protein-to-gene mappings (razormapping) would be equivalent to pmartR's e_meta.


pmartR requires sample metadata to parameterize analyses and interpret the data. For this example dataset, an API call will capture the biosample metadata (sample_metadata) that would be equivalent to pmartR's f_data.

Gather biosample metadata via the nmdc_api_utilities BiosampleSearch object, using its get_record_by_filter() function with the biosample_ids associated with each output in proteomic_output_df.

In [28]:
from nmdc_api_utilities.biosample_search import BiosampleSearch
# Since we are querying the biosample collection, we need to create an instance of it
bs_client = BiosampleSearch(env=ENV)

# Create a list of lists of ids to query the BiosampleSearch object
chunked_list = split_list(proteomic_output_df['biosample_id'].unique().tolist())
biosamples = []
# Loop through the chunked list of ids
for chunk in chunked_list:
    # Build the filter: match records whose id field is in this chunk of ids from proteomic_output_df
    filter_list = dp_client._string_mongo_list(chunk)
    filter_str = f'{{"id": {{"$in": {filter_list}}}}}'
    # Get the results
    biosamples += bs_client.get_record_by_filter(filter=filter_str, fields="id,depth.has_numeric_value", max_page_size=100, all_pages=True)

#normalize the json columns
biosamples = pd.json_normalize(biosamples).drop_duplicates().rename(columns={'id':'biosample_id'})

#merge biosample and processed data object information
sample_metadata = biosamples.merge(proteomic_output_df[['processed_DO_id','biosample_id']],on='biosample_id')
del biosamples

sample_metadata
Out[28]:
biosample_id depth.has_numeric_value processed_DO_id
0 nmdc:bsm-13-0jw5n594 0.0 nmdc:dobj-11-pxnhdj84
1 nmdc:bsm-13-13145k83 0.2 nmdc:dobj-11-xm1yjv87
2 nmdc:bsm-13-1p0tct86 0.2 nmdc:dobj-11-bxyvx506
... ... ... ...
28 nmdc:bsm-13-xvf06c81 0.4 nmdc:dobj-11-m808mj05
29 nmdc:bsm-13-yjp2dx65 0.1 nmdc:dobj-11-301yh759
30 nmdc:bsm-13-zms1jq91 0.5 nmdc:dobj-11-sx7cyr58

31 rows × 3 columns
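With the e_data, e_meta, and f_data equivalents assembled, a minimal sketch of exporting them to CSV for an R/pmartR session might look like the following; the filenames and the toy stand-in tables are hypothetical, and in the notebook the real aggregated_proteomic_output, razormapping, and sample_metadata tables would be written instead.

```python
import pandas as pd

# Toy stand-ins for the notebook's three tables (values hypothetical)
e_data = pd.DataFrame({"razor": ["protA"], "nmdc:dobj-11-2k6s1505": [25.2]}).set_index("razor")
e_meta = pd.DataFrame({"Peptide Sequence with Mods": ["AAAAAIETASR"], "razor": ["protA"]})
f_data = pd.DataFrame({"processed_DO_id": ["nmdc:dobj-11-2k6s1505"], "depth.has_numeric_value": [0.0]})

# Write each table to CSV for loading into an R/pmartR session
e_data.to_csv("e_data.csv")                # protein x sample abundance matrix
e_meta.to_csv("e_meta.csv", index=False)   # peptide -> protein -> annotation mapping
f_data.to_csv("f_data.csv", index=False)   # sample metadata
```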