Metaproteomic Workflow (v1.0.0)
Summary
The metaproteomics workflow/pipeline is an end-to-end data processing workflow for protein identification and characterization using MS/MS data. Briefly, mass spectrometry instrument generated data files(.RAW) are converted to mzML, an open data format, using MSConvert. Peptide identification is achieved using MSGF+ and the associated metagenomic information in the FASTA (protein sequences) file format. Intensity information for identified species is extracted using MASIC and combined with protein information.
Workflow Diagram
Workflow Dependencies
Third party software
|----------------------------|------------------------------------------|
| MSGFPlus | v20190628 |
| Mzid-To-Tsv-Converter | v1.3.3 |
| PeptideHitResultsProcessor | v1.5.7130 |
| pwiz-bin-windows | x86_64-vc141-release-3_0_20149_b73158966 |
| MASIC | v3.0.7235 |
| sqlite-netFx-full-source | 1.0.111.0 |
| Conda | (3-clause BSD) |
| | |
Workflow Availability
The workflow is available in GitHub: https://github.com/microbiomedata/metaPro
The container is available at Docker Hub (microbiomedata/mepro): https://hub.docker.com/r/microbiomedata/mepro
Inputs
.raw, metagenome, parameter files : MSGFplus & MASIC, contaminant_file
Outputs
Processing multiple datasets.
.
├── Data/
├── FDR_table.csv
├── Plots/
├── dataset_job_map.csv
├── peak_area_crosstab_by_dataset_id.csv
├── protein_peptide_map.csv
├── specID_table.csv
└── spectra_count_crosstab_by_dataset_id.csv
Processing single FICUS dataset.
metadatafile, [Example](https://jsonblob.com/400362ef-c70c-11ea-bf3d-05dfba40675b)
| Keys | Values |
|--------------------|--------------------------------------------------------------------------|
| id | str: "md5 hash of $github_url+$started_at_time+$ended_at_time" |
| name | str: "Metagenome:$proposal_extid_$sample_extid:$sequencing_project_extid |
| was_informed_by | str: "GOLD_Project_ID" |
| started_at_time | str: "metaPro start-time" |
| ended_at_time | str: "metaPro end-time" |
| type | str: tag: "nmdc:metaPro" |
| execution_resource | str: infrastructure name to run metaPro |
| git_url | str: "url to a release" |
| dataset_id | str: "dataset's unique-id at EMSL" |
| dataset_name | str: "dataset's name at EMSL" |
| has_inputs | json_obj |
| has_outputs | json_obj |
| stats | json_obj |
has_inputs :
| MSMS_out | str: file_name \|file_size \|checksum |
| metagenome_file | str: file_name \|file_size \|checksum \|
int: entry_count(#of gene sequences) \|
int: duplicate_count(#of duplicate gene sequences) |
| parameter_files | str: for_masic/for_msgfplus : file_name \|file_size \|checksum
parameter file used for peptide identification search
| Contaminant_file | str: file_name \|file_size \|checksum
(FASTA containing common contaminants in proteomics)
has_outputs:
| collapsed_fasta_file | str: file_name \|file_size \|checksum |
| resultant_file | str: file_name \|file_size \|checksum |
| data_out_table | str: file_name \|file_size \|checksum |
stats:
| from_collapsed_fasta | int: entry_count(#of unique gene sequences) |
| from_resultant_file | int: total_protein_count |
| from_data_out_table | int: PSM(# of MS/MS spectra matched to a peptide sequence at 5% false discovery rate (FDR)
float: PSM_identification_rate(# of peptide matching MS/MS spectra divided by total spectra searched (5% FDR)
int: unique_peptide_seq_count(# of unique peptide sequences observed in pipeline analysis 5% FDR)
int: first_hit_protein_count(# of proteins observed assuming single peptide-to-protein relationships)
int: mean_peptide_count(Unique peptide sequences matching to each identified protein.)
data_out_table
| DatasetName | PeptideSequence | FirstHitProtein | SpectralCount | sum(MasicAbundance) | GeneCount | FullGeneList | FirstHitDescription | DescriptionList | min(Qvalue) |
collapsed_fasta_file
resultant_file
Requirements for Execution
Docker or other Container Runtime
Version History
1.0.0
Point of contact
Package maintainer: Anubhav <anubhav@pnnl.gov>