Metaproteomic Workflow (v1.0.0)

Summary

The metaproteomics workflow is an end-to-end data processing pipeline for protein identification and characterization using MS/MS data. Briefly, data files generated by the mass spectrometry instrument (.RAW) are converted to mzML, an open data format, using MSConvert. Peptide identification is achieved using MSGF+ and the associated metagenomic information in the FASTA (protein sequences) file format. Intensity information for identified species is extracted using MASIC and combined with protein information.
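
The stage order described above can be sketched as a list of command lines. This is only an illustration: the function name `build_commands`, the executable names, the jar location, and the output layout are assumptions, and the parameter files the workflow passes to MSGF+ and MASIC are left out. The commands are built but not executed.

```python
from pathlib import Path


def build_commands(raw_file, fasta_file, out_dir):
    """Return illustrative command lines for the three main stages, in order.

    Executable names and paths are assumptions; parameter files are omitted.
    """
    raw = Path(raw_file)
    out = Path(out_dir)
    mzml = out / (raw.stem + ".mzML")
    mzid = out / (raw.stem + ".mzid")
    return [
        # 1. Convert the vendor .raw file to the open mzML format.
        ["msconvert", str(raw), "--mzML", "-o", str(out)],
        # 2. Search the spectra against the metagenome-derived protein FASTA.
        ["java", "-jar", "MSGFPlus.jar",
         "-s", str(mzml), "-d", str(fasta_file), "-o", str(mzid)],
        # 3. Extract intensity information from the original .raw file.
        ["MASIC_Console.exe", f"/I:{raw}", f"/O:{out}"],
    ]


if __name__ == "__main__":
    for command in build_commands("sample01.raw", "metagenome.faa", "out"):
        print(" ".join(command))
```

Each entry could be handed to `subprocess.run` inside the container, where the tools are expected to be installed.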

Workflow Diagram

Workflow Dependencies

Third party software

| Software                   | Version / license                        |
|----------------------------|------------------------------------------|
| MSGFPlus                   | v20190628                                |
| Mzid-To-Tsv-Converter      | v1.3.3                                   |
| PeptideHitResultsProcessor | v1.5.7130                                |
| pwiz-bin-windows           | x86_64-vc141-release-3_0_20149_b73158966 |
| MASIC                      | v3.0.7235                                |
| sqlite-netFx-full-source   | 1.0.111.0                                |
| Conda                      | (3-clause BSD)                           |

Workflow Availability

The workflow is available on GitHub: https://github.com/microbiomedata/metaPro

The container is available at Docker Hub (microbiomedata/mepro): https://hub.docker.com/r/microbiomedata/mepro

Inputs

The workflow takes .raw files, a metagenome, parameter files for MSGFplus and MASIC, and a contaminant_file.

Outputs

Processing multiple datasets.

.
├── Data/
├── FDR_table.csv
├── Plots/
├── dataset_job_map.csv
├── peak_area_crosstab_by_dataset_id.csv
├── protein_peptide_map.csv
├── specID_table.csv
└── spectra_count_crosstab_by_dataset_id.csv

Processing a single FICUS dataset.

metadatafile, [Example](https://jsonblob.com/400362ef-c70c-11ea-bf3d-05dfba40675b)

| Keys               | Values                                                                   |
|--------------------|--------------------------------------------------------------------------|
| id                 | str: "md5 hash of $github_url+$started_at_time+$ended_at_time"           |
| name               | str: "Metagenome:$proposal_extid_$sample_extid:$sequencing_project_extid" |
| was_informed_by    | str: "GOLD_Project_ID"                                                   |
| started_at_time    | str: "metaPro start-time"                                                |
| ended_at_time      | str: "metaPro end-time"                                                  |
| type               | str: tag: "nmdc:metaPro"                                                 |
| execution_resource | str: infrastructure name to run metaPro                                  |
| git_url            | str: "url to a release"                                                  |
| dataset_id         | str: "dataset's unique-id at EMSL"                                       |
| dataset_name       | str: "dataset's name at EMSL"                                            |
| has_inputs         | json_obj                                                                 |
| has_outputs        | json_obj                                                                 |
| stats              | json_obj                                                                 |
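
The `id` value in the table above is described as an md5 hash of the release URL and the two timestamps. A minimal sketch of that computation follows; the function name `activity_id`, the plain string concatenation without separators, and the UTF-8 encoding are assumptions, since the table does not specify them.

```python
import hashlib


def activity_id(git_url: str, started_at_time: str, ended_at_time: str) -> str:
    """Return an md5 hex digest of the three fields joined without separators.

    The concatenation order follows the table; the lack of a separator and
    the UTF-8 encoding are assumptions.
    """
    payload = (git_url + started_at_time + ended_at_time).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    print(activity_id(
        "https://github.com/microbiomedata/metaPro",
        "2020-07-16T08:00:00",
        "2020-07-16T09:30:00",
    ))
```

Because the digest depends on both timestamps, two runs of the same release get different identifiers.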

has_inputs:

| Keys             | Values                                                                                                                                         |
|------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
| MSMS_out         | str: file_name \|file_size \|checksum                                                                                                          |
| metagenome_file  | str: file_name \|file_size \|checksum \| int: entry_count (# of gene sequences) \| int: duplicate_count (# of duplicate gene sequences)        |
| parameter_files  | str: for_masic/for_msgfplus : file_name \|file_size \|checksum (parameter file used for peptide identification search)                         |
| Contaminant_file | str: file_name \|file_size \|checksum (FASTA containing common contaminants in proteomics)                                                     |
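
The `file_name|file_size|checksum` triple and the two FASTA counts listed for the inputs could be derived roughly as below. The helper names `describe_file` and `fasta_counts` are hypothetical, md5 is assumed as the checksum algorithm, and `duplicate_count` is assumed to mean the number of entries whose sequence already appeared earlier in the file.

```python
import hashlib
import os


def describe_file(path):
    """Return (file_name, file_size_in_bytes, md5_checksum) for one file."""
    digest = hashlib.md5()  # checksum algorithm is an assumption
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large .raw or FASTA files fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return os.path.basename(path), os.path.getsize(path), digest.hexdigest()


def fasta_counts(path):
    """Return (entry_count, duplicate_count) for a FASTA file."""
    sequences = []
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A header line closes the previous entry, if any.
                if current is not None:
                    sequences.append("".join(current))
                current = []
            elif line and current is not None:
                current.append(line)
    if current is not None:
        sequences.append("".join(current))
    entry_count = len(sequences)
    # Entries beyond the first occurrence of each sequence count as duplicates.
    duplicate_count = entry_count - len(set(sequences))
    return entry_count, duplicate_count
```

Under these assumptions, `entry_count - duplicate_count` gives the number of unique sequences.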

has_outputs:
| Keys                 | Values                                |
|----------------------|---------------------------------------|
| collapsed_fasta_file | str: file_name \|file_size \|checksum |
| resultant_file       | str: file_name \|file_size \|checksum |
| data_out_table       | str: file_name \|file_size \|checksum |

stats:
| Keys                 | Values                                                                                                                            |
|----------------------|-----------------------------------------------------------------------------------------------------------------------------------|
| from_collapsed_fasta | int: entry_count (# of unique gene sequences)                                                                                     |
| from_resultant_file  | int: total_protein_count                                                                                                          |
| from_data_out_table  | int: PSM (# of MS/MS spectra matched to a peptide sequence at 5% false discovery rate (FDR))                                      |
|                      | float: PSM_identification_rate (# of peptide-matching MS/MS spectra divided by total spectra searched, at 5% FDR)                 |
|                      | int: unique_peptide_seq_count (# of unique peptide sequences observed in pipeline analysis at 5% FDR)                             |
|                      | int: first_hit_protein_count (# of proteins observed assuming single peptide-to-protein relationships)                            |
|                      | int: mean_peptide_count (unique peptide sequences matching each identified protein)                                               |
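
As an illustration of the `from_data_out_table` statistics, the sketch below computes them from a list of peptide-spectrum matches. The input layout (dicts with `peptide`, `first_hit_protein`, and `qvalue` keys), the function name `summarize`, and the use of a q-value cutoff of 0.05 to apply the 5% FDR are all assumptions. The identification rate is returned as a fraction, and `mean_peptide_count` is left unrounded even though the stats table types it as int.

```python
from collections import defaultdict


def summarize(psms, total_spectra, fdr=0.05):
    """Compute summary statistics from PSM records (hypothetical layout)."""
    # Keep only matches that pass the assumed q-value cutoff.
    passing = [p for p in psms if p["qvalue"] <= fdr]

    # Collect the distinct peptides assigned to each first-hit protein.
    peptides_by_protein = defaultdict(set)
    for p in passing:
        peptides_by_protein[p["first_hit_protein"]].add(p["peptide"])

    protein_count = len(peptides_by_protein)
    peptide_total = sum(len(s) for s in peptides_by_protein.values())
    return {
        "PSM": len(passing),
        "PSM_identification_rate":
            len(passing) / total_spectra if total_spectra else 0.0,
        "unique_peptide_seq_count": len({p["peptide"] for p in passing}),
        "first_hit_protein_count": protein_count,
        "mean_peptide_count":
            peptide_total / protein_count if protein_count else 0.0,
    }
```

Keying each peptide to a single first-hit protein mirrors the "single peptide-to-protein relationships" wording in the table.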

data_out_table

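| DatasetName | PeptideSequence | FirstHitProtein | SpectralCount | sum(MasicAbundance) | GeneCount | FullGeneList | FirstHitDescription | DescriptionList | min(Qvalue) |
|-------------|-----------------|-----------------|---------------|---------------------|-----------|--------------|---------------------|-----------------|-------------|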
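
The aggregate columns above (`SpectralCount`, `sum(MasicAbundance)`, `min(Qvalue)`, `GeneCount`) suggest one row per dataset and peptide sequence. The sketch below shows one way such rows could be built from individual matches. The grouping key, the input field `Proteins`, and the function name `build_rows` are assumptions, and the two description columns are omitted.

```python
def build_rows(psms):
    """Collapse PSM records into one row per (DatasetName, PeptideSequence)."""
    groups = {}
    for psm in psms:
        key = (psm["DatasetName"], psm["PeptideSequence"])
        row = groups.setdefault(key, {
            "DatasetName": key[0],
            "PeptideSequence": key[1],
            # Assumed: the first listed protein is the first hit.
            "FirstHitProtein": psm["Proteins"][0],
            "SpectralCount": 0,
            "sum(MasicAbundance)": 0.0,
            "FullGeneList": [],
            "min(Qvalue)": psm["Qvalue"],
        })
        row["SpectralCount"] += 1
        row["sum(MasicAbundance)"] += psm["MasicAbundance"]
        row["min(Qvalue)"] = min(row["min(Qvalue)"], psm["Qvalue"])
        # Keep each matching gene once, in order of first appearance.
        for protein in psm["Proteins"]:
            if protein not in row["FullGeneList"]:
                row["FullGeneList"].append(protein)

    rows = []
    for row in groups.values():
        row["GeneCount"] = len(row["FullGeneList"])
        rows.append(row)
    return rows
```

The resulting dicts can be written out with `csv.DictWriter` if a flat file is needed.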

collapsed_fasta_file
resultant_file

Requirements for Execution

Docker or other Container Runtime

Version History

1.0.0

Point of contact

Package maintainer: Anubhav <anubhav@pnnl.gov>