Metagenome Assembled Genomes Workflow (v1.0.2)

Workflow Overview

The workflow is based on the IMG metagenome binning pipeline and has been modified specifically for the NMDC project. For all processed metagenomes, it classifies contigs into bins using MetaBat2. Next, the bins are refined using the functional annotation file (GFF) from the Metagenome Annotation workflow and optional contig lineage information. CheckM then evaluates the completeness and contamination of each bin, and each bin is assigned a quality level (High Quality (HQ), Medium Quality (MQ), or Low Quality (LQ)) based on MIMAG standards. Finally, GTDB-Tk assigns a lineage to the HQ and MQ bins.
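The quality assignment can be illustrated with a minimal sketch. It assumes the commonly cited MIMAG completeness and contamination cutoffs and ignores the rRNA and tRNA criteria that MIMAG also requires for HQ bins; the function name is hypothetical, and the workflow's own implementation may differ.

```python
def mimag_quality(completeness: float, contamination: float) -> str:
    """Return 'HQ', 'MQ', or 'LQ' from CheckM-style percentages (0-100).

    Assumed cutoffs (MIMAG, completeness/contamination only):
      HQ: completeness > 90 and contamination < 5
      MQ: completeness >= 50 and contamination < 10
      LQ: everything else (a simplification made for this sketch)
    """
    if completeness > 90 and contamination < 5:
        return "HQ"
    if completeness >= 50 and contamination < 10:
        return "MQ"
    return "LQ"


if __name__ == "__main__":
    for comp, cont in [(95.0, 2.0), (60.0, 3.0), (40.0, 1.0)]:
        print(comp, cont, mimag_quality(comp, cont))
```

Only bins labeled HQ or MQ by a rule of this kind would be passed on to GTDB-Tk for lineage assignment.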

Workflow Availability

The workflow from GitHub uses the listed Docker images to run all third-party tools. The workflow is available on GitHub at https://github.com/microbiomedata/metaMAGs and the corresponding Docker image is available on DockerHub at https://hub.docker.com/r/microbiomedata/nmdc_mbin

Requirements for Execution

(recommended tools are given in parentheses):

WDL-capable Workflow Execution Tool (Cromwell)
Container Runtime that can load Docker images (Docker v2.1.0.3 or higher)

Hardware Requirements

Disk space: > 27 GB for the CheckM and GTDB-Tk databases
Memory: ~120 GB for GTDB-Tk
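A quick pre-flight check against these figures might look like the sketch below. The function names and the warning wording are hypothetical, and the memory probe relies on `os.sysconf`, which is assumed to be available (Linux and most Unix systems).

```python
import os
import shutil
from typing import List

GB = 1024 ** 3
MIN_FREE_DISK_GB = 27   # disk space needed for the CheckM and GTDB-Tk databases
MIN_MEMORY_GB = 120     # approximate memory needed by GTDB-Tk


def resource_warnings(free_disk_bytes: int, total_mem_bytes: int) -> List[str]:
    """Compare measured resources with the documented requirements."""
    warnings = []
    if free_disk_bytes <= MIN_FREE_DISK_GB * GB:
        warnings.append(f"free disk space is not above {MIN_FREE_DISK_GB} GB")
    if total_mem_bytes < MIN_MEMORY_GB * GB:
        warnings.append(f"total memory is below ~{MIN_MEMORY_GB} GB")
    return warnings


def probe_host(db_dir: str = ".") -> List[str]:
    """Measure this host (Unix only) and return any warnings."""
    free = shutil.disk_usage(db_dir).free
    total_mem = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return resource_warnings(free, total_mem)


if __name__ == "__main__":
    for message in probe_host():
        print("WARNING:", message)
```

If memory is the limiting factor, the pplacer thread count and the optional scratch directory described under Input are the documented ways to reduce it.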

Workflow Dependencies

Third party software (These are included in the Docker image.)

Requisite databases

The GTDB-Tk database must be downloaded and installed. The CheckM database, a 275 MB file included in the Docker image, contains the databases used for quality assessment of the binned contigs. The GTDB-Tk database (27 GB) is used to assign lineages to the binned contigs.

The following commands will download and unarchive the GTDB-Tk database:

wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/gtdbtk_r89_data.tar.gz
tar -xvzf gtdbtk_r89_data.tar.gz
mv release89 GTDBTK_DB
rm gtdbtk_r89_data.tar.gz

Sample dataset(s)

The following test dataset includes an assembled contigs file, a BAM file, and a functional annotation file: metaMAGs_test_dataset.tgz

Input

A JSON file containing the following:

the number of CPUs requested
the number of threads used by pplacer (use a lower number to reduce memory usage)
the path to the output directory
the project name
the path to the Metagenome Assembled Contig fasta file (FNA)
the path to the SAM/BAM file from mapping reads back to contigs (SAM.gz or BAM)
the path to contigs functional annotation result (GFF)
the path to the text file which contains mapping of headers between SAM or BAM and GFF (ID in SAM/FNA<tab>ID in GFF)
the path to the database directory which includes checkM_DB and GTDBTK_DB subdirectories
(optional) scratch_dir: passed to GTDB-Tk as --scratch_dir so that it swaps to disk, which reduces memory usage at the cost of a longer runtime

An example JSON file is shown below:

{
    "nmdc_mags.cpu":32,
    "nmdc_mags.pplacer_cpu":1,
    "nmdc_mags.outdir":"/path/to/output",
    "nmdc_mags.proj_name":"Ga0482263",
    "nmdc_mags.contig_file":"/path/to/Ga0482263_contigs.fna",
    "nmdc_mags.sam_file":"/path/to/pairedMapped_sorted.bam",
    "nmdc_mags.gff_file":"/path/to/Ga0482263_functional_annotation.gff",
    "nmdc_mags.map_file":"/path/to/Ga0482263_contig_names_mapping.tsv",
    "nmdc_mags.gtdbtk_database":"/path/to/GTDBTK_DB",
    "nmdc_mags.scratch_dir":"/path/to/scratch_dir"
}
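Missing commas and stray whitespace inside quoted values are easy to introduce when editing such a file by hand. The sketch below checks an input JSON before it is submitted; the function name is hypothetical, and the list of required keys is an assumption taken from the example above (with scratch_dir treated as optional), not from the workflow definition.

```python
import json
from typing import List

# Assumed required keys, taken from the example input file.
REQUIRED_KEYS = [
    "nmdc_mags.cpu",
    "nmdc_mags.pplacer_cpu",
    "nmdc_mags.outdir",
    "nmdc_mags.proj_name",
    "nmdc_mags.contig_file",
    "nmdc_mags.sam_file",
    "nmdc_mags.gff_file",
    "nmdc_mags.map_file",
    "nmdc_mags.gtdbtk_database",
]


def check_inputs(text: str) -> List[str]:
    """Return a list of problems found in the input JSON text."""
    # json.loads raises json.JSONDecodeError on syntax errors such as a missing comma.
    config = json.loads(text)
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    for key, value in config.items():
        # Leading or trailing whitespace in a path or name is almost always a typo.
        if isinstance(value, str) and value != value.strip():
            problems.append(f"stray whitespace in {key}: {value!r}")
    return problems


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        for problem in check_inputs(handle.read()):
            print(problem)
```

Run it with the path to the input JSON as its only argument; no output means no problems were detected by these two checks.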

Output

The workflow creates several output directories with many files. The main output files, the binned contig files from HQ and MQ bins, are in the hqmq-metabat-bins directory; the corresponding lineage results for the HQ and MQ bins are in the gtdbtk_output directory.
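As a minimal sketch of how the main output might be inspected, the function below counts the contigs in each bin FASTA file of the hqmq-metabat-bins directory. The function name is hypothetical, and the `.fa` extension is assumed from the example file names in this document.

```python
from pathlib import Path
from typing import Dict


def contigs_per_bin(bins_dir: str) -> Dict[str, int]:
    """Map each bin FASTA file name to its number of contigs (header lines)."""
    counts = {}
    for fasta in sorted(Path(bins_dir).glob("*.fa")):
        with open(fasta) as handle:
            # Each FASTA record starts with a '>' header line.
            counts[fasta.name] = sum(1 for line in handle if line.startswith(">"))
    return counts


if __name__ == "__main__":
    import sys
    for name, n_contigs in contigs_per_bin(sys.argv[1]).items():
        print(f"{name}\t{n_contigs}")
```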

A partial listing of the output directory is shown below:

|-- MAGs_stats.json
|-- 3300037552.bam.sorted
|-- 3300037552.depth
|-- 3300037552.depth.mapped
|-- bins.lowDepth.fa
|-- bins.tooShort.fa
|-- bins.unbinned.fa
|-- checkm-out
|   |-- bins/
|   |-- checkm.log
|   |-- lineage.ms
|   `-- storage
|-- checkm_qa.out
|-- gtdbtk_output
|   |-- align/
|   |-- classify/
|   |-- identify/
|   |-- gtdbtk.ar122.classify.tree -> classify/gtdbtk.ar122.classify.tree
|   |-- gtdbtk.ar122.markers_summary.tsv -> identify/gtdbtk.ar122.markers_summary.tsv
|   |-- gtdbtk.ar122.summary.tsv -> classify/gtdbtk.ar122.summary.tsv
|   |-- gtdbtk.bac120.classify.tree -> classify/gtdbtk.bac120.classify.tree
|   |-- gtdbtk.bac120.markers_summary.tsv -> identify/gtdbtk.bac120.markers_summary.tsv
|   |-- gtdbtk.bac120.summary.tsv -> classify/gtdbtk.bac120.summary.tsv
|   `-- ..etc
|-- hqmq-metabat-bins
|   |-- bins.11.fa
|   |-- bins.13.fa
|   `-- ... etc
|-- mbin-2020-05-24.sqlite
|-- mbin-nmdc.20200524.log
|-- metabat-bins
|   |-- bins.1.fa
|   |-- bins.10.fa
|   `-- ... etc

Below is an example of all the output directory files with descriptions to the right.

FileName/DirectoryName	Description
1781_86104.bam.sorted	sorted input bam file
1781_86104.depth	the contig depth coverage
1781_86104.depth.mapped	the name mapped contig depth coverage
MAGs_stats.json	MAGs statistics in json format
bins.lowDepth.fa	lowDepth (mean cov <1 ) filtered contigs fasta file by metaBat2
bins.tooShort.fa	tooShort (< 3kb) filtered contigs fasta file by metaBat2
bins.unbinned.fa	unbinned fasta file
metabat-bins/	initial metabat2 binning result fasta output directory
checkm-out/bins/	hmm and marker genes analysis result directory for each bin
checkm-out/checkm.log	checkm run log file
checkm-out/lineage.ms	lists the markers used to assign taxonomy and the taxonomic level to which the bin
checkm-out/storage/	intermediate file directory
checkm_qa.out	checkm statistics report
hqmq-metabat-bins/	HQ and MQ bins contigs fasta files directory
gtdbtk_output/identify/	gtdbtk marker genes identify result directory
gtdbtk_output/align/	gtdbtk genomes alignment result directory
gtdbtk_output/classify/	gtdbtk genomes classification result directory
gtdbtk_output/gtdbtk.ar122.classify.tree	archaeal reference tree in Newick format containing analyzed genomes (bins)
gtdbtk_output/gtdbtk.ar122.markers_summary.tsv	summary tsv file for gtdbtk marker genes identify from the archaeal 122 marker set
gtdbtk_output/gtdbtk.ar122.summary.tsv	summary tsv file for gtdbtk archaeal genomes (bins) classification
gtdbtk_output/gtdbtk.bac120.classify.tree	bacterial reference tree in Newick format containing analyzed genomes (bins)
gtdbtk_output/gtdbtk.bac120.markers_summary.tsv	summary tsv file for gtdbtk marker genes identify from the bacterial 120 marker set
gtdbtk_output/gtdbtk.bac120.summary.tsv	summary tsv file for gtdbtk bacterial genomes (bins) classification
gtdbtk_output/gtdbtk.bac120.filtered.tsv	a list of genomes with an insufficient number of amino acids in MSA
gtdbtk_output/gtdbtk.bac120.msa.fasta	the MSA of the user genomes (bins) and the GTDB genomes
gtdbtk_output/gtdbtk.bac120.user_msa.fasta	the MSA of the user genomes (bins) only
gtdbtk_output/gtdbtk.translation_table_summary.tsv	the translation table determined for each genome (bin)
gtdbtk_output/gtdbtk.warnings.log	gtdbtk warning message log
mbin-2021-01-31.sqlite	sqlite db file stores MAGs metadata and statistics
mbin-nmdc.20210131.log	the mbin-nmdc pipeline run log file
rc	cromwell script submit return code
script	Task run commands
script.background	Bash script to run script.submit
script.submit	cromwell submit commands
stderr	standard error, where the task writes error messages
stderr.background	standard error, where the bash script writes error messages
stdout	standard output of the task
stdout.background	standard output of the bash script
complete.mbin	the dummy file to indicate the finish of the pipeline
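The lineage assigned to each HQ or MQ bin can be read from the GTDB-Tk summary files listed above. The sketch below assumes the summary TSV has a header row with `user_genome` and `classification` columns, as in GTDB-Tk's summary format; check the header of your own files before relying on it. The function name is hypothetical.

```python
import csv
from typing import Dict


def bin_lineages(summary_tsv: str) -> Dict[str, str]:
    """Map each bin name to its GTDB-Tk classification string."""
    with open(summary_tsv, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Column names are assumed; a KeyError here means the header differs.
        return {row["user_genome"]: row["classification"] for row in reader}


if __name__ == "__main__":
    import sys
    for bin_name, lineage in bin_lineages(sys.argv[1]).items():
        print(f"{bin_name}\t{lineage}")
```

For example, pass it gtdbtk_output/gtdbtk.bac120.summary.tsv for bacterial bins or gtdbtk_output/gtdbtk.ar122.summary.tsv for archaeal bins.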

Version History

1.0.2 (release date 02/24/2021; previous versions: 1.0.1)

Point of contact

Original author: Neha Varghese <njvarghese@lbl.gov>
Package maintainer: Chienchi Lo <chienchi@lanl.gov>