Workflow

FLASH: Lengthens reads by merging overlapping paired-end reads
BLAST: Aligns contigs to a BLAST database
Merge: Concatenates samples
Taxon: Performs taxonomic assignment
Trim Galore: Performs quality control for reads
Read Alignment: Aligns reads to contigs
HMMER: Performs KEGG BRITE analysis

The submodules of the workflow are configured in metaGenePipe.json. The defaults are as follows:

"metaGenPipe.flashBoolean": true,
"metaGenPipe.blastBoolean": false,
"metaGenPipe.concatenateBoolean": true,
"metaGenPipe.taxonBoolean": true,
"metaGenPipe.trimmomaticBoolean": false,
"metaGenPipe.trimGaloreBoolean": true,
"metaGenPipe.megahitBoolean": true,
"metaGenPipe.mapreadsBoolean": true,
"metaGenPipe.hmmerBoolean": true,

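As a minimal sketch of how these switches could be toggled from a script, the Python snippet below overrides selected submodule flags in a JSON inputs file. It assumes the file is a flat JSON object whose keys look exactly like the ones listed above; the helper names set_submodules and rewrite_config are hypothetical and not part of the pipeline.

```python
import json
from pathlib import Path


def set_submodules(config: dict, **flags: bool) -> dict:
    """Return a copy of config with the given submodule switches overridden.

    Keyword names are the part between "metaGenPipe." and "Boolean",
    so blast=True sets "metaGenPipe.blastBoolean".
    """
    updated = dict(config)
    for name, enabled in flags.items():
        key = f"metaGenPipe.{name}Boolean"
        if key not in updated:
            # Refuse to invent switches that the inputs file does not define.
            raise KeyError(f"unknown submodule switch: {key}")
        updated[key] = bool(enabled)
    return updated


def rewrite_config(src: Path, dst: Path, **flags: bool) -> None:
    """Load src, apply the overrides, and write the result to dst (assumed paths)."""
    config = json.loads(src.read_text())
    dst.write_text(json.dumps(set_submodules(config, **flags), indent=2))
```

For example, set_submodules(config, blast=True, flash=False) would enable the Blast step and disable Flash while leaving every other key untouched.
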
Output

There are four main output folders: qc (quality control), assembly, readalignment, and geneprediction. There is also one intermediary folder, data, which contains the samples passed to assembly after trimming with TrimGalore and, if specified, concatenation for co-assembly.
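
A quick way to check that a run produced the folders named above is sketched below in Python. The function name missing_output_folders is hypothetical, and the check only looks at folder names, not at their contents.

```python
from pathlib import Path

# Folder names taken from the description above.
MAIN_FOLDERS = ("qc", "assembly", "readalignment", "geneprediction")
INTERMEDIARY_FOLDERS = ("data",)


def missing_output_folders(output_dir: Path) -> list[str]:
    """Return the expected folder names that are absent from output_dir."""
    expected = MAIN_FOLDERS + INTERMEDIARY_FOLDERS
    return [name for name in expected if not (output_dir / name).is_dir()]
```

An empty list means all five folders are present.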

Quality control

trimmed
- {sampleName}.T{G|T}_R{1|2}.fq.gz: Trimmed output for each of the individual sample files, TG if the chosen trimmer is TrimGalore, and TT if it is Trimmomatic
fastqc
- {sampleName}.T{G|T}_R{1|2}_fastqc.zip: FastQC output for each of the individual sample files
multiqc_report.html: Combined MultiQC report of all FastQC files
flash
- {sampleName}.extendedFrags.fastq: The merged reads
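
The trimmed file naming pattern above can be decoded with a small regular expression. The following Python sketch is illustrative only: the names TRIMMED_NAME, TrimmedFile, and parse_trimmed_name are hypothetical, and the pattern assumes sample names never end in a segment that itself matches .T{G|T}_R{1|2}.

```python
import re
from typing import NamedTuple, Optional

# {sampleName}.T{G|T}_R{1|2}.fq.gz, as described in the list above.
TRIMMED_NAME = re.compile(
    r"^(?P<sample>.+)\.T(?P<trimmer>[GT])_R(?P<mate>[12])\.fq\.gz$"
)
TRIMMERS = {"G": "TrimGalore", "T": "Trimmomatic"}


class TrimmedFile(NamedTuple):
    sample: str
    trimmer: str
    mate: int


def parse_trimmed_name(filename: str) -> Optional[TrimmedFile]:
    """Split a trimmed read file name into sample, trimmer, and mate number."""
    match = TRIMMED_NAME.match(filename)
    if match is None:
        return None  # not a trimmed read file
    return TrimmedFile(
        match["sample"], TRIMMERS[match["trimmer"]], int(match["mate"])
    )
```

Names that do not follow the pattern, such as the flash output files, yield None.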

Data

{sampleName}_R{1|2}.fq.gz: Sample files after trimming and/or concatenation for co-assembly. If files are concatenated for co-assembly, the sample name is set to "combined"

Assembly

{sampleName}.megahit.contigs.fa: Final assembled contigs
{sampleName}.{kmer}.fastg: Assembly graph for the {kmer} assembled contigs, where {kmer} is the k-mer size whose contig file is the largest in the intermediate_contigs folder
intermediate_contigs: A folder containing all intermediate assembled contigs, named {sampleName}.contigs.k{kmer}.fa
{sampleName}.megahit.blast.out: Raw BLAST results for the contigs
{sampleName}.megahit.blast.parsed: BLAST results parsed into TSV format for easy viewing
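
The k-mer selection rule described above (largest intermediate contig file) can be reproduced with a few lines of Python. This is a sketch under stated assumptions, not the pipeline's own code: it assumes the intermediate files are named as in the example output tree (for example combined.contigs.k57.fa), and the function name largest_kmer and its tie-breaking (first file in sorted name order) are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Matches names such as combined.contigs.k57.fa and captures the k-mer size.
KMER_FILE = re.compile(r"\.contigs\.k(?P<kmer>\d+)\.fa$")


def largest_kmer(intermediate_dir: Path) -> Optional[int]:
    """Return the k-mer size whose intermediate contig file is largest on disk."""
    best_kmer, best_size = None, -1
    for path in sorted(intermediate_dir.iterdir()):
        match = KMER_FILE.search(path.name)
        if match is None:
            continue  # skip anything that is not an intermediate contig file
        size = path.stat().st_size
        if size > best_size:
            best_kmer, best_size = int(match["kmer"]), size
    return best_kmer
```

The function returns None when the folder holds no matching files.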

Read alignment

{sampleName}.T{G|T}.flagstat.txt: Samtools flagstat output. Reports statistics on alignment of reads back to assembled contigs
{sampleName}.T{G|T}.sam: Alignment of reads back to contigs in SAM format
{sampleName}.T{G|T}.sorted.bam: Alignment of reads back to contigs in BAM format

Gene prediction

{sampleName}.megahit.proteins.fa.xml.out.xml: XML output from aligning the predicted amino acid sequences to an NCBI database (we chose Swiss-Prot, but any BLAST database can be substituted)
diamond
- {sampleName}.megahit.proteins.fa.xml.out: DIAMOND alignment output for the predicted proteins
hmmer
- {sampleName}.megahit.proteins.hmmer.out: Raw HMMER output aligned to Koalafam profiles
- {sampleName}.megahit.proteins.hmmer.tblout: Parsed HMMER output aligned to Koalafam profiles
prodigal
- {sampleName}.megahit.gene_coordinates.gbk: Gene coordinates file (GenBank-like format)
- {sampleName}.megahit.nucl_genes.fa: Predicted gene nucleotide sequences
- {sampleName}.megahit.proteins.fa: Predicted gene amino acid sequences
- {sampleName}.megahit.starts.txt: Prodigal starts file

Classification

taxon - These files are produced for each sample (pair of read files) if the inputs are assembled separately (as opposed to co-assembly).
- LevelA.brite.counts.tsv: Level A KEGG BRITE hierarchy gene counts
- LevelB.brite.counts.tsv: Level B KEGG BRITE hierarchy gene counts
- LevelC.brite.counts.tsv: Level C KEGG BRITE hierarchy gene counts
- OTU.brite.tsv: Table with counts of taxonomic (organism) IDs of genes

Output Tree

Below is an example tree of the output directory:

.
├── assembly
│   ├── combined.57.fastg
│   ├── combined.megahit.blast.out
│   ├── combined.megahit.blast.parsed
│   ├── combined.megahit.contigs.fa
│   └── intermediate_contigs
│       ├── combined.contigs.k27.fa
│       ├── combined.contigs.k37.fa
│       ├── combined.contigs.k47.fa
│       ├── combined.contigs.k57.fa
│       ├── combined.contigs.k67.fa
│       ├── combined.contigs.k77.fa
│       ├── combined.contigs.k87.fa
│       └── combined.contigs.k97.fa
├── data
│   ├── combined_R1.fq.gz
│   └── combined_R2.fq.gz
├── geneprediction
│   ├── combined.megahit.proteins.fa.xml.out.xml
│   ├── diamond
│   │   └── combined.megahit.proteins.fa.xml.out
│   ├── hmmer
│   │   ├── combined.megahit.proteins.hmmer.out
│   │   └── combined.megahit.proteins.hmmer.tblout
│   └── prodigal
│       ├── combined.megahit.gene_coordinates.gbk
│       ├── combined.megahit.nucl_genes.fa
│       ├── combined.megahit.proteins.fa
│       └── combined.megahit.starts.txt
├── qc
│   ├── fastqc
│   │   ├── SRR5808831.TG_R1_fastqc.zip
│   │   ├── SRR5808831.TG_R2_fastqc.zip
│   │   ├── SRR5808882.TG_R1_fastqc.zip
│   │   └── SRR5808882.TG_R2_fastqc.zip
│   ├── flash
│   │   ├── SRR5808831.extendedFrags.fastq
│   │   └── SRR5808882.extendedFrags.fastq
│   ├── multiqc_report.html
│   └── trimmed
│       ├── SRR5808831.TG_R1.fq.gz
│       ├── SRR5808831.TG_R2.fq.gz
│       ├── SRR5808882.TG_R1.fq.gz
│       └── SRR5808882.TG_R2.fq.gz
├── readalignment
│   ├── SRR5808831.TG.flagstat.txt
│   ├── SRR5808831.TG.sam
│   ├── SRR5808831.TG.sorted.bam
│   ├── SRR5808882.TG.flagstat.txt
│   ├── SRR5808882.TG.sam
│   └── SRR5808882.TG.sorted.bam
└── taxon
    ├── LevelA.brite.counts.tsv
    ├── LevelB.brite.counts.tsv
    ├── LevelC.brite.counts.tsv
    └── OTU.brite.tsv