Workflow
Flash: Lengthens reads by merging overlapping read pairs
Blast: Aligns contigs to Blast database
Merge: Concatenates samples
Taxon: Performs taxonomic assignment
Trim Galore: Performs quality control for reads
Read Alignment: Aligns reads to contigs
HMMER: Performs brite analysis
The submodules of the workflow are configured in metaGenePipe.json. The defaults are as follows:
"metaGenPipe.flashBoolean": true,
"metaGenPipe.blastBoolean": false,
"metaGenPipe.concatenateBoolean": true,
"metaGenPipe.taxonBoolean": true,
"metaGenPipe.trimmomaticBoolean": false,
"metaGenPipe.trimGaloreBoolean": true,
"metaGenPipe.megahitBoolean": true,
"metaGenPipe.mapreadsBoolean": true,
"metaGenPipe.hmmerBoolean": true,
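As a rough illustration of how these flags can be inspected, the sketch below wraps the defaults listed above in braces so they parse as JSON and reports which submodules are switched on. The helper name enabled_submodules is hypothetical and is not part of the pipeline; a real metaGenePipe.json contains additional keys, which this helper ignores.

```python
import json

# Defaults copied from the list above, wrapped in braces so they parse as JSON.
# Assumption: a real metaGenePipe.json holds more keys than these flags.
DEFAULTS = json.loads("""
{
  "metaGenPipe.flashBoolean": true,
  "metaGenPipe.blastBoolean": false,
  "metaGenPipe.concatenateBoolean": true,
  "metaGenPipe.taxonBoolean": true,
  "metaGenPipe.trimmomaticBoolean": false,
  "metaGenPipe.trimGaloreBoolean": true,
  "metaGenPipe.megahitBoolean": true,
  "metaGenPipe.mapreadsBoolean": true,
  "metaGenPipe.hmmerBoolean": true
}
""")


def enabled_submodules(config):
    """Return the sorted names of submodules whose '...Boolean' flag is true.

    Hypothetical helper for illustration only.
    """
    prefix, suffix = "metaGenPipe.", "Boolean"
    return sorted(
        key[len(prefix):-len(suffix)]
        for key, value in config.items()
        if key.startswith(prefix) and key.endswith(suffix) and value is True
    )


print(enabled_submodules(DEFAULTS))
```

To inspect an actual configuration, the same helper could be applied to the result of json.load on the opened metaGenePipe.json file.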
Output
There are four main output folders: qc (quality control), assembly, readalignment, and geneprediction. One intermediary folder, data, contains the samples for assembly after they have been run through TrimGalore and, if co-assembly is specified, concatenated.
Quality control
trimmed
{sampleName}.T{G|T}_R{1|2}.fq.gz: Trimmed output for each of the individual sample files, TG if the chosen trimmer is TrimGalore, and TT if it is Trimmomatic
fastqc
{sampleName}.T{G|T}_R{1|2}_fastqc.zip: Fastqc output for each of the individual sample files
multiqc_report.html: Combined report of all fastqc files
flash
{sampleName}.extendedFrags.fastq: The merged reads
Data
{sampleName}_R{1|2}.fq.gz: Sample files after trimming and/or concatenating for co-assembly. If files are concatenated for co-assembly, the sample name is set to combined
Assembly
{sampleName}.megahit.contigs.fa: Final assembled contigs
{sampleName}.{kmer}.fastg: Assembly graph for the {kmer} assembled contigs, where {kmer} is the k-mer size whose contig file in the intermediate_contigs folder is the largest
intermediate_contigs: a folder containing all intermediate assembled contigs {sampleName}.contigs.k{kmer}.fa
{sampleName}.megahit.blast.out: Raw blast results for the contigs
{sampleName}.megahit.blast.parsed: Blast results parsed to be easily viewed in tsv format
Read alignment
{sampleName}.T{G|T}.flagstat.txt: Samtools flagstat output. Reports statistics on alignment of reads back to assembled contigs
{sampleName}.T{G|T}.sam: Alignment of reads back to contigs in SAM format
{sampleName}.T{G|T}.sorted.bam: Alignment of reads back to contigs in BAM format
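The per-sample naming patterns for the quality control and read alignment outputs can be sketched as a small function. This is only an illustration built from the patterns listed above and the folder layout in the example tree; the function name expected_outputs is hypothetical and is not part of the pipeline.

```python
def expected_outputs(sample, trimmer="TG"):
    """Build the per-sample file names described above.

    Hypothetical helper: trimmer is "TG" for TrimGalore or "TT" for
    Trimmomatic, matching the T{G|T} part of the documented patterns.
    """
    if trimmer not in ("TG", "TT"):
        raise ValueError("trimmer must be 'TG' (TrimGalore) or 'TT' (Trimmomatic)")
    return {
        # One trimmed file and one FastQC archive per read direction.
        "qc/trimmed": [f"{sample}.{trimmer}_R{i}.fq.gz" for i in (1, 2)],
        "qc/fastqc": [f"{sample}.{trimmer}_R{i}_fastqc.zip" for i in (1, 2)],
        # Alignment of reads back to the assembled contigs.
        "readalignment": [
            f"{sample}.{trimmer}.{ext}"
            for ext in ("flagstat.txt", "sam", "sorted.bam")
        ],
    }


for folder, names in expected_outputs("SRR5808831").items():
    print(folder, names)
```

Comparing these names against the files actually present in an output directory is one quick way to spot a sample that did not finish a step.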
Gene prediction
{sampleName}.megahit.proteins.fa.xml.out.xml: XML output of the alignment of predicted amino acid sequences to an NCBI database (we chose swissprot, but any blast database can be substituted)
diamond
{sampleName}.megahit.proteins.fa.xml.out:
hmmer
{sampleName}.megahit.proteins.hmmer.out: Raw hmmer output aligned to Koalafam profiles
{sampleName}.megahit.proteins.hmmer.tblout: Parsed hmmer output aligned to Koalafam profiles
prodigal
{sampleName}.megahit.gene_coordinates.gbk: Gene coordinates file (Genbank like file)
{sampleName}.megahit.nucl_genes.fa: Predicted gene nucleotide sequences
{sampleName}.megahit.proteins.fa: Predicted gene amino acid sequences
{sampleName}.megahit.starts.txt: Prodigal starts file
Classification
taxon - These files are produced for each sample (pair of read files) if the inputs are assembled separately (as opposed to co-assembly).
LevelA.brite.counts.tsv: Level A KEGG Brite hierarchical gene count
LevelB.brite.counts.tsv: Level B KEGG Brite hierarchical gene count
LevelC.brite.counts.tsv: Level C KEGG Brite hierarchical gene count
OTU.brite.tsv: Table with counts of taxonomic (organism) IDs of genes
Output Tree
Below is an example tree of the output directory:
.
├── assembly
│   ├── combined.57.fastg
│   ├── combined.megahit.blast.out
│   ├── combined.megahit.blast.parsed
│   ├── combined.megahit.contigs.fa
│   └── intermediate_contigs
│       ├── combined.contigs.k27.fa
│       ├── combined.contigs.k37.fa
│       ├── combined.contigs.k47.fa
│       ├── combined.contigs.k57.fa
│       ├── combined.contigs.k67.fa
│       ├── combined.contigs.k77.fa
│       ├── combined.contigs.k87.fa
│       └── combined.contigs.k97.fa
├── data
│   ├── combined_R1.fq.gz
│   └── combined_R2.fq.gz
├── geneprediction
│   ├── combined.megahit.proteins.fa.xml.out.xml
│   ├── diamond
│   │   └── combined.megahit.proteins.fa.xml.out
│   ├── hmmer
│   │   ├── combined.megahit.proteins.hmmer.out
│   │   └── combined.megahit.proteins.hmmer.tblout
│   └── prodigal
│       ├── combined.megahit.gene_coordinates.gbk
│       ├── combined.megahit.nucl_genes.fa
│       ├── combined.megahit.proteins.fa
│       └── combined.megahit.starts.txt
├── qc
│   ├── fastqc
│   │   ├── SRR5808831.TG_R1_fastqc.zip
│   │   ├── SRR5808831.TG_R2_fastqc.zip
│   │   ├── SRR5808882.TG_R1_fastqc.zip
│   │   └── SRR5808882.TG_R2_fastqc.zip
│   ├── flash
│   │   ├── SRR5808831.extendedFrags.fastq
│   │   └── SRR5808882.extendedFrags.fastq
│   ├── multiqc_report.html
│   └── trimmed
│       ├── SRR5808831.TG_R1.fq.gz
│       ├── SRR5808831.TG_R2.fq.gz
│       ├── SRR5808882.TG_R1.fq.gz
│       └── SRR5808882.TG_R2.fq.gz
├── readalignment
│   ├── SRR5808831.TG.flagstat.txt
│   ├── SRR5808831.TG.sam
│   ├── SRR5808831.TG.sorted.bam
│   ├── SRR5808882.TG.flagstat.txt
│   ├── SRR5808882.TG.sam
│   └── SRR5808882.TG.sorted.bam
└── taxon
    ├── LevelA.brite.counts.tsv
    ├── LevelB.brite.counts.tsv
    ├── LevelC.brite.counts.tsv
    └── OTU.brite.tsv
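A finished run can be sanity-checked against the tree above with a few lines of Python. The sketch below is an assumption-laden illustration, not part of the pipeline: the helper name missing_output_dirs is hypothetical, the folder list is taken from the example tree, and runs with some submodules disabled may legitimately omit folders.

```python
import tempfile
from pathlib import Path

# Top-level folders taken from the example tree above.
# Assumption: disabled submodules may mean some of these are absent.
EXPECTED_DIRS = ("assembly", "data", "geneprediction", "qc", "readalignment", "taxon")


def missing_output_dirs(output_dir):
    """Return the expected top-level folders that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# Demonstration on a throwaway directory containing only two of the folders.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("assembly", "qc"):
        (Path(tmp) / name).mkdir()
    demo_missing = missing_output_dirs(tmp)

print(demo_missing)
```

An empty list means every folder from the example tree is present; it says nothing about whether the files inside are complete.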