This document describes the output produced by the pipeline. Most of the plots are taken from the AssemblyQC report which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
The pipeline is built using Nextflow and processes data to produce the following outputs:
- Format validation
- Assemblathon stats
- gfastats
- NCBI FCS-adaptor
- NCBI FCS-GX
- tidk
- BUSCO
- LAI
- Kraken 2
- HiC contact map
- Merqury
- Synteny
- GenomeTools gt stat
- OrthoFinder
- Pipeline information
The pipeline prints a warning in the pipeline log if FASTA or GFF3 validation fails. The error log from the validator is included in report.html. The remaining QC tools are skipped for any assembly with an invalid FASTA file.
Output files
- assemblathon_stats/
  - *_stats.csv: Assembly stats in CSV format.
assemblathon_stats.pl is a script which calculates a basic set of metrics from a genome assembly.
Warning
Contig-related stats are based on the assumption that assemblathon_stats_n_limit is specified correctly. If you are not certain of the value of assemblathon_stats_n_limit, please ignore the contig-related stats.
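To illustrate why the contig-related stats depend on assemblathon_stats_n_limit, the sketch below splits a scaffold into contigs at runs of N. It assumes that the parameter is the minimum number of consecutive N characters treated as a gap between contigs. The function name split_scaffold is hypothetical and is not part of the pipeline or of assemblathon_stats.pl.

```python
import re


def split_scaffold(sequence: str, n_limit: int) -> list[str]:
    """Split a scaffold into contigs at runs of at least n_limit N characters.

    Assumption: n_limit is the minimum run of consecutive Ns that counts
    as a gap between contigs. Shorter runs stay inside the contig.
    """
    if n_limit < 1:
        raise ValueError("n_limit must be at least 1")
    # Builds a pattern such as [Nn]{3,} for n_limit=3.
    gap = re.compile(rf"[Nn]{{{n_limit},}}")
    return [contig for contig in gap.split(sequence) if contig]


# Made-up scaffold with one gap of 5 Ns and one gap of 2 Ns.
scaffold = "ACGT" + "N" * 5 + "GGCC" + "N" * 2 + "TTAA"
for limit in (1, 3, 10):
    contigs = split_scaffold(scaffold, n_limit=limit)
    print(limit, len(contigs), contigs)
```

The same scaffold yields three, two or one contigs depending on the limit, so contig counts and contig N50 change with the chosen value.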
Output files
- gfastats/
  - *.assembly_summary: Assembly stats in TSV format.
gfastats is a fast and exhaustive tool for summary statistics.
Output files
- ncbi_fcs_adaptor/
  - *_fcs_adaptor_report.tsv: NCBI FCS-adaptor report in TSV format.
FCS-adaptor detects adaptor and vector contamination in genome sequences.
Output files
- ncbi_fcs_gx/
  - *.taxonomy.rpt: Taxonomy report.
  - *.fcs_gx_report.txt: A final report of recommended actions.
  - *.inter.tax.rpt.tsv: Select columns from *.taxonomy.rpt used for generation of a Krona taxonomy plot.
  - *.fcs.gx.krona.cut: Taxonomy file for Krona plot created from *.inter.tax.rpt.tsv.
  - *.fcs.gx.krona.html: Interactive Krona taxonomy plot.
FCS-GX detects contamination from foreign organisms in genome sequences.
Output files
- tidk/
  - *.apriori.tsv: Frequencies for successive windows in forward and reverse directions for the pre-specified telomere-repeat sequence.
  - *.apriori.svg: Plot of *.apriori.tsv.
  - *.tidk.explore.tsv: List of the most frequent repeat sequences.
  - *.top.sequence.txt: The top sequence from *.tidk.explore.tsv.
  - *.aposteriori.tsv: Frequencies for successive windows in forward and reverse directions for the top sequence from *.top.sequence.txt.
  - *.aposteriori.svg: Plot of *.aposteriori.tsv.
The tidk toolkit is designed to identify and visualize telomeric repeats for the Darwin Tree of Life genomes.
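The idea behind the windowed frequency tables can be sketched as follows. This is a simplified illustration, not tidk's implementation or output format: it counts a repeat unit and its reverse complement in successive fixed-size windows. The function names, the window size and the example repeat are assumptions made for the example.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def repeat_counts_by_window(sequence: str, repeat: str, window: int):
    """Count a repeat in forward and reverse orientation per window.

    Returns (window_start, forward_count, reverse_count) tuples.
    Simplification: repeats that span a window boundary are not counted.
    """
    sequence = sequence.upper()
    forward = repeat.upper()
    reverse = reverse_complement(forward)
    rows = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        rows.append((start, chunk.count(forward), chunk.count(reverse)))
    return rows


# Made-up sequence: forward repeats at one end, reverse-complement repeats at the other.
example = "TTTAGGG" * 4 + "ACGTACGTACGT" + "CCCTAAA" * 3
for row in repeat_counts_by_window(example, repeat="TTTAGGG", window=28):
    print(row)
```

High forward counts at one end of a sequence and high reverse counts at the other end are the pattern expected when telomeres are present at both ends.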
Output files
- busco/
  - fasta
    - busco_figure.png: Summary figure created from all the BUSCO summaries.
    - tag
      - short_summary.specific.*_odb10.tag_*.txt: BUSCO summary for the assembly represented by tag.
  - gff
    - busco_figure.png: Summary figure created from all the BUSCO summaries.
    - tag
      - short_summary.specific.*_odb10.tag_*.txt: BUSCO summary for the annotation of the assembly represented by tag.
BUSCO estimates the completeness and redundancy of processed genomic data based on universal single-copy orthologs.
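For downstream comparisons, the completeness figures can be pulled out of a short summary file with a small parser. The sketch below assumes that the summary contains BUSCO's one-line notation of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:..; the function name and the example values are made up for illustration.

```python
import re

# One-line BUSCO notation: Complete [Single, Duplicated], Fragmented, Missing, total.
BUSCO_LINE = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(text: str) -> dict:
    """Extract BUSCO percentages and the total BUSCO count from summary text."""
    match = BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO one-line summary found")
    result = {key: float(value) for key, value in match.groupdict().items()}
    result["total"] = int(match.group("total"))
    return result


# Made-up summary line, not taken from a real run.
example_summary = "\tC:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:1614\t"
print(parse_busco_summary(example_summary))
```

To use it on a real file, pass the file contents, for example the text read from one of the short_summary.specific.*_odb10.tag_*.txt files.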
Output files
- lai/
  - *.LAI.log: Log file from LAI.
  - *.LAI.out: Output file from LAI which lists assembly index by contig and for the whole genome.
  - *.LTRlib.fa: Long terminal repeat library generated by LTR_retriever.
  - *.restored.ids.gff3: Long terminal repeat annotation generated by LTR_retriever.
  - *.short.ids.tsv: LTR_retriever and LAI require that the assembly sequence IDs are alphanumeric and not more than 13 characters long. If needed, the pipeline shortens these IDs. The new and original IDs are listed in this TSV file.
LTR Assembly Index (LAI) is a reference-free genome metric that evaluates assembly continuity using LTR retrotransposons (LTR-RTs). LTR-RTs are the predominant interspersed repeat that is poorly assembled in draft genomes. Correcting for LTR-RT amplification dynamics, LAI is independent of genome size, genomic LTR-RT content, and gene space evaluation metrics such as BUSCO. It is calculated as follows:
Raw LAI = (Intact LTR element length / Total LTR sequence length) × 100
LAI = Raw LAI + 2.8138 × (94 - whole genome LTR identity)
The LAI is set to 0 when raw LAI = 0 or the adjustment produces a negative value.
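The LAI arithmetic described above can be written out as a short sketch. It assumes that the whole genome LTR identity is given in percent, since it is compared with 94; the function and argument names are hypothetical and the input numbers are made up.

```python
def raw_lai(intact_ltr_length: float, total_ltr_length: float) -> float:
    """Raw LAI = (intact LTR element length / total LTR sequence length) * 100."""
    if total_ltr_length <= 0:
        raise ValueError("total LTR sequence length must be positive")
    return intact_ltr_length / total_ltr_length * 100


def lai(raw: float, whole_genome_ltr_identity: float) -> float:
    """LAI = raw LAI + 2.8138 * (94 - identity), floored at 0.

    The result is 0 when the raw LAI is 0 or the adjusted value is negative.
    Assumption: identity is expressed in percent.
    """
    if raw == 0:
        return 0.0
    adjusted = raw + 2.8138 * (94 - whole_genome_ltr_identity)
    return max(adjusted, 0.0)


# Made-up lengths in base pairs: 1.2 Mb of intact LTR elements out of 10 Mb of LTR sequence.
raw = raw_lai(1_200_000, 10_000_000)
print(f"raw LAI: {raw:.2f}")
print(f"LAI at 95% identity: {lai(raw, 95.0):.4f}")
print(f"LAI at 99% identity with raw LAI of 1: {lai(1.0, 99.0):.4f}")
```

An identity above 94% lowers the adjusted value, while an identity below 94% raises it.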
Warning
Soft masked regions are unmasked when calculating LAI. However, hard masked regions are left as is. The pipeline will fail to calculate LAI if all the LTRs are already hard masked.
Output files
- kraken2/
  - *.kraken2.report: Kraken 2 report.
  - *.kraken2.cut: Kraken 2 output.
  - *.kraken2.krona.cut: Select columns from *.kraken2.cut used for generation of a Krona taxonomy plot.
  - *.kraken2.krona.html: Interactive Krona taxonomy plot.
Kraken 2 assigns taxonomic labels to sequencing reads for metagenomics projects. Further reading regarding performance of Kraken 2: https://doi.org/10.1099/mgen.0.000949
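As a quick way to inspect a report outside the Krona plot, the sketch below lists the most abundant taxa at a given rank. It assumes the standard tab-separated Kraken 2 report layout (percentage, clade-level count, directly assigned count, rank code, NCBI taxonomy ID, indented name). The function name and the example counts are made up.

```python
def top_taxa(report_text: str, rank_code: str = "S", limit: int = 3):
    """Return (name, taxid, percent, clade_count) for the top taxa at one rank.

    Assumes the standard six-column, tab-separated Kraken 2 report layout.
    """
    rows = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent, clade_count, _direct, rank, taxid, name = fields[:6]
        if rank.strip() == rank_code:
            rows.append((name.strip(), int(taxid), float(percent), int(clade_count)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


# Made-up report lines, not taken from a real run.
example_report = "\n".join([
    "  5.00\t50\t50\tU\t0\tunclassified",
    " 95.00\t950\t10\tR\t1\troot",
    "  4.00\t40\t40\tS\t562\t      Escherichia coli",
    " 90.00\t900\t900\tS\t3702\t      Arabidopsis thaliana",
])
for taxon in top_taxa(example_report):
    print(taxon)
```

A species other than the expected organism appearing near the top of this list is a hint to look more closely at the Krona plot.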
Output files
- hic/
  - fastqc_raw/
    - *_1_fastqc.html/*_2_fastqc.html: FastQC HTML report for the raw reads.
    - *_1_fastqc.zip/*_2_fastqc.zip: FastQC stats for the raw reads.
  - fastp/
    - *.fastp.html: fastp HTML report.
    - *.fastp.json: fastp statistics in JSON format.
    - *.fastp.log: fastp log.
    - *_1.fastp.fastq.gz/*_2.fastp.fastq.gz: Reads passed by fastp.
    - *_1.fail.fastq.gz/*_2.fail.fastq.gz: Reads failed by fastp.
  - fastqc_trim/
    - *_1_fastqc.html/*_2_fastqc.html: FastQC HTML report for the reads passed by fastp.
    - *_1_fastqc.zip/*_2_fastqc.zip: FastQC stats for the reads passed by fastp.
  - hicqc
    - *.on.*_qc_report.pdf: HiC QC report for reads mapped to an assembly.
  - assembly/
    - *.agp.assembly: AGP assembly file listing the length of each contig in the assembly.
  - bedpe/
    - *.assembly.bedpe: *.agp.assembly file converted to BEDPE to highlight the contigs on the HiC contact map.
Hi-C contact mapping experiments measure the frequency of physical contact between loci in the genome. The resulting dataset, called a “contact map,” is represented using a two-dimensional heatmap where the intensity of each pixel indicates the frequency of contact between a pair of loci.
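How a list of contacts becomes a heatmap can be sketched with a few lines of code. This is an illustration of the concept, not the method used by the pipeline: each contact is a pair of 0-based positions, positions are grouped into fixed-size bins, and each cell of the matrix counts the contacts between two bins. The function name, bin size and contacts are made up.

```python
def contact_matrix(contacts, genome_length: int, bin_size: int):
    """Build a symmetric matrix of contact counts between genomic bins.

    contacts is an iterable of (position_a, position_b) pairs with
    0-based positions smaller than genome_length.
    """
    n_bins = -(-genome_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in contacts:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


# Made-up contacts on a 300 bp toy genome with 100 bp bins.
example_contacts = [(10, 20), (15, 250), (120, 260), (130, 270)]
for row in contact_matrix(example_contacts, genome_length=300, bin_size=100):
    print(row)
```

Each number corresponds to the intensity of one pixel in the heatmap, so the cell for bins 1 and 2 would be the brightest off-diagonal pixel in this toy example.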
Output files
- merqury/
  - tag1-and-tag2: Results folder for haplotype tag1 and tag2.
    - *.completeness.stats: Assembly completeness statistics.
    - *.qv: Assembly consensus quality QV statistics.
    - *.fl.png: Spectra plots.
    - *.hapmers.blob.png: Hap-mer blob plot.
Merqury is used for the k-mer analysis.
Output files
- synteny/
  - *.*.all/: Synteny files corresponding to all contigs of the target assembly with respect to all contigs of the reference assembly.
    - *.on.*.all.png/svg: Synteny plot generated with CIRCOS.
    - *.on.*.all.html: Synteny dotplot generated with Plotly.
    - bundled.links.tsv: Bundled links file generated with MUMMER, MUMMER/dnadiff.pl and bundlelinks.py.
    - circos.conf: CIRCOS configuration file used to generate the synteny plot.
    - karyotype.tsv: Karyotype TSV file used to generate the synteny plot.
  - *.on.*.*: Synteny files corresponding to a single contig of the target assembly with respect to all contigs of the reference assembly.
  - plotsr: Plotsr files.
    - *.error.log: Error log for the failed Syri comparison.
    - *.plotsr.csv: CSV file listing sequence IDs and labels used by plotsr.
    - plotsr.png: Plotsr synteny plot.
Circos plots and dotplots are created from genome-wide alignments performed with MUMmer, whereas Plotsr plots are created from genome-wide alignments performed with Minimap2.
Output files
- genometools_gt_stat/
  - *.gt.stat.yml: Assembly annotation stats in YAML format.
The GenomeTools gt stat tool calculates a basic set of statistics about features contained in GFF3 files.
Output files
- orthofinder/assemblyqc: OrthoFinder output folder.
If more than one assembly is included along with its annotation, OrthoFinder is executed on the annotation proteins to perform a phylogenetic orthology inference for comparative genomics.
Output files
- pipeline_info/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.html.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
  - Parameters used by the pipeline run: params.json
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
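As one example of troubleshooting with these reports, the sketch below lists failed tasks from a trace file. It assumes that execution_trace.txt is tab-separated with a header row that includes the name, status and exit columns. The function name and the task names in the example are made up.

```python
import csv
import io


def failed_tasks(trace_text: str):
    """Return (task name, exit code) for every task whose status is FAILED.

    Assumes a tab-separated trace with 'name', 'status' and 'exit' columns.
    """
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return [(row["name"], row["exit"]) for row in reader if row["status"] == "FAILED"]


# Made-up trace content with hypothetical task names.
example_trace = "\n".join([
    "task_id\tname\tstatus\texit",
    "1\tSTEP_A (sample1)\tCOMPLETED\t0",
    "2\tSTEP_B (sample1)\tFAILED\t1",
    "3\tSTEP_A (sample2)\tCACHED\t0",
])
print(failed_tasks(example_trace))
```

For a real run, the text can be read from pipeline_info/execution_trace.txt in the results directory and passed to the function.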