Remove chimeric reads

Gavin Douglas edited this page Jun 21, 2017 · 24 revisions

Reference-based chimera checking

Our script chimera_filter.pl wraps VSEARCH, which implements the uchime algorithm, to remove chimeric reads.

Here is an example command:

chimera_filter.pl -type 1 -db /home/shared/rRNA_db/RDP_trainset16_022016.fa fasta_files/*

Here "-type 1" means that reads clearly called as chimeric and reads with an ambiguous (borderline) call are both filtered out.

A reference database (FASTA file) must also be supplied with -db. You can download the versions we use from our requirements page.

The "mindiv" and "minh" settings (see here for details) could in principle have a significant effect on results. However, so far we have found that small adjustments to these parameters have only a minor effect on sensitivity and specificity when running chimera checking on 16S sequences.

If you'd like to use USEARCH v6.1 (this is not open-source, but you can look into the license for a 32-bit version at the USEARCH website) instead of VSEARCH, you can use the older version of our script called "chimera_filter_usearch61.pl" (the options are mainly the same).

Options:

  • -h, --help
    Displays the entire help documentation.

  • -v, --version
    Displays version number and exits.

  • --type <[0|1]>
    Non-chimeric output type: either only sequences that are clearly non-chimeric (1), or all sequences that are not called as chimeric (0; this includes borderline sequences, marked "?" in the uchime output).

  • --mindiv <float>
    Minimum % divergence between query and target sequence (default 1.5; note that this differs from the uchime default of 0.8).

  • --minh <float>
    Minimum score to be called as chimeric (default 0.2; note that this differs from the uchime default of 0.28).

  • -o, --out_dir
    Output directory for filtered FASTA files. Default is "non_chimeras".

  • --thread <# of CPUs>
    Number of CPUs to use. Giving this option without a value uses all CPUs on the machine, while giving it a value limits the run to that many CPUs. Without this option, only one CPU is used.

  • --keep
    Flag to indicate that temporary log files should not be removed (useful for troubleshooting). It also prevents the separate "nonchimera" and "unclear" FASTA files from being removed when --type is 0.

  • -log
    The location to write the log file. Default is "chimera_filter.log".

  • -db, --database
    Database of 16S sequences to use as a reference (FASTA file).
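
As a minimal sketch of how the options above combine, the wrapper below checks that the reference database exists before calling chimera_filter.pl. The function name run_chimera_filter is hypothetical (it is not part of Microbiome Helper); the flags and their values are taken from the option list above, with --type 0 chosen here to keep borderline reads.

```shell
# Hypothetical helper, not part of Microbiome Helper.
# Usage: run_chimera_filter <db.fa> <fasta> [<fasta> ...]
run_chimera_filter() {
    if [ "$#" -lt 2 ]; then
        echo "usage: run_chimera_filter <db.fa> <fasta> [<fasta> ...]" >&2
        return 2
    fi
    cf_db="$1"
    shift

    # Fail early if the reference database is missing or empty.
    if [ ! -s "$cf_db" ]; then
        echo "Reference database missing or empty: $cf_db" >&2
        return 1
    fi

    # --type 0 keeps borderline reads; --mindiv and --minh are set to the
    # script defaults listed above; --thread 4 limits the run to 4 CPUs.
    chimera_filter.pl --type 0 --mindiv 1.5 --minh 0.2 --thread 4 \
        -o non_chimeras -db "$cf_db" "$@"
}

# Example call, using the database path from the example command above:
# run_chimera_filter /home/shared/rRNA_db/RDP_trainset16_022016.fa fasta_files/*
```

Stating --mindiv and --minh explicitly, even at their defaults, records the thresholds used alongside the command.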

De novo chimera checking

If you would prefer to run de novo chimera checking, this is also simple to do with VSEARCH, which implements the UCHIME de novo algorithm. We don't have a wrapper script for these commands, but they can easily be run with GNU parallel (see our basic tutorial for basic usage). The catch is that you will get a raw logfile for each FASTA rather than a single summary logfile as above.

Before running de novo chimera detection, it's important to dereplicate your reads with VSEARCH. If you skip this step, 0 chimeric reads will be identified (as described on the VSEARCH Google Group here and here). Dereplication is required because chimeras are identified by finding possible parents that, by default, have at least twice the abundance of the chimera (this threshold and several other chimera detection options can be set by the user). You'll need to rereplicate your sequences before continuing with the standard Microbiome Helper pipeline.
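
To illustrate why dereplication matters, here is a toy sketch of the idea: identical sequences are collapsed into one record whose header carries an abundance annotation. This is only an illustration and not what VSEARCH runs internally. The function name toy_derep is hypothetical, the input is assumed to have each sequence on a single line, and the exact header format written by VSEARCH may differ between versions.

```shell
# Illustration only: a toy stand-in for dereplication, NOT the VSEARCH code.
# Assumes two-line FASTA records (header line, then one sequence line).
toy_derep() {
    awk '
        NF == 0 { next }                        # skip blank lines
        /^>/ { header = substr($0, 2); next }   # remember the current header
        {
            seq = toupper($0)                   # compare case-insensitively
            if (!(seq in count)) {              # first time this sequence is seen
                first[seq] = header
                order[++n] = seq
            }
            count[seq]++                        # tally identical sequences
        }
        END {
            # One record per unique sequence, in first-seen order,
            # with the abundance appended to the header.
            for (i = 1; i <= n; i++) {
                s = order[i]
                printf ">%s;size=%d\n%s\n", first[s], count[s], s
            }
        }
    ' "$1"
}
```

Without these abundance annotations every read looks like a singleton, so no candidate parent can be at least twice as abundant as a putative chimera.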

First make the output folders.

mkdir fasta_files_derep
mkdir non_denovo_chimeras
mkdir non_denovo_chimeras_rerep

Then run dereplication with GNU parallel.

parallel -j 4 'vsearch --derep_fulllength {} --sizeout --output fasta_files_derep/{/.}.derep.fasta' ::: fasta_files/*fasta

Run the uchime_denovo algorithm with GNU parallel.

parallel --eta -j 4 'vsearch --uchime_denovo {} --nonchimeras non_denovo_chimeras/{/.}.nonchimera.fasta 2> non_denovo_chimeras/{/.}.nonchimera.log' ::: fasta_files_derep/*.derep.fasta

Note that the logfile for each FASTA will be in the same output directory as the chimera-filtered FASTAs.

Finally, run rereplication of your sequences so that you can continue with the standard Microbiome Helper workflow.

parallel -j 4 'vsearch --rereplicate {} --output non_denovo_chimeras_rerep/{/.}.rerep.fasta' ::: non_denovo_chimeras/*fasta

The FASTA files to carry on to downstream steps will then be in non_denovo_chimeras_rerep.
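
Since the de novo route produces no summary logfile, a quick sanity check is to compare read counts per sample before and after filtering. The sketch below is a hypothetical helper (count_reads and summarize_denovo are not part of Microbiome Helper). It assumes the input files end in ".fasta" and that the output names follow from the three parallel commands above, for example sample1.fasta becomes sample1.derep.nonchimera.rerep.fasta.

```shell
# Hypothetical helpers, not part of Microbiome Helper.

# Count FASTA records (header lines) in one file.
count_reads() {
    grep -c '^>' "$1" || true   # grep -c still prints 0 when nothing matches
}

# Print "sample<TAB>input_reads<TAB>retained_reads" for every input FASTA.
# Usage: summarize_denovo [input_dir] [rereplicated_dir]
summarize_denovo() {
    sd_in="${1:-fasta_files}"
    sd_out="${2:-non_denovo_chimeras_rerep}"
    for sd_f in "$sd_in"/*.fasta; do
        [ -e "$sd_f" ] || continue              # no matching input files
        sd_sample=$(basename "$sd_f" .fasta)
        sd_res="$sd_out/$sd_sample.derep.nonchimera.rerep.fasta"
        if [ -e "$sd_res" ]; then
            printf '%s\t%s\t%s\n' "$sd_sample" \
                "$(count_reads "$sd_f")" "$(count_reads "$sd_res")"
        else
            # Output missing: report NA rather than guessing a count.
            printf '%s\t%s\tNA\n' "$sd_sample" "$(count_reads "$sd_f")"
        fi
    done
}

# Example call with the directory names used above:
# summarize_denovo fasta_files non_denovo_chimeras_rerep
```

A sample that retains far fewer reads than the others is worth checking against its raw logfile in non_denovo_chimeras.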
