Showing 9 changed files with 134 additions and 156 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,107 +1,132 @@ | ||
# AFTER | ||
Automatic Filtering, Trimming, and Error Removing for fastq data | ||
Currently it supports Illumina 1.8 or newer format, see: | ||
http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm | ||
AFTER can simply go through all fastq files in a folder and then output a <b>good</b> folder and a <b>bad</b> folder, which contains good reads and bad reads of each fastq file | ||
Automatic Filtering, Trimming, and Error Removing for fastq data | ||
AFTER can simply go through all fastq files in a folder and then output a <b>good</b> folder and a <b>bad</b> folder, which contains good reads and bad reads of each fastq file | ||
Currently it supports Illumina 1.8 or newer format, see [here](http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm) | ||
|
||
# Version | ||
1.0 | ||
# Latest release | ||
0.1.0 (Released in 2015-12-10) | ||
|
||
# Feedback/contact | ||
[email protected] | ||
[email protected] | ||
|
||
# Features: | ||
AFTER does the following tasks automatically: | ||
1, Filter PolyA/PolyT/PolyC/PolyG reads | ||
2, Trim reads at front and tail according to bad per base sequence content | ||
3, Detect and eliminate bubble artifact caused by sequencer due to fluid dynamics issue | ||
4, Filter low-quality reads | ||
4, Filter low-quality reads | ||
5, Barcode sequencing support: if all reads have a random barcode (see duplex sequencing), this program can detect and split the barcode into query name | ||
|
||
# Simple usage: | ||
##### 1, cd to the folder contains all fastq files | ||
##### 2, run: | ||
##### python after.py | ||
```shell | ||
cd /path/to/fastq/folder | ||
python after.py | ||
``` | ||
|
||
# Debubble: | ||
If you want to eliminate bubble artifact, run: | ||
##### python after.py --debubble=on | ||
```shell | ||
python after.py --debubble=on | ||
``` | ||
|
||
# Full usage: | ||
###### python after.py [-d input_dir][-1 read1_file] [-2 read1_file] [-7 index1_file] [-5 index2_file] [-g good_output_folder] [-b bad_output_folder] [-f trim_front] [-t trim_tail] [-m min_quality] [-q qualified_quality] [-l max_low_quality] [-p poly_max] [-a allow_poly_mismatch] [-n max_n_count] [--debubble=on/off] [--debubble_dir=xxx] [--draw=on/off] [--read1_flag=\_R1\_] [--read2_flag=\_R2\_] [--index1_flag=\_I1\_] [--index2_flag=\_I2\_] | ||
```shell | ||
python after.py [-d input_dir] [-1 read1_file] [-2 read2_file] [-7 index1_file] [-5 index2_file] [-g good_output_folder] [-b bad_output_folder] [-f trim_front] [-t trim_tail] [-q qualified_quality_phred] [-u unqualified_base_limit] [-p poly_size_limit] [-a allow_mismatch_in_poly] [-n n_base_limit] [--debubble=on/off] [--debubble_dir=xxx] [--draw=on/off] [--read1_flag=_R1_] [--read2_flag=_R2_] [--index1_flag=_I1_] [--index2_flag=_I2_] | ||
``` | ||
Common options: | ||
```shell | ||
--version show program's version number and exit | ||
-h, --help show this help message and exit | ||
``` | ||
File (name) options: | ||
```shell | ||
Options: | ||
* --version show program's version number and exit | ||
* -h, --help show this help message and exit | ||
* -1 READ1_FILE, --read1_file=READ1_FILE | ||
-1 READ1_FILE, --read1_file=READ1_FILE | ||
file name of read1, required. If input_dir is | ||
specified, then this arg is ignored. | ||
* -2 READ2_FILE, --read2_file=READ2_FILE | ||
specified, then this arg is ignored. | ||
-2 READ2_FILE, --read2_file=READ2_FILE | ||
file name of read2, if paired. If input_dir is | ||
specified, then this arg is ignored. | ||
* -7 INDEX1_FILE, --index1_file=INDEX1_FILE | ||
specified, then this arg is ignored. | ||
-7 INDEX1_FILE, --index1_file=INDEX1_FILE | ||
file name of 7' index. If input_dir is specified, then | ||
this arg is ignored. | ||
* -5 INDEX2_FILE, --index2_file=INDEX2_FILE | ||
this arg is ignored. | ||
-5 INDEX2_FILE, --index2_file=INDEX2_FILE | ||
file name of 5' index. If input_dir is specified, then | ||
this arg is ignored. | ||
* -g GOOD_OUTPUT_FOLDER, --good_output_folder=GOOD_OUTPUT_FOLDER | ||
this arg is ignored. | ||
-d INPUT_DIR, --input_dir=INPUT_DIR | ||
the input dir to process automatically. If read1_file | ||
and input_dir are not specified, then current dir (.) | ||
is used as input_dir | ||
-g GOOD_OUTPUT_FOLDER, --good_output_folder=GOOD_OUTPUT_FOLDER | ||
the folder to store good reads, by default it is the | ||
same folder contains read1 | ||
* -b BAD_OUTPUT_FOLDER, --bad_output_folder=BAD_OUTPUT_FOLDER | ||
same folder contains read1 | ||
-b BAD_OUTPUT_FOLDER, --bad_output_folder=BAD_OUTPUT_FOLDER | ||
the folder to store bad reads, by default it is same | ||
as good_output_folder | ||
* -f TRIM_FRONT, --trim_front=TRIM_FRONT | ||
as good_output_folder | ||
--read1_flag=READ1_FLAG | ||
specify the name flag of read1, default is _R1_, which | ||
means a file with name *_R1_* is read1 file | ||
--read2_flag=READ2_FLAG | ||
specify the name flag of read2, default is _R2_, which | ||
means a file with name *_R2_* is read2 file | ||
--index1_flag=INDEX1_FLAG | ||
specify the name flag of index1, default is _I1_, | ||
which means a file with name *_I1_* is index1 file | ||
--index2_flag=INDEX2_FLAG | ||
specify the name flag of index2, default is _I2_, | ||
which means a file with name *_I2_* is index2 file | ||
``` | ||
Filter options: | ||
```shell | ||
-f TRIM_FRONT, --trim_front=TRIM_FRONT | ||
number of bases to be trimmed in the head of read. -1 | ||
means auto detect | ||
* -t TRIM_TAIL, --trim_tail=TRIM_TAIL | ||
means auto detect | ||
-t TRIM_TAIL, --trim_tail=TRIM_TAIL | ||
number of bases to be trimmed in the tail of read. -1 | ||
means auto detect | ||
* -m MIN_QUALITY, --min_quality=MIN_QUALITY | ||
if exists one base has quality < min_quality, then | ||
this read/pair will be bad. Default 0 means do not | ||
filter reads by the least quality | ||
* -q QUALIFIED_QUALITY, --qualified_quality=QUALIFIED_QUALITY | ||
means auto detect | ||
-q QUALIFIED_QUALITY_PHRED, --qualified_quality_phred=QUALIFIED_QUALITY_PHRED | ||
the quality value at which a base is qualified. Default 20 | ||
means base quality >=Q20 is qualified. | ||
* -l MAX_LOW_QUALITY, --max_low_quality=MAX_LOW_QUALITY | ||
if exists more than maxlq bases that quality is lower | ||
than qualified quality, then this read/pair is bad. | ||
Default 0 means do not filter reads by low quality | ||
base count | ||
* -p POLY_MAX, --poly_max=POLY_MAX | ||
means base quality >=Q20 is qualified. | ||
-u UNQUALIFIED_BASE_LIMIT, --unqualified_base_limit=UNQUALIFIED_BASE_LIMIT | ||
if exists more than unqualified_base_limit bases that | ||
quality is lower than qualified quality, then this | ||
read/pair is bad. Default 0 means do not filter reads | ||
by low quality base count | ||
-p POLY_SIZE_LIMIT, --poly_size_limit=POLY_SIZE_LIMIT | ||
if exists one polyX(polyG means GGGGGGGGG...), and its | ||
length is >= poly_max, then this read/pair is bad. | ||
Default is 40 | ||
* -a ALLOW_POLY_MISMATCH, --allow_poly_mismatch=ALLOW_POLY_MISMATCH | ||
length is >= poly_size_limit, then this read/pair is | ||
bad. Default is 40 | ||
-a ALLOW_MISMATCH_IN_POLY, --allow_mismatch_in_poly=ALLOW_MISMATCH_IN_POLY | ||
the count of allowed mismatches when evaluating | ||
poly_X. Default 5 means disallow any mismatches | ||
* -n MAX_N_COUNT, --max_n_count=MAX_N_COUNT | ||
poly_X. Default 5 means disallow any mismatches | ||
-n N_BASE_LIMIT, --n_base_limit=N_BASE_LIMIT | ||
if more than n_base_limit bases are N, then this | ||
read/pair is bad. Default is 5 | ||
* -s MIN_SEQ_LEN, --min_seq_len=MIN_SEQ_LEN | ||
if the trimmed read is shorter than min_seq_len, then | ||
this read/pair is bad. Default is 35 | ||
* -d INPUT_DIR, --input_dir=INPUT_DIR | ||
the input dir to process automatically. If read1_file | ||
and input_dir are not specified, then current dir (.) | ||
is used as input_dir | ||
* --debubble=DEBUBBLE specify whether apply debubble algorithm to remove the | ||
reads in the bubbles. Default is off | ||
* --debubble_dir=DEBUBBLE_DIR | ||
read/pair is bad. Default is 5 | ||
-s SEQ_LEN_REQ, --seq_len_req=SEQ_LEN_REQ | ||
if the trimmed read is shorter than seq_len_req, then | ||
this read/pair is bad. Default is 35 | ||
``` | ||
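The filter options above can be illustrated with a short sketch. This is not AFTER's actual implementation; it only mirrors the documented meaning of `qualified_quality_phred`, `unqualified_base_limit`, `n_base_limit` and `seq_len_req`. The function names `count_unqualified` and `is_good_read` are hypothetical, and the Phred+33 offset is assumed because the README targets Illumina 1.8 or newer.

```python
# Hypothetical sketch of the documented filter rules; not taken from after.py.

PHRED_OFFSET = 33  # Illumina 1.8+ encodes quality as Phred+33 (assumption stated above)


def count_unqualified(quality, qualified_quality_phred=20, offset=PHRED_OFFSET):
    """Count bases whose Phred score is below the qualified threshold."""
    return sum(1 for ch in quality if ord(ch) - offset < qualified_quality_phred)


def is_good_read(seq, quality,
                 qualified_quality_phred=20,
                 unqualified_base_limit=0,
                 n_base_limit=5,
                 seq_len_req=35):
    """Return True if the (already trimmed) read passes the three checks."""
    # Reads shorter than seq_len_req are bad.
    if len(seq) < seq_len_req:
        return False
    # Reads with more than n_base_limit N bases are bad.
    if seq.upper().count("N") > n_base_limit:
        return False
    # A limit of 0 is documented as "do not filter by low quality base count".
    if unqualified_base_limit > 0:
        if count_unqualified(quality, qualified_quality_phred) > unqualified_base_limit:
            return False
    return True


if __name__ == "__main__":
    seq = "A" * 40
    print(is_good_read(seq, "I" * 40))                             # high quality read
    print(is_good_read(seq, "#" * 40, unqualified_base_limit=10))  # low quality read
```

For a pair, the README states that the read/pair is bad, so a caller would presumably apply the same check to both mates and reject the pair if either fails.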
Debubble options: | ||
```shell | ||
--debubble=DEBUBBLE specify whether apply debubble algorithm to remove the | ||
reads in the bubbles. Default is off | ||
--debubble_dir=DEBUBBLE_DIR | ||
specify the folder to store output of debubble | ||
algorithm, default is debubble | ||
* --draw=DRAW specify whether draw the pictures or not, when use | ||
debubble or QC. Default is on | ||
* --read1_flag=READ1_FLAG | ||
specify the name flag of read1, default is _R1_, which | ||
means a file with name *_R1_* is read1 file | ||
* --read2_flag=READ2_FLAG | ||
specify the name flag of read2, default is _R2_, which | ||
means a file with name *_R2_* is read2 file | ||
* --index1_flag=INDEX1_FLAG | ||
specify the name flag of index1, default is _I1_, | ||
which means a file with name *_I1_* is index2 file | ||
* --index2_flag=INDEX2_FLAG | ||
specify the name flag of index2, default is _I2_, | ||
which means a file with name *_I2_* is index2 file | ||
algorithm, default is debubble | ||
--draw=DRAW specify whether draw the pictures or not, when use | ||
debubble or QC. Default is on | ||
``` | ||
Barcoded sequencing options: | ||
``` | ||
--barcode=BARCODE specify whether deal with barcode sequencing files, default is on | ||
--barcode_length=BARCODE_LENGTH | ||
specify the designed length of barcode | ||
--barcode_flag=BARCODE_FLAG | ||
specify the name flag of a barcoded file, default is | ||
barcode, which means a file with name *barcode* is a | ||
barcoded file | ||
--barcode=BARCODE specify whether deal with barcode sequencing files, | ||
default is on, which means all files with barcode_flag | ||
in filename will be treated as barcode sequencing | ||
files | ||
``` |
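The barcode feature is described only as detecting the barcode and splitting it "into query name". A minimal sketch of that idea follows, assuming the barcode occupies the first `barcode_length` bases of the read and is appended to the read identifier with a `:` separator. Both assumptions, and the function name `split_barcode`, are hypothetical and may differ from what AFTER does.

```python
# Hypothetical sketch: move a leading barcode from the sequence into the read name.

def split_barcode(name, seq, quality, barcode_length):
    """Return (new_name, new_seq, new_quality) with the barcode moved to the name.

    Assumes the barcode is the first `barcode_length` bases of `seq`.
    Reads too short to carry a barcode are returned unchanged.
    """
    if barcode_length <= 0 or len(seq) <= barcode_length:
        return name, seq, quality

    barcode = seq[:barcode_length]

    # Keep any comment after the first space (e.g. "1:N:0:1") untouched.
    parts = name.split(" ", 1)
    parts[0] = parts[0] + ":" + barcode
    new_name = " ".join(parts)

    # Sequence and quality must stay the same length, so trim both.
    return new_name, seq[barcode_length:], quality[barcode_length:]


if __name__ == "__main__":
    record = split_barcode("@read1 1:N:0:1", "ACGTACGTTTGGCC", "I" * 14, 8)
    print(record)
```

Keeping the barcode in the query name lets downstream tools group reads by barcode without re-reading the original bases.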