release 0.1.0
sfchen committed Dec 10, 2015
1 parent ae33f56 commit 2bdf4e7
Showing 9 changed files with 134 additions and 156 deletions.
175 changes: 100 additions & 75 deletions README.md
@@ -1,107 +1,132 @@
# AFTER
Automatic Filtering, Trimming, and Error Removing for fastq data
AFTER goes through all fastq files in a folder and outputs a <b>good</b> folder and a <b>bad</b> folder, which contain the good reads and bad reads of each fastq file
Currently it supports the Illumina 1.8 or newer format, see [here](http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm)

# Version
1.0
# Latest release
0.1.0 (released on 2015-12-10)

# Feedback/contact
[email protected]

# Features:
AFTER does the following tasks automatically:
1, Filter PolyA/PolyT/PolyC/PolyG reads
2, Trim reads at front and tail according to bad per base sequence content
3, Detect and eliminate bubble artifacts caused by the sequencer due to fluid dynamics issues
4, Filter low-quality reads
5, Barcode sequencing support: if all reads have a random barcode (see duplex sequencing), this program can detect the barcode and split it into the query name

# Simple usage:
##### 1, cd to the folder that contains all fastq files
##### 2, run after.py:
```shell
cd /path/to/fastq/folder
python after.py
```

# Debubble:
To eliminate bubble artifacts, run:
```shell
python after.py --debubble=on
```

# Full usage:
###### python after.py [-d input_dir][-1 read1_file] [-2 read1_file] [-7 index1_file] [-5 index2_file] [-g good_output_folder] [-b bad_output_folder] [-f trim_front] [-t trim_tail] [-m min_quality] [-q qualified_quality] [-l max_low_quality] [-p poly_max] [-a allow_poly_mismatch] [-n max_n_count] [--debubble=on/off] [--debubble_dir=xxx] [--draw=on/off] [--read1_flag=\_R1\_] [--read2_flag=\_R2\_] [--index1_flag=\_I1\_] [--index2_flag=\_I2\_]
```shell
python after.py [-d input_dir] [-1 read1_file] [-2 read2_file] [-7 index1_file] [-5 index2_file] [-g good_output_folder] [-b bad_output_folder] [-f trim_front] [-t trim_tail] [-q qualified_quality_phred] [-u unqualified_base_limit] [-p poly_size_limit] [-a allow_mismatch_in_poly] [-n n_base_limit] [--debubble=on/off] [--debubble_dir=xxx] [--draw=on/off] [--read1_flag=_R1_] [--read2_flag=_R2_] [--index1_flag=_I1_] [--index2_flag=_I2_]
```
Common options:
```shell
--version show program's version number and exit
-h, --help show this help message and exit
```
File (name) options:
```shell
-1 READ1_FILE, --read1_file=READ1_FILE
file name of read1, required. If input_dir is
specified, then this arg is ignored.
-2 READ2_FILE, --read2_file=READ2_FILE
file name of read2, if paired. If input_dir is
specified, then this arg is ignored.
-7 INDEX1_FILE, --index1_file=INDEX1_FILE
file name of 7' index. If input_dir is specified, then
this arg is ignored.
-5 INDEX2_FILE, --index2_file=INDEX2_FILE
file name of 5' index. If input_dir is specified, then
this arg is ignored.
-d INPUT_DIR, --input_dir=INPUT_DIR
the input dir to process automatically. If read1_file
and input_dir are not specified, then current dir (.)
is specified to input_dir
-g GOOD_OUTPUT_FOLDER, --good_output_folder=GOOD_OUTPUT_FOLDER
the folder to store good reads, by default it is the
same folder contains read1
-b BAD_OUTPUT_FOLDER, --bad_output_folder=BAD_OUTPUT_FOLDER
the folder to store bad reads, by default it is same
as good_output_folder
--read1_flag=READ1_FLAG
specify the name flag of read1, default is _R1_, which
means a file with name *_R1_* is read1 file
--read2_flag=READ2_FLAG
specify the name flag of read2, default is _R2_, which
means a file with name *_R2_* is read2 file
--index1_flag=INDEX1_FLAG
specify the name flag of index1, default is _I1_,
which means a file with name *_I1_* is index1 file
--index2_flag=INDEX2_FLAG
specify the name flag of index2, default is _I2_,
which means a file with name *_I2_* is index2 file
```
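The name flags above can be pictured as simple substring checks on file names. The sketch below is illustrative only and is not taken from after.py: the helper `classify_fastq` and the sample file names are hypothetical, and the flag values are the defaults listed above.

```python
import os

# Default name flags as documented in the file (name) options above.
DEFAULT_FLAGS = {
    "read1": "_R1_",
    "read2": "_R2_",
    "index1": "_I1_",
    "index2": "_I2_",
}

def classify_fastq(filename, flags=DEFAULT_FLAGS):
    """Return the role of a file based on the name flag it contains.

    Hypothetical helper: returns None when no flag matches.
    """
    base = os.path.basename(filename)
    for role, flag in flags.items():
        if flag in base:
            return role
    return None

print(classify_fastq("sample_S1_L001_R1_001.fastq"))  # read1
print(classify_fastq("sample_S1_L001_I2_001.fastq"))  # index2
```

Passing a different `flags` mapping mirrors what `--read1_flag` and the related options do on the command line.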
Filter options:
```shell
-f TRIM_FRONT, --trim_front=TRIM_FRONT
number of bases to be trimmed in the head of read. -1
means auto detect
-t TRIM_TAIL, --trim_tail=TRIM_TAIL
number of bases to be trimmed in the tail of read. -1
means auto detect
* -m MIN_QUALITY, --min_quality=MIN_QUALITY
if exists one base has quality < min_quality, then
this read/pair will be bad. Default 0 means do not
filter reads by the least quality
* -q QUALIFIED_QUALITY, --qualified_quality=QUALIFIED_QUALITY
-q QUALIFIED_QUALITY_PHRED, --qualified_quality_phred=QUALIFIED_QUALITY_PHRED
the quality value that a base is qualified. Default 20
means base quality >=Q20 is qualified.
* -l MAX_LOW_QUALITY, --max_low_quality=MAX_LOW_QUALITY
if exists more than maxlq bases that quality is lower
than qualified quality, then this read/pair is bad.
Default 0 means do not filter reads by low quality
base count
* -p POLY_MAX, --poly_max=POLY_MAX
-u UNQUALIFIED_BASE_LIMIT, --unqualified_base_limit=UNQUALIFIED_BASE_LIMIT
if exists more than unqualified_base_limit bases that
quality is lower than qualified quality, then this
read/pair is bad. Default 0 means do not filter reads
by low quality base count
-p POLY_SIZE_LIMIT, --poly_size_limit=POLY_SIZE_LIMIT
if exists one polyX(polyG means GGGGGGGGG...), and its
length is >= poly_max, then this read/pair is bad.
Default is 40
* -a ALLOW_POLY_MISMATCH, --allow_poly_mismatch=ALLOW_POLY_MISMATCH
length is >= poly_size_limit, then this read/pair is
bad. Default is 40
-a ALLOW_MISMATCH_IN_POLY, --allow_mismatch_in_poly=ALLOW_MISMATCH_IN_POLY
the count of allowed mismatches when evaluating
poly_X. Default 5 means disallow any mismatches
* -n MAX_N_COUNT, --max_n_count=MAX_N_COUNT
-n N_BASE_LIMIT, --n_base_limit=N_BASE_LIMIT
if exists more than maxn bases have N, then this
read/pair is bad. Default is 5
* -s MIN_SEQ_LEN, --min_seq_len=MIN_SEQ_LEN
if the trimmed read is shorter than min_seq_len, then
this read/pair is bad. Default is 35
-s SEQ_LEN_REQ, --seq_len_req=SEQ_LEN_REQ
if the trimmed read is shorter than seq_len_req, then
this read/pair is bad. Default is 35
```
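To make the thresholds above concrete, here is a minimal sketch of how the quality, N-base and length filters could be combined for a single trimmed read. It is not the implementation in after.py: the function `is_good_read` is hypothetical, the parameter names and defaults are taken from the option list above, and quality strings are assumed to be Phred+33 encoded (the Illumina 1.8+ convention).

```python
def is_good_read(seq, qual, qualified_quality_phred=20,
                 unqualified_base_limit=0, n_base_limit=5, seq_len_req=35):
    """Apply the documented thresholds to one (already trimmed) read.

    qual is the FASTQ quality string, assumed Phred+33 (Illumina 1.8+).
    """
    # Too short after trimming.
    if len(seq) < seq_len_req:
        return False
    # More N bases than allowed.
    if seq.count("N") > n_base_limit:
        return False
    # 0 is documented as "do not filter reads by low quality base count".
    if unqualified_base_limit > 0:
        low = sum(1 for c in qual if ord(c) - 33 < qualified_quality_phred)
        if low > unqualified_base_limit:
            return False
    return True

good = is_good_read("ACGT" * 10, "I" * 40)
short = is_good_read("ACGT", "IIII")
print(good, short)  # True False
```

For paired data the same check would presumably be applied to both reads, since the options describe a bad read/pair, but that pairing logic is left out of this sketch.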
Debubble options:
```shell
--debubble=DEBUBBLE specify whether apply debubble algorithm to remove the
reads in the bubbles. Default is off
--debubble_dir=DEBUBBLE_DIR
specify the folder to store output of debubble
algorithm, default is debubble
--draw=DRAW specify whether draw the pictures or not, when use
debubble or QC. Default is on
```
Barcoded sequencing options:
```shell
--barcode_length=BARCODE_LENGTH
specify the designed length of barcode
--barcode_flag=BARCODE_FLAG
specify the name flag of a barcoded file, default is
barcode, which means a file with name *barcode* is a
barcoded file
--barcode=BARCODE specify whether deal with barcode sequencing files,
default is on, which means all files with barcode_flag
in filename will be treated as barcode sequencing
files
```
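As a rough illustration of what moving a barcode into the query name could look like, the sketch below strips a fixed-length barcode from the start of a read and appends it to the read name. This is an assumption-laden sketch, not the logic used by after.py: the function name `move_barcode_to_name`, the barcode position (start of the read), the `_` separator and the sample read are all hypothetical.

```python
def move_barcode_to_name(name, seq, qual, barcode_length):
    """Strip a leading barcode and record it in the query name.

    Assumes (hypothetically) that the barcode occupies the first
    barcode_length bases of the read.
    """
    barcode = seq[:barcode_length]
    # Keep any comment after the first space; tag only the read id.
    read_id, sep, comment = name.partition(" ")
    new_name = read_id + "_" + barcode + sep + comment
    # Sequence and quality are trimmed by the same amount to stay aligned.
    return new_name, seq[barcode_length:], qual[barcode_length:]

name, seq, qual = move_barcode_to_name(
    "@M01234:1:000000000-ABCDE:1:1101:15589:1332 1:N:0:1",
    "ACGTACGTACGTTTGACCA",
    "I" * 19,
    12,
)
print(name)
print(seq)
```

Keeping the barcode in the name lets downstream tools group reads by barcode without carrying the barcode bases into alignment.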