release 0.1.0

OpenGene · Dec 10, 2015 · 2bdf4e7
1 parent ae33f56
commit 2bdf4e7
Showing 9 changed files with 134 additions and 156 deletions.
diff --git a/README.md b/README.md
@@ -1,107 +1,132 @@
 # AFTER
-Automatic Filtering, Trimming, and Error Removing for fastq data  
-Currently it supports Illumina 1.8 or newer format, see:  
-http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm  
-AFTER can simply go through all fastq files in a folder and then output a <b>good</b> folder and a <b>bad</b> folder, which contains good reads and bad reads of each fastq file  
+Automatic Filtering, Trimming, and Error Removing for fastq data   
+AFTER can simply go through all fastq files in a folder and then output a <b>good</b> folder and a <b>bad</b> folder, which contains good reads and bad reads of each fastq file   
+Currently it supports Illumina 1.8 or newer format, see [here](http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm)   
 
-# Version
-1.0
+# Latest release
+0.1.0 (Released in 2015-12-10)
 
 # Feedback/contact
-[email protected]  
 [email protected]
 
 # Features:
 AFTER does following tasks automatically:  
 1, Filter PolyA/PolyT/PolyC/PolyG reads  
 2, Trim reads at front and tail according to bad per base sequence content  
 3, Detect and eliminate bubble artifact caused by sequencer due to fluid dynamics issue  
-4, Filter low-quality reads  
+4, Filter low-quality reads
+5, Barcode sequencing support: if all reads carry a random barcode (see duplex sequencing), this program can detect the barcode and move it into the query name
 
 # Simple usage:
-##### 1, cd to the folder contains all fastq files  
-##### 2, run:  
-##### python after.py  
+```shell
+cd /path/to/fastq/folder
+python after.py
+```
 
 # Debubble:
 If you want to eliminate bubble artifact, run:  
-##### python after.py --debubble=on  
+```shell
+python after.py --debubble=on
+```
 
 # Full usage:
-###### python after.py [-d input_dir][-1 read1_file] [-2 read1_file] [-7 index1_file] [-5 index2_file] [-g good_output_folder] [-b bad_output_folder] [-f trim_front] [-t trim_tail] [-m min_quality] [-q qualified_quality] [-l max_low_quality] [-p poly_max] [-a allow_poly_mismatch] [-n max_n_count] [--debubble=on/off] [--debubble_dir=xxx] [--draw=on/off] [--read1_flag=\_R1\_] [--read2_flag=\_R2\_] [--index1_flag=\_I1\_] [--index2_flag=\_I2\_] 
+```shell
+python after.py [-d input_dir] [-1 read1_file] [-2 read2_file] [-7 index1_file] [-5 index2_file] [-g good_output_folder] [-b bad_output_folder] [-f trim_front] [-t trim_tail] [-q qualified_quality_phred] [-u unqualified_base_limit] [-p poly_size_limit] [-a allow_mismatch_in_poly] [-n n_base_limit] [-s seq_len_req] [--debubble=on/off] [--debubble_dir=xxx] [--draw=on/off] [--read1_flag=_R1_] [--read2_flag=_R2_] [--index1_flag=_I1_] [--index2_flag=_I2_]
+```
+Common options:
+```shell
+  --version             show program's version number and exit
+  -h, --help            show this help message and exit
+```
+File (name) options:
+```shell
 
-Options:  
-  * --version             show program's version number and exit  
-  * -h, --help            show this help message and exit  
-  * -1 READ1_FILE, --read1_file=READ1_FILE  
+  -1 READ1_FILE, --read1_file=READ1_FILE
                         file name of read1, required. If input_dir is
-                        specified, then this arg is ignored.  
-  * -2 READ2_FILE, --read2_file=READ2_FILE  
+                        specified, then this arg is ignored.
+  -2 READ2_FILE, --read2_file=READ2_FILE
                         file name of read2, if paired. If input_dir is
-                        specified, then this arg is ignored.  
-  * -7 INDEX1_FILE, --index1_file=INDEX1_FILE  
+                        specified, then this arg is ignored.
+  -7 INDEX1_FILE, --index1_file=INDEX1_FILE
                         file name of 7' index. If input_dir is specified, then
-                        this arg is ignored.  
-  * -5 INDEX2_FILE, --index2_file=INDEX2_FILE  
+                        this arg is ignored.
+  -5 INDEX2_FILE, --index2_file=INDEX2_FILE
                         file name of 5' index. If input_dir is specified, then
-                        this arg is ignored.  
-  * -g GOOD_OUTPUT_FOLDER, --good_output_folder=GOOD_OUTPUT_FOLDER  
+                        this arg is ignored.
+  -d INPUT_DIR, --input_dir=INPUT_DIR
+                        the input dir to process automatically. If read1_file
+                        and input_dir are not specified, then current dir (.)
+                        is used as input_dir
+  -g GOOD_OUTPUT_FOLDER, --good_output_folder=GOOD_OUTPUT_FOLDER
                         the folder to store good reads, by default it is the
-                        same folder contains read1  
-  * -b BAD_OUTPUT_FOLDER, --bad_output_folder=BAD_OUTPUT_FOLDER  
+                        same folder that contains read1
+  -b BAD_OUTPUT_FOLDER, --bad_output_folder=BAD_OUTPUT_FOLDER
                         the folder to store bad reads, by default it is same
-                        as good_output_folder  
-  * -f TRIM_FRONT, --trim_front=TRIM_FRONT  
+                        as good_output_folder
+  --read1_flag=READ1_FLAG
+                        specify the name flag of read1, default is _R1_, which
+                        means a file with name *_R1_* is read1 file
+  --read2_flag=READ2_FLAG
+                        specify the name flag of read2, default is _R2_, which
+                        means a file with name *_R2_* is read2 file
+  --index1_flag=INDEX1_FLAG
+                        specify the name flag of index1, default is _I1_,
+                        which means a file with name *_I1_* is index1 file
+  --index2_flag=INDEX2_FLAG
+                        specify the name flag of index2, default is _I2_,
+                        which means a file with name *_I2_* is index2 file
+```
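Note on the name flags above: the following is a minimal sketch of how a flag such as `_R1_` could be matched against file names, assuming the `*_R1_*` wording means a simple wildcard match. The helper `classify_fastq`, the `DEFAULT_FLAGS` mapping, and the sample file names are hypothetical and are not taken from after.py.

```python
from fnmatch import fnmatchcase

# Default flags as documented in the README; this mapping is illustrative only.
DEFAULT_FLAGS = {
    "read1": "_R1_",
    "read2": "_R2_",
    "index1": "_I1_",
    "index2": "_I2_",
}

def classify_fastq(filename, flags=DEFAULT_FLAGS):
    """Return the role whose flag occurs in filename, or None (hypothetical helper)."""
    for role, flag in flags.items():
        # "*_R1_*" style match: the flag may appear anywhere in the name.
        if fnmatchcase(filename, "*" + flag + "*"):
            return role
    return None
```

With these defaults, a name like `Sample_S1_L001_R1_001.fastq` would be treated as read1, while a file without any flag is left unclassified.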
+Filter options:
+```shell
+  -f TRIM_FRONT, --trim_front=TRIM_FRONT
                         number of bases to be trimmed in the head of read. -1
-                        means auto detect  
-  * -t TRIM_TAIL, --trim_tail=TRIM_TAIL  
+                        means auto detect
+  -t TRIM_TAIL, --trim_tail=TRIM_TAIL
                         number of bases to be trimmed in the tail of read. -1
-                        means auto detect  
-  * -m MIN_QUALITY, --min_quality=MIN_QUALITY  
-                        if exists one base has quality < min_quality, then
-                        this read/pair will be bad. Default 0 means do not
-                        filter reads by the least quality  
-  * -q QUALIFIED_QUALITY, --qualified_quality=QUALIFIED_QUALITY  
+                        means auto detect
+  -q QUALIFIED_QUALITY_PHRED, --qualified_quality_phred=QUALIFIED_QUALITY_PHRED
                         the quality value at which a base is qualified. Default 20
-                        means base quality >=Q20 is qualified.  
-  * -l MAX_LOW_QUALITY, --max_low_quality=MAX_LOW_QUALITY  
-                        if exists more than maxlq bases that quality is lower
-                        than qualified quality, then this read/pair is bad.
-                        Default 0 means do not filter reads by low quality
-                        base count  
-  * -p POLY_MAX, --poly_max=POLY_MAX  
+                        means base quality >=Q20 is qualified.
+  -u UNQUALIFIED_BASE_LIMIT, --unqualified_base_limit=UNQUALIFIED_BASE_LIMIT
+                        if more than unqualified_base_limit bases have quality
+                        lower than the qualified quality, then this read/pair
+                        is bad. Default 0 means do not filter reads by low
+                        quality base count
+  -p POLY_SIZE_LIMIT, --poly_size_limit=POLY_SIZE_LIMIT
                         if exists one polyX(polyG means GGGGGGGGG...), and its
-                        length is >= poly_max, then this read/pair is bad.
-                        Default is 40  
-  * -a ALLOW_POLY_MISMATCH, --allow_poly_mismatch=ALLOW_POLY_MISMATCH  
+                        length is >= poly_size_limit, then this read/pair is
+                        bad. Default is 40
+  -a ALLOW_MISMATCH_IN_POLY, --allow_mismatch_in_poly=ALLOW_MISMATCH_IN_POLY
                         the count of allowed mismatches when evaluating
-                        poly_X. Default 5 means disallow any mismatches  
-  * -n MAX_N_COUNT, --max_n_count=MAX_N_COUNT  
+                        poly_X. Default 5 means disallow any mismatches
+  -n N_BASE_LIMIT, --n_base_limit=N_BASE_LIMIT
                         if more than n_base_limit bases are N, then this
-                        read/pair is bad. Default is 5  
-  * -s MIN_SEQ_LEN, --min_seq_len=MIN_SEQ_LEN  
-                        if the trimmed read is shorter than min_seq_len, then
-                        this read/pair is bad. Default is 35  
-  * -d INPUT_DIR, --input_dir=INPUT_DIR  
-                        the input dir to process automatically. If read1_file
-                        are input_dir are not specified, then current dir (.)
-                        is specified to input_dir  
-  * --debubble=DEBUBBLE   specify whether apply debubble algorithm to remove the
-                        reads in the bubbles. Default is off  
-  * --debubble_dir=DEBUBBLE_DIR  
+                        read/pair is bad. Default is 5
+  -s SEQ_LEN_REQ, --seq_len_req=SEQ_LEN_REQ
+                        if the trimmed read is shorter than seq_len_req, then
+                        this read/pair is bad. Default is 35
+```
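Note on the filter options above: the sketch below shows one way the documented thresholds could combine into a single good/bad decision for a read. It assumes Phred+33 quality encoding (as used by Illumina 1.8 and newer) and treats a polyX run as the longest window containing at most `allow_mismatch_in_poly` other bases. Both functions are hypothetical illustrations, not the logic of after.py.

```python
def longest_poly_run(seq, base, allowed_mismatch):
    """Length of the longest window with at most allowed_mismatch non-`base` characters."""
    best = left = mismatches = 0
    for right, ch in enumerate(seq):
        if ch != base:
            mismatches += 1
        # Shrink the window from the left until it is within the mismatch budget.
        while mismatches > allowed_mismatch:
            if seq[left] != base:
                mismatches -= 1
            left += 1
        best = max(best, right - left + 1)
    return best

def is_good_read(seq, qual, qualified_quality_phred=20, unqualified_base_limit=0,
                 poly_size_limit=40, allow_mismatch_in_poly=5,
                 n_base_limit=5, seq_len_req=35):
    """Apply the documented thresholds to one (already trimmed) read."""
    if len(seq) < seq_len_req:
        return False
    if seq.count("N") > n_base_limit:
        return False
    # Phred+33: quality score = ASCII code minus 33 (assumption stated above).
    low = sum(1 for c in qual if ord(c) - 33 < qualified_quality_phred)
    # A limit of 0 means "do not filter by low quality base count".
    if unqualified_base_limit > 0 and low > unqualified_base_limit:
        return False
    for base in "ATCG":
        if longest_poly_run(seq, base, allow_mismatch_in_poly) >= poly_size_limit:
            return False
    return True
```

The defaults mirror the values listed in the option help; how after.py itself counts mismatches inside a polyX run is not stated in the README, so the window definition here is only one plausible reading.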
+Debubble options:
+```shell
+  --debubble=DEBUBBLE   specify whether apply debubble algorithm to remove the
+                        reads in the bubbles. Default is off
+  --debubble_dir=DEBUBBLE_DIR
                         specify the folder to store output of debubble
-                        algorithm, default is debubble  
-  * --draw=DRAW           specify whether draw the pictures or not, when use
-                        debubble or QC. Default is on  
-  * --read1_flag=READ1_FLAG  
-                        specify the name flag of read1, default is _R1_, which
-                        means a file with name *_R1_* is read1 file  
-  * --read2_flag=READ2_FLAG  
-                        specify the name flag of read2, default is _R2_, which
-                        means a file with name *_R2_* is read2 file  
-  * --index1_flag=INDEX1_FLAG  
-                        specify the name flag of index1, default is _I1_,
-                        which means a file with name *_I1_* is index2 file  
-  * --index2_flag=INDEX2_FLAG  
-                        specify the name flag of index2, default is _I2_,
-                        which means a file with name *_I2_* is index2 file  
+                        algorithm, default is debubble
+  --draw=DRAW           specify whether draw the pictures or not, when use
+                        debubble or QC. Default is on
+```
+Barcoded sequencing options:
+```shell
+  --barcode_length=BARCODE_LENGTH
+                        specify the designed length of barcode
+  --barcode_flag=BARCODE_FLAG
+                        specify the name flag of a barcoded file, default is
+                        barcode, which means a file with name *barcode* is a
+                        barcoded file
+  --barcode=BARCODE     specify whether deal with barcode sequencing files,
+                        default is on, which means all files with barcode_flag
+                        in filename will be treated as barcode sequencing
+                        files
+```
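Note on the barcode options above: a minimal sketch of what moving a barcode into the query name could look like. It assumes the barcode occupies the first `barcode_length` bases of the read and is appended to the first token of the header with a `:` separator. The function name, the separator, and the sample record are assumptions for illustration; the README does not specify the format after.py writes.

```python
def move_barcode_to_name(name, seq, qual, barcode_length):
    """Strip the leading barcode from a read and append it to the query name."""
    barcode = seq[:barcode_length]
    # Keep any comment after the first space (e.g. "1:N:0:1") untouched.
    head, sep, rest = name.partition(" ")
    new_name = head + ":" + barcode + sep + rest
    # Sequence and quality strings must stay the same length, so trim both.
    return new_name, seq[barcode_length:], qual[barcode_length:]

name, seq, qual = move_barcode_to_name(
    "@M01:1:000:1:1101:100:200 1:N:0:1",  # hypothetical header
    "ACGTACGTTTGGCC",
    "IIIIIIIIIIIIII",
    barcode_length=8,
)
```

Keeping the barcode in the query name means downstream tools can still group reads by barcode after the barcode bases have been removed from the sequence.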