-
Notifications
You must be signed in to change notification settings - Fork 0
aaschaffer/rRNA_sensor
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
rRNA_sensor: a script to recognize SSU rRNA sequences Authors: Alejandro Schaffer and Eric Nawrocki github: https://github.com/aaschaffer/rRNA_sensor.git Version: 0.14 Dec 2020 ------------------ README Outline of this file: SETTING UP ENVIRONMENT VARIABLES BLAST REQUIREMENT SAMPLE RUN ############################################################################## SETTING UP ENVIRONMENT VARIABLES Before you can run rRNA_sensor_script you will need to update some of your command line variables. To do this, add the following three lines to your .bashrc file (if you use bash shell) or .cshrc file (if you use C shell or tcsh). The .bashrc or .cshrc file will be in your home directory. To determine what shell you use type 'echo $SHELL', if it returns '/bin/bash', then update your .bashrc file, if it returns '/bin/csh' or '/bin/tcsh' then update your .cshrc file. The 3 lines to add to your .bashrc file, ***but make sure that you replace PATH/TO/RRNA/SENSOR with the directory path to the directory created when you cloned the rRNA_sensor github repository. ----------- export RRNASENSORDIR="PATH/TO/RRNA/SENSOR" export PATH="$RRNASENSORDIR:$PATH"; export BLASTDB="$RRNASENSORDIR:$BLASTDB"; ----------- The 3 lines to add to your .cshrc file: ----------- setenv RRNASENSORDIR "PATH/TO/RRNA/SENSOR" setenv PATH "$RRNASENSORDIR":"$PATH" setenv BLASTDB "$RRNASENSORDIR":"$BLASTDB" ----------- Then, after adding those 3 lines, execute this command: source ~/.bashrc OR source ~/.cshrc To check that your environment variables are properly set up do the following three commands: echo $RRNASENSORDIR echo $PATH echo $BLASTDB The first command should return only: /PATH/TO/RRNA/SENSOR The second echo commands should return a potentially longer string that begins with the same path: /PATH/TO/RRNA/SENSOR And the third echo command should return a potentially larger string that includes (but not necessarily begins with) the same path: /PATH/TO/RRNA/SENSOR If that is not the case, please email Eric Nawrocki ([email protected]). If you do see the expected output, the following sample run should work. ############################################################################## BLAST REQUIREMENT rRNA_sensor requires blastn version 2.11.0+ or later. You'll need to install it and have 'blastn' in your PATH for rRNA_sensor_script to work. Alternatively, you can download and install Ribovore (https://github.com/ncbi/ribovore), which will install blast and rRNA_sensor automatically using the `install.sh` script here: https://raw.githubusercontent.com/ncbi/ribovore/master/install.sh ############################################################################## SAMPLE RUN Execute the following command: 16S sample run: rRNA_sensor_script 400 2500 $RRNASENSORDIR/test_data/r.100.ssu.16S.fa class.txt 80 1e-40 1 sensor-out 16S_centroids You should see this output: Final output saved as sensor-out/class.txt [rRNA_sensor v0.14] 18S sample run: rRNA_sensor_script 400 2500 $RRNASENSORDIR/test_data/r.100.ssu.18S.fa class.txt 80 1e-40 1 sensor-out2 18S_centroids.1091 You should see this output: Final output saved as sensor-out2/class.txt [rRNA_sensor v0.14] The first argument is a lower bound on length. The second argument is an upper bound on length. The third argument is a set of input queries in FASTA format; it is assumed that the first token after the > on the defline identifies the query and that all query identifiers in the file are distinct. The fourth argument is the output file. The fifth argument is a lower bound on the identity percentage. The sixth argument is an upper bound of the E-value The seventh argument is the number of threads to use in running blastn. The eigth argument is the name of a directory to store output files. The ninth argument is the name of the BLAST DB. The two bounds on the right are applied in a logical AND fashion, so we look for matches that are above the identity percentage threshold and below the E-value threshold The bounds 80 and 1e-40 worked well on several tests. In contrast, we did not find a general setting of the length lower and upper bounds that works well on all tests. This is because different experiments to collect 16S sequences use different primer pairs that naturally lead to different expected lengths. The bounds shown of 400 and 2500 worked well on a heterogeneous (not from a single laboratory) set of 16S sequences from GenBank's nr database. In the context of submissions, however, one would typically expect that a) the lengths are homogeneous and b) the submitter should be able to specify what range of lengths is expected. Therefore, when this software is used in the context of submissions, we recommend to set the length parameters based on advice from the submitter, rather than by guessing or by having a default setting. Files needed: This file: README 16S BLAST database: 16S_centroids.nhr 16S_centroids.nin 16S_centroids.nsq 18S BLAST database: 18S_centroids.1091.nhr 18S_centroids.1091.nin 18S_centroids.1091.nsq Programs used by the script: sensor_classification3.pl sensor_partition_by_length2.pl Sample input nucl_queries.fsa The current script: rRNA_sensor_script The output has one row per query. Column 1 is the query identifier. Column 2 is the classification. Column 3 is the strand of the sequence relative to the strand of the matching database 16S sequence, or NA if there is no match. Column 4 is the number of local alignments between the sequence and the best matching database 16S sequence, or NA if there is no match. Column 5 is the query coverage percentage of the best alignment between sequence and the best matching database 16S sequence, or NA if there is no match. All queries shorter than lower length threshold are classified as too short, although using the example lower threshold of 400, they could be partial 16S RNA sequences. All queries longer than the upper length threshold are classified as too long, although in reality they could be genomic sequences that include a 16S RNA sequence somewhere inside. All sequences with length between the two thresholds are classified as one of: yes no imperfect_match "imperfect_match" means that there is match to the 16S database but either its identity percentage is below threshold or its E-value is above threshold (for the best match), while "no" means that there is not a significant match at all. In column 3, plus means same strand and minus means opposite strand, and mixed means at least one one alignment on each strand In column 4, values > 1 might be evidence of a missassembly or gap in the query sequence. If column 2 has the value no, then the values in column 3 and column 4 will appear as NA (short, for not applicable). -------------- The blastn command used by rRNA_sensor can be found in the rRNA_sensor_script file. The makeblastdb command used to build the 16S and 18S blast databases was: makeblastdb -in 16S_centroids -dbtype nucl makeblastdb -in 18S_centroids.1091 -dbtype nucl When a file named 16S_centroids that is identical to 16S_centroids.fa existed and a file named 18S_centroids.1091 that is identical to 18S_centroids.1091.fa. -------------- Last updated: EPN, Mon Dec 21 13:53:37 2020
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Packages 0
No packages published