Skip to content

aaschaffer/rRNA_sensor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

75 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

rRNA_sensor: a script to recognize SSU rRNA sequences
Authors: Alejandro Schaffer and Eric
Nawrocki

github: https://github.com/aaschaffer/rRNA_sensor.git
Version: 0.14 Dec 2020
------------------
README

Outline of this file:
SETTING UP ENVIRONMENT VARIABLES
BLAST REQUIREMENT
SAMPLE RUN

##############################################################################
SETTING UP ENVIRONMENT VARIABLES

Before you can run rRNA_sensor_script you will need to update some of your
command line variables. To do this, add the following three lines to
your .bashrc file (if you use bash shell) or .cshrc file (if you use C
shell or tcsh). The .bashrc or .cshrc file will be in your home
directory. To determine what shell you use type 'echo $SHELL', if it
returns '/bin/bash', then update your .bashrc file, if it returns
'/bin/csh' or '/bin/tcsh' then update your .cshrc file.

The 3 lines to add to your .bashrc file, ***but make sure that you
replace PATH/TO/RRNA/SENSOR with the directory path to the directory
created when you cloned the rRNA_sensor github repository.
-----------
export RRNASENSORDIR="PATH/TO/RRNA/SENSOR"
export PATH="$RRNASENSORDIR:$PATH";
export BLASTDB="$RRNASENSORDIR:$BLASTDB";
-----------

The 3 lines to add to your .cshrc file:
-----------
setenv RRNASENSORDIR "PATH/TO/RRNA/SENSOR"
setenv PATH "$RRNASENSORDIR":"$PATH"
setenv BLASTDB "$RRNASENSORDIR":"$BLASTDB"
-----------

Then, after adding those 3 lines, execute this command:
source ~/.bashrc
OR
source ~/.cshrc

To check that your environment variables are properly set up do the
following three commands:
echo $RRNASENSORDIR
echo $PATH
echo $BLASTDB

The first command should return only:
/PATH/TO/RRNA/SENSOR

The second echo commands should return a potentially longer
string that begins with the same path:
/PATH/TO/RRNA/SENSOR

And the third echo command should return a potentially larger string
that includes (but not necessarily begins with) the same path: 
/PATH/TO/RRNA/SENSOR

If that is not the case, please email Eric Nawrocki
([email protected]). If you do see the expected output, the
following sample run should work.

##############################################################################
BLAST REQUIREMENT

rRNA_sensor requires blastn version 2.11.0+ or later. You'll need to install it
and have 'blastn' in your PATH for rRNA_sensor_script to work.

Alternatively, you can download and install Ribovore
(https://github.com/ncbi/ribovore), which will install blast and
rRNA_sensor automatically using the `install.sh` script here:
https://raw.githubusercontent.com/ncbi/ribovore/master/install.sh

##############################################################################
SAMPLE RUN

Execute the following command:

16S sample run:
rRNA_sensor_script 400 2500 $RRNASENSORDIR/test_data/r.100.ssu.16S.fa class.txt 80 1e-40 1 sensor-out 16S_centroids

You should see this output:
Final output saved as sensor-out/class.txt [rRNA_sensor v0.14]

18S sample run:
rRNA_sensor_script 400 2500 $RRNASENSORDIR/test_data/r.100.ssu.18S.fa class.txt 80 1e-40 1 sensor-out2 18S_centroids.1091

You should see this output:
Final output saved as sensor-out2/class.txt [rRNA_sensor v0.14]

The first argument is a lower bound on length.
The second argument is an upper bound on length.
The third argument is a set of input queries in FASTA format; it is assumed that
the first token after the > on the defline identifies the query and that
all query identifiers in the file are distinct.
The fourth argument is the output file.
The fifth argument is a lower bound on the identity percentage.
The sixth argument is an upper bound of the E-value
The seventh argument is the number of threads to use in running blastn.
The eigth argument is the name of a directory to store output files.
The ninth argument is the name of the BLAST DB.

The two bounds on the right are applied in a logical AND fashion, so we look for
matches that are above the identity percentage threshold and below the E-value threshold
The bounds 80 and 1e-40 worked well on several tests.
In contrast, we did not find a general setting of the
length lower and upper bounds that works well on all tests. This
is because different experiments to collect 16S sequences use
different primer pairs that naturally lead to different expected lengths.
The bounds shown of 400 and 2500 worked well on a heterogeneous (not from a single
laboratory) set of 16S sequences from GenBank's nr database. In the context
of submissions, however, one would typically expect that
a) the lengths are homogeneous
and
b) the submitter should be able to specify what range of lengths is expected.
Therefore, when this software is used in the context of submissions, we recommend
to set the length parameters based on advice from the submitter, rather than by guessing
or by having a default setting.

Files needed:
This file:
README

16S BLAST database:
16S_centroids.nhr
16S_centroids.nin
16S_centroids.nsq

18S BLAST database:
18S_centroids.1091.nhr
18S_centroids.1091.nin
18S_centroids.1091.nsq

Programs used by the script:
sensor_classification3.pl
sensor_partition_by_length2.pl

Sample input
nucl_queries.fsa

The current script:
rRNA_sensor_script

The output has one row per query.
Column 1 is the query identifier.
Column 2 is the classification.
Column 3 is the strand of the sequence relative to the strand of the matching database 16S sequence, or NA if there is no match.
Column 4 is the number of local alignments between the sequence and the best matching database 16S sequence, or NA if there is no match.
Column 5 is the query coverage percentage of the best alignment between sequence and the best matching database 16S sequence, or NA if there is no match.

All queries shorter than lower length threshold are classified as too short, although using the example lower
threshold of 400, they could be partial 16S RNA sequences.

All queries longer than the upper length threshold are classified as too long, although in reality they
could be genomic sequences that include a 16S RNA sequence somewhere inside.

All sequences with length between the two thresholds are classified as one of:
yes
no
imperfect_match

"imperfect_match" means that there is match to the 16S database but either its identity
percentage is below threshold or its E-value is above threshold (for the best match), while
"no" means that there is not a significant match at all. 

In column 3, plus means same strand and minus means opposite strand, and mixed means at least one
             one alignment on each strand

In column 4, values > 1 might be evidence of a missassembly or gap in the query sequence.

If column 2 has the value no, then the values in column 3 and column 4 will appear as NA (short, for not applicable).

--------------
The blastn command used by rRNA_sensor can be found in the rRNA_sensor_script file.

The makeblastdb command used to build the 16S and 18S blast databases
was:

makeblastdb -in 16S_centroids -dbtype nucl

makeblastdb -in 18S_centroids.1091 -dbtype nucl

When a file named 16S_centroids that is identical to 16S_centroids.fa
existed and a file named 18S_centroids.1091 that is identical to
18S_centroids.1091.fa.

--------------
Last updated: EPN, Mon Dec 21 13:53:37 2020

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

No packages published