Skip to content

Run vecscreen and interpret and annotate output matches taking taxonomy into account

Notifications You must be signed in to change notification settings

aaschaffer/vecscreen_plus_taxonomy

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

vecscreen_plus_taxonomy
Author: Alejandro Schaffer
Documentation help and code review: Eric Nawrocki 

github: https://github.com/aaschaffer/vecscreen_plus_taxonomy.git
Version: 0.17
December 2020
------------------
README

This file describes

-- a script called from_vecscreen_to_summary.pl and its constituent
programs

and
-- a summary classification script compare_vector_matches_wtaxa.pl
that can be used to run vecscreen and parse its outputs, including
taxonomy information.

There is a related repository 

github: https://github.com/aaschaffer/generate_vecscreen_candidates.git

Outline of this file:

BACKGROUND
SETTING UP ENVIRONMENT VARIABLES
SAMPLE RUN, USAGE AND OUTPUT OF from_vecscreen_to_summary.pl
SAMPLE RUN, USAGE AND OUTPUT OF compare_vector_matches_w_taxa.pl
METHODS
TEST SCRIPT
RELEVANT FILES
CREATING THE TAXONOMY FILE
AN ADDITIONAL TAXONOMY HELPER PROGRAM
COMPILING vecscreen AND srcchk ON LINUX, FROM SCRATCH

**************
**BACKGROUND**
**************

vecscreen is the established NCBI program to identify matches between
(query) sequences and (subject) vectors in UniVec. These matches may
represent (true) vector contamination, but experience has shown that
there can be many false positives.  A primary reason for false
positives is that the query and the matching subject segment come from
the same genus or closely related genera. Therefore, knowing the genus
of the query sequence is helpful to interpret the output.  For what
follows below, it is relevant that vecscreen distinguishes matches by
two characteristics:

1) Location: Internal or Terminal 
A match is Terminal if and only if it includes a nucleotide within 25
positions of either end of the query; otherwise, the match is Internal.

2) Strength: Strong, Moderate, or Weak
A match is Strong if either: it is Terminal with a raw score of at
least 24 or it is Internal with a raw score of at least 30.

A match is Moderate if either: it is Terminal with a raw score in the
interval [19,23] or Internal with a raw score in the interval [25,29].

A match is Weak if it is Terminal with a raw score in the interval
[16,18] or Internal with a raw score in the interval [23,24].

vecscreen also reports Internal alignments with raw scores in the
range [16, 22] when there is also a reportable match for the same
(query, vector) pair. The score range [16,22] is below the Weak range
match for Internal matches. In the script
from_vecscreen_to_summary.pl, these below-Weak Internal matches are
assigned the level None.

More information about vecscreen can be found at
https://www.ncbi.nlm.nih.gov/tools/vecscreen/about/
In this document, 'vecscreen' refers to the command-line version, not
the Web page version.

The programs and data in this vecscreen_plus_taxonomy repository are
for prospective analysis of nucleotide sequences, especially
sequences that one might want to submit to GenBank.

The related repository 

github: https://github.com/aaschaffer/generate_vecscreen_candidates.git

does a preprocessing step relevant to only retrospective analysis of
sequences already in a database set up for BLAST, such as the non-redundant
(nr) database in GenBank. In retrospective analysis, one wants to start
by searching the database for sequences that may be expected to have vecscreen
matches, so as to avoid running vecscreen on the entire database. 

************************************
**SETTING UP ENVIRONMENT VARIABLES**
************************************

Before you can run from_vecscreen_to_summary.pl or
compare_vector_matches_wtaxa.pl you will need to update some of your
environment variables. To do this, add the following four lines to
your .bashrc file (if you use bash shell) or .cshrc file (if you use C
shell or tcsh). The .bashrc or .cshrc file will be in your home
directory. To determine what shell you use type 'echo $SHELL', if it
returns '/bin/bash', then update your .bashrc file, if it returns
'/bin/csh' or '/bin/tcsh' then update your .cshrc file.

The 4 lines to add to your .bashrc file, ***but make sure that you
replace PATH/TO/VEC/PLUS with the directory path to the directory
created when you cloned the vecscreen_plus_taxonomy github repository.

-----------
export VECPLUSDIR="PATH/TO/VEC/PLUS"
export PERL5LIB="$VECPLUSDIR:$PERL5LIB"
export PATH="$VECPLUSDIR/scripts:$PATH"
export BLASTDB="$VECPLUSDIR/univec-files:$BLASTDB"
-----------

The 4 lines to add to your .cshrc file:
-----------
setenv VECPLUSDIR "PATH/TO/VEC/PLUS"
setenv PERL5LIB "$VECPLUSDIR":"$PERL5LIB"
setenv PATH "$VECPLUSDIR/scripts":"$PATH"
setenv BLASTDB "$VECPLUSDIR/univec-files":"$BLASTDB"
-----------

Then, after adding those 4 lines, execute this command:
source ~/.bashrc
OR
source ~/.cshrc

If PERL5LIB was not already defined (you will know, if you get an error
message when you run the above 'source' command): 
use instead
export PERL5LIB="$VECPLUSDIR"
for .bashrc, OR
setenv PERL5LIB "$VECPLUSDIR"
for .cshrc.
at line 2 out of 4. 

Similarly, if BLASTDB was not already defined (you will know, if you
get an error message when you run the above 'source' command): 
use instead
export BLASTDB="$VECPLUSDIR/univec-files"
for .bashrc, OR
setenv BLASTDB "$VECPLUSDIR/univec-files"
for .cshrc.
at line 4 out of 4. 

To check that your environment variables are properly set up do the
following four commands:
echo $VECPLUSDIR
echo $PERL5LIB
echo $PATH
echo $BLASTDB

The first command should return only:
PATH/TO/VEC/PLUS

The second echo command should return a potentially longer
string that begins with the same path:
PATH/TO/VEC/PLUS

The third echo command should return a (potentially longer) string
that begins with:
PATH/TO/VEC/PLUS/scripts

The fourth echo command should return a (potentially longer) string
that begins with:
PATH/TO/VEC/PLUS/univec-files

If that is not the case, please email Alejandro Schaffer
([email protected]). If you do see the expected output, the
following sample run should work.

Finally, you will need to gunzip two exectuable files in
PATH/TO/VEC/PLUS/scripts, and make the files executable (in case they
are not already executable). Perform the following commands:

> cd $VECPLUSDIR/scripts
> gunzip vecscreen.gz
> gunzip srcchk.gz
> chmod +x vecscreen
> chmod +x srcchk

After this, you should be able to successfully complete the sample
runs below, as well as the test script in the TEST SCRIPT section
below. 

****************************************************************
**SAMPLE RUN, USAGE AND OUTPUT OF from_vecscreen_to_summary.pl**
****************************************************************

Here is an example command that will run from_vecscreen_to_summary.pl
on the fasta file myseqs.fa using the taxonomy file included in 
PATH/TO/VEC/PLUS/info-files/taxonomy_tree_wlevels.txt. 

> from_vecscreen_to_summary.pl --output_root mytest --input_fasta $VECPLUSDIR/test-files/test.input_sequence_file.fa --input_taxa $VECPLUSDIR/info-files/taxonomy_tree_wlevels.txt --combine_output --verbose > mytest.out

This command is contained in the file
$VECPLUSDIR/scripts/sample_run.sh. In fact, it is better to just run
that shell script file, because it will also check that the expected
output is correct:

> sh $VECPLUSDIR/scripts/sample_run.sh

When you run the above 'sh' command, you should see output like this:
--
comparing expected output test-files/expected.output_combined_wtaxonomy.txt to observed output mytest.output_combined_wtaxonomy.txt
SUCCESS: Files are identical
--

After running the above from_vecscreen_to_summary.pl command, the
one-line per sequence output is in the file
mytest.output_combined_wtaxonomy.txt. This 11-column format is
described below under 'Verbose output format (enabled with --verbose
option): 11 columns'.

The output of the script to stdout, which describes briefly what the
script is doing, is in the file mytest.out.

The input sequence file
($VECPLUSDIR/test-files/test.input_sequence_file.fa) can be any
nucleotide sequence file in FASTA format.

The taxonomy file ($VECPLUSDIR/info-files/taxonomy_tree_wlevels.txt)
is a file in a special format that includes taxonomy information based
on NCBI's taxonomy. This can either be created by the user from the
NCBI taxonomy file, or the user can use the provided file
taxonomy_tree_wlevels.txt, which was created in April 2018. If you are
using this in 2018, the provided file above should be fine. After
that, you should create a new up-to-date file. Instructions for doing
that are provided below in the section CREATING THE TAXONOMY FILE.

The options that can be provided to from_vecscreen_to_summary.pl are:
  --input_fasta <s> : REQUIRED: file name <s> with sequences in fasta format 
  --input_taxa <s>  : REQUIRED: file name <s> mapping vecscreen matches to taxa
  --output_root <s> : REQUIRED: output files will be named starting with <s>
  --verbose         : output 11 columns instead of 5
  --combine_output  : combine internal and terminal matches
  --keep	    : keep all intermediate files (e.g. vecscreen output)

The --input_fasta <s> and --input_taxa <s> and --output_root <s>
options are required when running from_vecscreen_to_summary.pl, while
--verbose and --combine_output and --keep are optional.  By default
--verbose and --combine_output and --keep are all turned off.

The --combine_output option determines whether one or two files of
output are produced.  Currently the names of the output files are
partly fixed at:
<output_root>.output_combined_wtaxonomy.txt (if --combine_output is applied) 
<output_root>.output_internal_wtaxonomy.txt (if --combine_output is not applied) 
<output_root>.output_terminal_wtaxonomy.txt (if --combine_output is not applied)

The option --verbose determines how many columns of output are
produced, as described in the OUTPUT section below. It does not cause
more diagnostic output to be printed during the execution of the
script. The extra columns are important for classifying some matches
as true or false contamination.

The option --keep determines whether the output file from running
vecscreen within the script is kept (--keep used) or deleted (--keep
not used, default).

from_vecscreen_to_summary.pl will create one line of tabular output
per vecscreen hit in the input sequence file. There are two possible
output formats. For both formats, columns are separated by tabs.

--------------------------------
Default output format: 5 columns
--------------------------------
By default (if --verbose is not used) then the format of those lines
will be the following five columns:

Column 1: Accession of query
Column 2: Genus of query if known, or 1 otherwise
Column 3: Matching vector, starting with uv|
Column 4: One end of the alignment in the vector
Column 5: The other end of the alignment in the vector

This 5 column format was agreed on for internal NCBI usage in JIRA
ticket SM-187.

----------------------------------------------------------------
Verbose output format (enabled with --verbose option): 11 columns
----------------------------------------------------------------
If --verbose is used, then each line of output will include the
following 11 columns:

Column 1:  Accession of query
Column 2:  Genus of query if known, or 1 otherwise
Column 3:  Species of query if known, or 1 otherwise
Column 4:  Lower end of the alignment in the query
Column 5:  Upper end of the alignment in the query
Column 6:  Matching vector, starting with uv|
Column 7:  One end of the alignment in the vector
Column 8:  The other end of the alignment in the vector
Column 9:  The strength of this vecscreen match 
Column 10: The strength of the strongest vecscreen match for this query
Column 11: Whether there is any dangling part (called "Suspect" by
           vecscreen) at either end of the query

A dangling part is an unmatched segment of <= 25 nucleotides.

This alternative 11-column format has been shown to be useful for some
purposes, such as correcting vector-contaminated sequences in GenBank.
Another circumstance in which the 11-column format
is essential is if there is an input sequence that has a known species
(not 1 in column 3) but do not have a known genus (1 in column 2).

*********************************************************************
**SAMPLE RUN, USAGE, AND OUTPUT OF compare_vector_matches_w_taxa.pl**
*********************************************************************

The compare_vector_matches_w_taxa.pl script takes as input the
one-line per-sequence output file from
from_vecscreen_to_summary.pl. That input file must be the 11-column
output of from_vecscreen_to_summary.pl that is created when the
--verbose option is used for that script.

Here is the example usage of compare_vector_matches_wtaxa.pl:

compare_vector_matches_wtaxa.pl \ 
--input_summary $VECPLUSDIR/test-files/test.sample_input_final_step.txt \
--input_taxa $VECPLUSDIR/info-files/taxonomy_tree_wlevels.txt \
--input_artificial_vectors $VECPLUSDIR/info-files/artificial_whole_sequences.txt \
--input_artificial_segments $VECPLUSDIR/info-files/artificial_intervals.txt \
--input_univec_sources $VECPLUSDIR/info-files/biological_exclusions.txt \
--input_amr $VECPLUSDIR/info-files/UniVec10_vs_amr_distinct_intervals.txt \
--input_sequences $VECPLUSDIR/test-files/test.sample_candidates.fa \
--outfile my_output_final_step.txt

This command is contained in the file
$VECPLUSDIR/test-files/sample_compare_run.sh. As above, it's better to
just run that shell script file, because it will also check that the
expected output is correct:

> sh $VECPLUSDIR/scripts/sample_compare_run.sh

When you run the above 'sh' command, you should see output like this:
--
comparing expected output test-files/expected.output_final_step.txt to observed output my_outputfinal_step.txt
SUCCESS: Files are identical
--

After running this compare_vector_matches_wtaxa.pl command, the file
'my_output_final_step.txt' will include the output of
compare_vector_matches_wtaxa.pl which is the file
test-files/sample_input_final_step.txt with 3 additional columns:

Column 12: the classification of the match
Column 13: Most pertinent taxid of the vector interval 
Column 14: Lowest common ancestor of column 2 and column 13

The possible classifications in column 12 are currently:
NO_DATA, TRUE_ARTIFICIAL, TRUE_ARTIFICIAL_MICROSAT, FALSE_AMR,
FALSE_BIOLOGICAL, TRUE_BIOLOGICAL, TRUE_MICROSAT, LIKELY_FALSE,BACTERIAL.
These are explained below.
In practice, one mainly wants to distinguish between:
{TRUE_ARTIFICIAL, TRUE_ARTIFICIAL_MICROSAT, TRUE_BIOLOGICAL, TRUE_MICROSAT} which are true contamination,
versus
{FALSE_AMR, FALSE_BIOLOGICAL, LIKELY_FALSE,BACTERIAL}, which are false contamination.


In some cases, the classification TRUE_BIOLOGICAL may be too bold and
this can be seen because Column 14 is not much higher up the taxonomy
tree than column 13.  When this happens, the conservation of the
vector source needs to be propagated from genus_level_exclusions.txt
to one of the higher-level files*:
   superkingdom_level_exclusions.txt
   kingdom_level_exclusions.txt
   phylum_level_exclusions.txt
   class_level_exclusions.txt
   order_level_exclusions.txt
   family_level_exclusions.txt
   tribe_level_exclusions.txt

* Please email [email protected] if you find any examples of
vector sources that should be propagated to a higher level of taxonomy.

***********
**METHODS**
***********

Given the input taxonomy file and an input sequence file in FASTA
format, from_vecscreen_to_summary.pl will do the following:

1) Run vecscreen on the input FASTA-formatted sequence file to
   identify high-scoring matches to known vector sequences in UniVec
   in the input sequence file.

2) Parse the vecscreen output into two tab-delimited files for
   internal and terminal matches by calling parse_vecscreen.pl.

3) Optionally combine the two summary files into one by calling
   combine_summaries.pl.

4) Add taxonomy information to the vecscreen output by calling srcchk
   and add_taxonomy.pl.

compare_vector_matches_wtaxa.pl uses the six sets of data files listed
to classify each vecscreen match.  This program is separate because an
in-house usage needed somewhat different I/O specifications to fit
into an existing software framework.

***************
**TEST SCRIPT**
***************
There is a 'test' script included in the vecscreen_plus_taxonomy
distribution that you should run to make sure that everything is set
up correctly, in addition to doing the two example sample runs above.

To run the test script, execute the following command:
> $VECPLUSDIR/scripts/test_vecscreen_plus_taxonomy_scripts.pl

You should see the following output:
--
Checking that required input files exist ... done.
Testing combine_summaries.pl             ... done.
Testing add_taxonomy.pl                  ... done.
Testing from_vecscreen_to_summary.pl     ... done.
Testing compare_vector_matches_wtaxa.pl  ... done.
# All tests passed.
# SUCCESS
--

If you do not see this output, make sure that you've set up your
environment variable $VECPLUSDIR correctly as explained above in the
'SETTING UP ENVIRONMENT VARIABLES' section. If you still have
problems, email [email protected].

******************
**RELEVANT FILES**
******************

Several executable files are required for from_vecscreen_to_summary.pl
and compare_vector_matches_wtaxa.pl to work. Two of these executables
were developed by others and must be downloaded and installed
separately outside of NCBI. All of these files are included in the
vecscreen_plus_taxonomy github repository, so you do not need to
create or move any files in order to get from_vecscreen_to_summary.pl
to work. After cloning the git repository, these files will be in the
PATH/TO/VEC/PLUS/scripts directory.

The first two files are NCBI executable programs that were not authored by
Alejandro Schaffer:

1. vecscreen and associated UniVec database
   Identifies vector contamination in input sequences.

   Within NCBI, vecscreen can be found at
   /netopt/ncbi_tools64/c++.stable/ReleaseMT/bin/vecscreen

   At NCBI, to add this directory to your path execute this command: 
   > ln -s /netopt/ncbi_tools64/c++.stable/ReleaseMT/bin/vecscreen .

   For users outside NCBI, we provide a gzipped executable of vecscreen
   for 64-bit Linux computers in
   scripts/vecscreen.gz
   Run
   gunzip vecscreen.gz
   chmod +x vecscreen

   and make sure that vecscreen is on the execution path (which it should
   be if you followed the steps above in SETTING UP ENVIRONMENT VARIABLES)

   vecscreen requires that the BLASTable database UniVec be
   accessible.  This means that these files must be in a directory
   that is contained in your $BLASTDB environment variable. If you
   followed the instructions in the SETTING UP ENVIRONMENT VARIABLES
   section above, you have already added the appropriate directory
   (PATH/TO/VEC/PLUS/univec-files) to $BLASTDB.  The UniVec database
   is represented (for purposes of vecscreen) in the three files:

   univec-files/UniVec.nhr
   univec-files/UniVec.nin
   univec-files/UniVec.nsq
   Do not try to edit the three UniVec files under any circumstances.

   See also the section COMPILING vecscreen AND srcchk ON LINUX, FROM SCRATCH

2. srcchk 
   Determines the taxonomy of input sequences, with respect the NCBI
   taxonomy tree.

   Within NCBI, srcchk can be found at
   /netopt/ncbi_tools64/bin/srcchk

   At NCBI, to add this directory to your path execute this command:
   > ln -s /netopt/ncbi_tools64/bin/srcchk .

   For users outside NCBI, we provide a gzipped executable of srcchk
   for 64-bit Linux computers in
   scripts/srcchk.gz
   Run
   gunzip srcchk.gz
   chmod +x srcchk

   and make sure that srcchk is on the execution path (which it should
   be if you followed the steps above in SETTING UP ENVIRONMENT VARIABLES)

   See also the section COMPILING vecscreen AND srcchk ON LINUX, FROM SCRATCH

The next 6 files were authored by Alejandro Schaffer:

3. from_vecscreen_to_summary.pl 
   This is the main script that coordinates the work by calling the
   other executables listed below.

4. parse_vecscreen.pl
   Auxiliary script called by from_vecscreen_to_summary.pl that parses
   the vecscreen output.
   
5. combine_summaries.pl
   Auxiliary script called by from_vecscreen_to_summary.pl that
   combines two different output formats of parse_vecscreen.pl

6. add_taxonomy.pl
   Auxiliary script called by from_vecscreen_to_summary.pl that adds
   taxonomy information to the output of parse_vecscreen.pl.

7. assign_levels_to_taxonomy.pl
   Independent script that is used to create a taxonomy file from the
   NCBI taxonomy. The file produced by assign_levels_to_taxonomy.pl is
   required input to from_vecscreen_to_summary.pl.

8. compare_vector_matches_wtaxa.pl
   Program to classify vecscreen matches that have already been parsed
   with from_vecscreen_to_summary.pl. Matches can be classified as:

 A. NO_DATA:           there is no data about the source of the vector
                       segment in the match.

 B. TRUE_ARTIFICIAL:   the vector segment matched is an ARTIFICIAL
                       sequence and hence the match is TRUE
                       contamination.

 C. TRUE_ARTIFICIAL_MICROSAT: the vector segment matched is an
                              ARTIFICIAL sequence and hence the match
                              is TRUE contamination and the vector
                              contains a microsatellite.

 D. FALSE_AMR:         the query sequence is bacterial; the vector
                       segment matches a known sequence that confers
                       anti-microbial resistance and these can often
                       be transferred horizontally between bacteria
                       that may be taxonomically distant.

 E. FALSE_BIOLOGICAL:  the subject's biological origin is known and
                       its taxid is deemed close enough to that of the
                       query, so that the match is not contamination.

 F. TRUE_BIOLOGICAL :  the query and the matching subject originate from
                       taxa that are too far apart for the vector to occur
                       plausibly in the query.

 G. TRUE_MICROSAT :    the query and the matching subject originate from
                       taxa that are too far apart for the vector to
                       occur plausibly in the query, and the vector
                       contains a microsatellite.

 H. LIKE_FALSE_BACTERIAL: the query is from uncultured bacteria and
                          the matching subject isfrom bacteria

Additionally, several sets of data files are required for
compare_vector_matches_wtaxa.pl to work. These were all included with
this software distribution. After cloning the github repository, these
files will be in the PATH/TO/VEC/PLUS/info-files directory.

1. taxonomy_tree_wlevels.txt 
   A compact form of NCBI's taxonomy tree with added fields to
   indicate for each taxid, its level and whether it is a descendant
   of the node Bacteria.  Descendants of Bacteria are treated
   specially because many sequences are now assigned to the generic
   taxid "Uncultured bacteria"

2. UniVec10_vs_amr_distinct_intervals.txt
   Intervals of vectors from UniVec version 10 that overlap with known
   antimicrobial resistance (AMR) regions.

3. artificial_intervals.txt, artificial_whole_sequences.txt
   Vector intervals or whole vectors that were generated in a
   laboratory, not from a biological source. Many of these are known
   also as "adaptors".

4. biological_exclusions.txt (which lists the following files)
   superkingdom_level_exclusions.txt
   kingdom_level_exclusions.txt
   phylum_level_exclusions.txt
   class_level_exclusions.txt
   order_level_exclusions.txt
   family_level_exclusions.txt
   tribe_level_exclusions.txt
   genus_level_exclusions.txt

   The last listed file genus_level_exclusions.txt describes the
   biological sources of vector segments at genus level. The other
   files summarize in silico sequence analysis that shows that some
   vector segments are conserved at seven taxonomic levels higher than
   genus. In these files, as well as artificial_intervals.txt:
   Column 1 is the vector segment using UniVec notation
   Columns 2 and 3 are the start and end of the interval
   Column 4 is the taxid
   Column 5 is either the Latin or English name for the taxid

5. Microsatellite_vectors.txt a list of vectors that contain
   microsatellites; Sequences that have known microsatellites and
   match to these vectors are classified specially as TRUE_MICROSAT or
   TRUE_ARTIFICAL_MICROSAT

Addtionally, the files in the 'test-files' directory created when
cloning the github repository all used by the
scripts/test_vector_plus_taxonomy.pl script for testing that your
installation and setup is working properly.

Finally, the file epn-options.pm is a perl module authored 
by Eric Nawrocki to handle command line options. It will be in the
top-level PATH/TO/VEC/PLUS directory after cloning the repository.

******************************
**CREATING THE TAXONOMY FILE**
******************************

This section describes how to create the taxonomy input file (passed
in to from_vecscreen_to_summary.pl with the --input_taxa <s> option)
using the NCBI taxonomy file.

NCBI's taxonomy is available from directory
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/.

We start from any of the files: 
taxdmp.tar.Z 
taxdump.tar.gz 
taxdmp.zip
from which one can extract the file nodes.dmp.

Then execute the following commands:

> cut -f1,3,5 nodes.dmp > taxonomy_tree.txt
> assign_levels_to_taxonomy.pl --input_taxa taxonomy_tree.txt --outfile taxonomy_tree_wlevels.txt

taxonomy_tree.txt will have three columns:
Column 1: taxid as an an integer
Column 2: parent taxid as an an integer
Column 3: rank (e.g., phylum) as a word

Running assign_levels_to_taxonomy.pl will add
-- a fourth column which is the level in the taxonomy tree, where
   the root has level 1 and each child taxid has a level one greater
   than the level of its parent taxid;
-- a fifth column that is 1 if the node is a descendant of
   Bacteria and 0 if not.

******************************
**AN ADDITIONAL TAXONOMY HELPER PROGRAM**
******************************
The repository for vecscreen_plus_taxonomy also includes the helper program
find_taxonomy_ancestors.pl

The purpose of this program is to solve the following taxonomy-related problem.
Given as input one or more accessions, what are the taxid ancestors of those
accessions at some specified taxonomy level, such as order or class.

Usage:

        find_taxonomy_ancestors.pl \
        --input_summary <input file of identifiers> \
        --input_taxa <input file with NCBI's taxonomy> \
        --input_level <desired taxonomy level> \
        --outfile <output file>

The input has three columns: 1) accession 2) taxid of accession
typically, at species level 3) taxonomy name of the taxid in column 2
The output repeats the input columns and adds a fourth column with the
taxid of the ancestor at the specified level.  If there is no ancestor
at the specified level, or if the taxid is not recognized, then the
value 1 is printed instead because 1 is the root of the taxonomy tree.

find_taxonomy_ancestors.pl is not currently used within
VecScreen_plus_taxonomy, but is used in a related project.

*********************************************************
**COMPILING vecscreen AND srcchk ON LINUX, FROM SCRATCH**
*********************************************************

We provide gzipped executables of vecscreen and srcchk for 64-bit Linux.
These have been tested on two non-NCBI computers, but it is not possible
to prove that they work on all 64-bit Linux computers that are sufficiently
up to date. In this context, sufficiently up to date means that all libraries
associated with gcc version 4.8 or higher are installed. The executables
provided were compiled with gcc version 4.91 because that was the least
recent version of gcc available among the versions >= 4.8.

srcchk and vecscreen are part of the NCBI C++ toolkit, which can
be downloaded at the time of writing from

ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/ncbi_cxx--18_0_0.tar.gz

The current version is 18.0.0, but the version may increase in the future.
Modify the above address as necessary to get the current version.
At this time, the retrieved file will be named ncbi_cxx--18_0_0.tar.gz.
If the version retrieved is higher, replace all occurrences of
18_0_0 below accordingly.

Run
gunzip ncbi_cxx--18_0_0.tar.gz
tar xvf ncbi_cxx--18_0_0.tar
cd ncbi_cxx--18_0_0

In case the user wants to compile vecscreen and srcchk from scratch,
we found that the following sets of steps works on Linux, but we
caution that compiling the NCBI C++ toolkit is a moving target and
instructions may need to change in the future. Let the token <GCC
version> represent the gcc version number without the decimal
points. For example, on the computer we used, the token would be 491.

After the following four commands are run, the executables for srcchk
and vecscreen should be found in the subirectory (of ncbi_cxx--18_0_0)
called GCC<GCC version>-DebugMT64/bin. 

The commands below will create files that collectively take about 25Gb
of disk space (as of current version). After the build is complete,
you can copy only the srcchk and vecscreen binaries from GCC<GCC
version>-DebugMT64/bin to PATH/TO/VEC/PLUS, or anywhere else you want,
and delete the rest of the files, if desired.

1) ./configure --without-gui --without-internal --without-boost --with-bin-release --with-flat-makefile

2) cd GCC<GCC version>-Debug

3) make -C build -f Makefile.flat all_files

4) make -C build -f Makefile.flat app/

*******************************************
Send any comments or questions to Alejandro Schaffer
([email protected])

About

Run vecscreen and interpret and annotate output matches taking taxonomy into account

Resources

Stars

Watchers

Forks

Packages

No packages published