-
Notifications
You must be signed in to change notification settings - Fork 2
aaschaffer/vecscreen_plus_taxonomy
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
vecscreen_plus_taxonomy Author: Alejandro Schaffer Documentation help and code review: Eric Nawrocki github: https://github.com/aaschaffer/vecscreen_plus_taxonomy.git Version: 0.17 December 2020 ------------------ README This file describes -- a script called from_vecscreen_to_summary.pl and its constituent programs and -- a summary classification script compare_vector_matches_wtaxa.pl that can be used to run vecscreen and parse its outputs, including taxonomy information. There is a related repository github: https://github.com/aaschaffer/generate_vecscreen_candidates.git Outline of this file: BACKGROUND SETTING UP ENVIRONMENT VARIABLES SAMPLE RUN, USAGE AND OUTPUT OF from_vecscreen_to_summary.pl SAMPLE RUN, USAGE AND OUTPUT OF compare_vector_matches_w_taxa.pl METHODS TEST SCRIPT RELEVANT FILES CREATING THE TAXONOMY FILE AN ADDITIONAL TAXONOMY HELPER PROGRAM COMPILING vecscreen AND srcchk ON LINUX, FROM SCRATCH ************** **BACKGROUND** ************** vecscreen is the established NCBI program to identify matches between (query) sequences and (subject) vectors in UniVec. These matches may represent (true) vector contamination, but experience has shown that there can be many false positives. A primary reason for false positives is that the query and the matching subject segment come from the same genus or closely related genera. Therefore, knowing the genus of the query sequence is helpful to interpret the output. For what follows below, it is relevant that vecscreen distinguishes matches by two characteristics: 1) Location: Internal or Terminal A match is Terminal if and only if it includes a nucleotide within 25 positions of either end of the query; otherwise, the match is Internal. 2) Strength: Strong, Moderate, or Weak A match is Strong if either: it is Terminal with a raw score of at least 24 or it is Internal with a raw score of at least 30. A match is Moderate if either: it is Terminal with a raw score in the interval [19,23] or Internal with a raw score in the interval [25,29]. A match is Weak if it is Terminal with a raw score in the interval [16,18] or Internal with a raw score in the interval [23,24]. vecscreen also reports Internal alignments with raw scores in the range [16, 22] when there is also a reportable match for the same (query, vector) pair. The score range [16,22] is below the Weak range match for Internal matches. In the script from_vecscreen_to_summary.pl, these below-Weak Internal matches are assigned the level None. More information about vecscreen can be found at https://www.ncbi.nlm.nih.gov/tools/vecscreen/about/ In this document, 'vecscreen' refers to the command-line version, not the Web page version. The programs and data in this vecscreen_plus_taxonomy repository are for prospective analysis of nucleotide sequences, especially sequences that one might want to submit to GenBank. The related repository github: https://github.com/aaschaffer/generate_vecscreen_candidates.git does a preprocessing step relevant to only retrospective analysis of sequences already in a database set up for BLAST, such as the non-redundant (nr) database in GenBank. In retrospective analysis, one wants to start by searching the database for sequences that may be expected to have vecscreen matches, so as to avoid running vecscreen on the entire database. ************************************ **SETTING UP ENVIRONMENT VARIABLES** ************************************ Before you can run from_vecscreen_to_summary.pl or compare_vector_matches_wtaxa.pl you will need to update some of your environment variables. To do this, add the following four lines to your .bashrc file (if you use bash shell) or .cshrc file (if you use C shell or tcsh). The .bashrc or .cshrc file will be in your home directory. To determine what shell you use type 'echo $SHELL', if it returns '/bin/bash', then update your .bashrc file, if it returns '/bin/csh' or '/bin/tcsh' then update your .cshrc file. The 4 lines to add to your .bashrc file, ***but make sure that you replace PATH/TO/VEC/PLUS with the directory path to the directory created when you cloned the vecscreen_plus_taxonomy github repository. ----------- export VECPLUSDIR="PATH/TO/VEC/PLUS" export PERL5LIB="$VECPLUSDIR:$PERL5LIB" export PATH="$VECPLUSDIR/scripts:$PATH" export BLASTDB="$VECPLUSDIR/univec-files:$BLASTDB" ----------- The 4 lines to add to your .cshrc file: ----------- setenv VECPLUSDIR "PATH/TO/VEC/PLUS" setenv PERL5LIB "$VECPLUSDIR":"$PERL5LIB" setenv PATH "$VECPLUSDIR/scripts":"$PATH" setenv BLASTDB "$VECPLUSDIR/univec-files":"$BLASTDB" ----------- Then, after adding those 4 lines, execute this command: source ~/.bashrc OR source ~/.cshrc If PERL5LIB was not already defined (you will know, if you get an error message when you run the above 'source' command): use instead export PERL5LIB="$VECPLUSDIR" for .bashrc, OR setenv PERL5LIB "$VECPLUSDIR" for .cshrc. at line 2 out of 4. Similarly, if BLASTDB was not already defined (you will know, if you get an error message when you run the above 'source' command): use instead export BLASTDB="$VECPLUSDIR/univec-files" for .bashrc, OR setenv BLASTDB "$VECPLUSDIR/univec-files" for .cshrc. at line 4 out of 4. To check that your environment variables are properly set up do the following four commands: echo $VECPLUSDIR echo $PERL5LIB echo $PATH echo $BLASTDB The first command should return only: PATH/TO/VEC/PLUS The second echo command should return a potentially longer string that begins with the same path: PATH/TO/VEC/PLUS The third echo command should return a (potentially longer) string that begins with: PATH/TO/VEC/PLUS/scripts The fourth echo command should return a (potentially longer) string that begins with: PATH/TO/VEC/PLUS/univec-files If that is not the case, please email Alejandro Schaffer ([email protected]). If you do see the expected output, the following sample run should work. Finally, you will need to gunzip two exectuable files in PATH/TO/VEC/PLUS/scripts, and make the files executable (in case they are not already executable). Perform the following commands: > cd $VECPLUSDIR/scripts > gunzip vecscreen.gz > gunzip srcchk.gz > chmod +x vecscreen > chmod +x srcchk After this, you should be able to successfully complete the sample runs below, as well as the test script in the TEST SCRIPT section below. **************************************************************** **SAMPLE RUN, USAGE AND OUTPUT OF from_vecscreen_to_summary.pl** **************************************************************** Here is an example command that will run from_vecscreen_to_summary.pl on the fasta file myseqs.fa using the taxonomy file included in PATH/TO/VEC/PLUS/info-files/taxonomy_tree_wlevels.txt. > from_vecscreen_to_summary.pl --output_root mytest --input_fasta $VECPLUSDIR/test-files/test.input_sequence_file.fa --input_taxa $VECPLUSDIR/info-files/taxonomy_tree_wlevels.txt --combine_output --verbose > mytest.out This command is contained in the file $VECPLUSDIR/scripts/sample_run.sh. In fact, it is better to just run that shell script file, because it will also check that the expected output is correct: > sh $VECPLUSDIR/scripts/sample_run.sh When you run the above 'sh' command, you should see output like this: -- comparing expected output test-files/expected.output_combined_wtaxonomy.txt to observed output mytest.output_combined_wtaxonomy.txt SUCCESS: Files are identical -- After running the above from_vecscreen_to_summary.pl command, the one-line per sequence output is in the file mytest.output_combined_wtaxonomy.txt. This 11-column format is described below under 'Verbose output format (enabled with --verbose option): 11 columns'. The output of the script to stdout, which describes briefly what the script is doing, is in the file mytest.out. The input sequence file ($VECPLUSDIR/test-files/test.input_sequence_file.fa) can be any nucleotide sequence file in FASTA format. The taxonomy file ($VECPLUSDIR/info-files/taxonomy_tree_wlevels.txt) is a file in a special format that includes taxonomy information based on NCBI's taxonomy. This can either be created by the user from the NCBI taxonomy file, or the user can use the provided file taxonomy_tree_wlevels.txt, which was created in April 2018. If you are using this in 2018, the provided file above should be fine. After that, you should create a new up-to-date file. Instructions for doing that are provided below in the section CREATING THE TAXONOMY FILE. The options that can be provided to from_vecscreen_to_summary.pl are: --input_fasta <s> : REQUIRED: file name <s> with sequences in fasta format --input_taxa <s> : REQUIRED: file name <s> mapping vecscreen matches to taxa --output_root <s> : REQUIRED: output files will be named starting with <s> --verbose : output 11 columns instead of 5 --combine_output : combine internal and terminal matches --keep : keep all intermediate files (e.g. vecscreen output) The --input_fasta <s> and --input_taxa <s> and --output_root <s> options are required when running from_vecscreen_to_summary.pl, while --verbose and --combine_output and --keep are optional. By default --verbose and --combine_output and --keep are all turned off. The --combine_output option determines whether one or two files of output are produced. Currently the names of the output files are partly fixed at: <output_root>.output_combined_wtaxonomy.txt (if --combine_output is applied) <output_root>.output_internal_wtaxonomy.txt (if --combine_output is not applied) <output_root>.output_terminal_wtaxonomy.txt (if --combine_output is not applied) The option --verbose determines how many columns of output are produced, as described in the OUTPUT section below. It does not cause more diagnostic output to be printed during the execution of the script. The extra columns are important for classifying some matches as true or false contamination. The option --keep determines whether the output file from running vecscreen within the script is kept (--keep used) or deleted (--keep not used, default). from_vecscreen_to_summary.pl will create one line of tabular output per vecscreen hit in the input sequence file. There are two possible output formats. For both formats, columns are separated by tabs. -------------------------------- Default output format: 5 columns -------------------------------- By default (if --verbose is not used) then the format of those lines will be the following five columns: Column 1: Accession of query Column 2: Genus of query if known, or 1 otherwise Column 3: Matching vector, starting with uv| Column 4: One end of the alignment in the vector Column 5: The other end of the alignment in the vector This 5 column format was agreed on for internal NCBI usage in JIRA ticket SM-187. ---------------------------------------------------------------- Verbose output format (enabled with --verbose option): 11 columns ---------------------------------------------------------------- If --verbose is used, then each line of output will include the following 11 columns: Column 1: Accession of query Column 2: Genus of query if known, or 1 otherwise Column 3: Species of query if known, or 1 otherwise Column 4: Lower end of the alignment in the query Column 5: Upper end of the alignment in the query Column 6: Matching vector, starting with uv| Column 7: One end of the alignment in the vector Column 8: The other end of the alignment in the vector Column 9: The strength of this vecscreen match Column 10: The strength of the strongest vecscreen match for this query Column 11: Whether there is any dangling part (called "Suspect" by vecscreen) at either end of the query A dangling part is an unmatched segment of <= 25 nucleotides. This alternative 11-column format has been shown to be useful for some purposes, such as correcting vector-contaminated sequences in GenBank. Another circumstance in which the 11-column format is essential is if there is an input sequence that has a known species (not 1 in column 3) but do not have a known genus (1 in column 2). ********************************************************************* **SAMPLE RUN, USAGE, AND OUTPUT OF compare_vector_matches_w_taxa.pl** ********************************************************************* The compare_vector_matches_w_taxa.pl script takes as input the one-line per-sequence output file from from_vecscreen_to_summary.pl. That input file must be the 11-column output of from_vecscreen_to_summary.pl that is created when the --verbose option is used for that script. Here is the example usage of compare_vector_matches_wtaxa.pl: compare_vector_matches_wtaxa.pl \ --input_summary $VECPLUSDIR/test-files/test.sample_input_final_step.txt \ --input_taxa $VECPLUSDIR/info-files/taxonomy_tree_wlevels.txt \ --input_artificial_vectors $VECPLUSDIR/info-files/artificial_whole_sequences.txt \ --input_artificial_segments $VECPLUSDIR/info-files/artificial_intervals.txt \ --input_univec_sources $VECPLUSDIR/info-files/biological_exclusions.txt \ --input_amr $VECPLUSDIR/info-files/UniVec10_vs_amr_distinct_intervals.txt \ --input_sequences $VECPLUSDIR/test-files/test.sample_candidates.fa \ --outfile my_output_final_step.txt This command is contained in the file $VECPLUSDIR/test-files/sample_compare_run.sh. As above, it's better to just run that shell script file, because it will also check that the expected output is correct: > sh $VECPLUSDIR/scripts/sample_compare_run.sh When you run the above 'sh' command, you should see output like this: -- comparing expected output test-files/expected.output_final_step.txt to observed output my_outputfinal_step.txt SUCCESS: Files are identical -- After running this compare_vector_matches_wtaxa.pl command, the file 'my_output_final_step.txt' will include the output of compare_vector_matches_wtaxa.pl which is the file test-files/sample_input_final_step.txt with 3 additional columns: Column 12: the classification of the match Column 13: Most pertinent taxid of the vector interval Column 14: Lowest common ancestor of column 2 and column 13 The possible classifications in column 12 are currently: NO_DATA, TRUE_ARTIFICIAL, TRUE_ARTIFICIAL_MICROSAT, FALSE_AMR, FALSE_BIOLOGICAL, TRUE_BIOLOGICAL, TRUE_MICROSAT, LIKELY_FALSE,BACTERIAL. These are explained below. In practice, one mainly wants to distinguish between: {TRUE_ARTIFICIAL, TRUE_ARTIFICIAL_MICROSAT, TRUE_BIOLOGICAL, TRUE_MICROSAT} which are true contamination, versus {FALSE_AMR, FALSE_BIOLOGICAL, LIKELY_FALSE,BACTERIAL}, which are false contamination. In some cases, the classification TRUE_BIOLOGICAL may be too bold and this can be seen because Column 14 is not much higher up the taxonomy tree than column 13. When this happens, the conservation of the vector source needs to be propagated from genus_level_exclusions.txt to one of the higher-level files*: superkingdom_level_exclusions.txt kingdom_level_exclusions.txt phylum_level_exclusions.txt class_level_exclusions.txt order_level_exclusions.txt family_level_exclusions.txt tribe_level_exclusions.txt * Please email [email protected] if you find any examples of vector sources that should be propagated to a higher level of taxonomy. *********** **METHODS** *********** Given the input taxonomy file and an input sequence file in FASTA format, from_vecscreen_to_summary.pl will do the following: 1) Run vecscreen on the input FASTA-formatted sequence file to identify high-scoring matches to known vector sequences in UniVec in the input sequence file. 2) Parse the vecscreen output into two tab-delimited files for internal and terminal matches by calling parse_vecscreen.pl. 3) Optionally combine the two summary files into one by calling combine_summaries.pl. 4) Add taxonomy information to the vecscreen output by calling srcchk and add_taxonomy.pl. compare_vector_matches_wtaxa.pl uses the six sets of data files listed to classify each vecscreen match. This program is separate because an in-house usage needed somewhat different I/O specifications to fit into an existing software framework. *************** **TEST SCRIPT** *************** There is a 'test' script included in the vecscreen_plus_taxonomy distribution that you should run to make sure that everything is set up correctly, in addition to doing the two example sample runs above. To run the test script, execute the following command: > $VECPLUSDIR/scripts/test_vecscreen_plus_taxonomy_scripts.pl You should see the following output: -- Checking that required input files exist ... done. Testing combine_summaries.pl ... done. Testing add_taxonomy.pl ... done. Testing from_vecscreen_to_summary.pl ... done. Testing compare_vector_matches_wtaxa.pl ... done. # All tests passed. # SUCCESS -- If you do not see this output, make sure that you've set up your environment variable $VECPLUSDIR correctly as explained above in the 'SETTING UP ENVIRONMENT VARIABLES' section. If you still have problems, email [email protected]. ****************** **RELEVANT FILES** ****************** Several executable files are required for from_vecscreen_to_summary.pl and compare_vector_matches_wtaxa.pl to work. Two of these executables were developed by others and must be downloaded and installed separately outside of NCBI. All of these files are included in the vecscreen_plus_taxonomy github repository, so you do not need to create or move any files in order to get from_vecscreen_to_summary.pl to work. After cloning the git repository, these files will be in the PATH/TO/VEC/PLUS/scripts directory. The first two files are NCBI executable programs that were not authored by Alejandro Schaffer: 1. vecscreen and associated UniVec database Identifies vector contamination in input sequences. Within NCBI, vecscreen can be found at /netopt/ncbi_tools64/c++.stable/ReleaseMT/bin/vecscreen At NCBI, to add this directory to your path execute this command: > ln -s /netopt/ncbi_tools64/c++.stable/ReleaseMT/bin/vecscreen . For users outside NCBI, we provide a gzipped executable of vecscreen for 64-bit Linux computers in scripts/vecscreen.gz Run gunzip vecscreen.gz chmod +x vecscreen and make sure that vecscreen is on the execution path (which it should be if you followed the steps above in SETTING UP ENVIRONMENT VARIABLES) vecscreen requires that the BLASTable database UniVec be accessible. This means that these files must be in a directory that is contained in your $BLASTDB environment variable. If you followed the instructions in the SETTING UP ENVIRONMENT VARIABLES section above, you have already added the appropriate directory (PATH/TO/VEC/PLUS/univec-files) to $BLASTDB. The UniVec database is represented (for purposes of vecscreen) in the three files: univec-files/UniVec.nhr univec-files/UniVec.nin univec-files/UniVec.nsq Do not try to edit the three UniVec files under any circumstances. See also the section COMPILING vecscreen AND srcchk ON LINUX, FROM SCRATCH 2. srcchk Determines the taxonomy of input sequences, with respect the NCBI taxonomy tree. Within NCBI, srcchk can be found at /netopt/ncbi_tools64/bin/srcchk At NCBI, to add this directory to your path execute this command: > ln -s /netopt/ncbi_tools64/bin/srcchk . For users outside NCBI, we provide a gzipped executable of srcchk for 64-bit Linux computers in scripts/srcchk.gz Run gunzip srcchk.gz chmod +x srcchk and make sure that srcchk is on the execution path (which it should be if you followed the steps above in SETTING UP ENVIRONMENT VARIABLES) See also the section COMPILING vecscreen AND srcchk ON LINUX, FROM SCRATCH The next 6 files were authored by Alejandro Schaffer: 3. from_vecscreen_to_summary.pl This is the main script that coordinates the work by calling the other executables listed below. 4. parse_vecscreen.pl Auxiliary script called by from_vecscreen_to_summary.pl that parses the vecscreen output. 5. combine_summaries.pl Auxiliary script called by from_vecscreen_to_summary.pl that combines two different output formats of parse_vecscreen.pl 6. add_taxonomy.pl Auxiliary script called by from_vecscreen_to_summary.pl that adds taxonomy information to the output of parse_vecscreen.pl. 7. assign_levels_to_taxonomy.pl Independent script that is used to create a taxonomy file from the NCBI taxonomy. The file produced by assign_levels_to_taxonomy.pl is required input to from_vecscreen_to_summary.pl. 8. compare_vector_matches_wtaxa.pl Program to classify vecscreen matches that have already been parsed with from_vecscreen_to_summary.pl. Matches can be classified as: A. NO_DATA: there is no data about the source of the vector segment in the match. B. TRUE_ARTIFICIAL: the vector segment matched is an ARTIFICIAL sequence and hence the match is TRUE contamination. C. TRUE_ARTIFICIAL_MICROSAT: the vector segment matched is an ARTIFICIAL sequence and hence the match is TRUE contamination and the vector contains a microsatellite. D. FALSE_AMR: the query sequence is bacterial; the vector segment matches a known sequence that confers anti-microbial resistance and these can often be transferred horizontally between bacteria that may be taxonomically distant. E. FALSE_BIOLOGICAL: the subject's biological origin is known and its taxid is deemed close enough to that of the query, so that the match is not contamination. F. TRUE_BIOLOGICAL : the query and the matching subject originate from taxa that are too far apart for the vector to occur plausibly in the query. G. TRUE_MICROSAT : the query and the matching subject originate from taxa that are too far apart for the vector to occur plausibly in the query, and the vector contains a microsatellite. H. LIKE_FALSE_BACTERIAL: the query is from uncultured bacteria and the matching subject isfrom bacteria Additionally, several sets of data files are required for compare_vector_matches_wtaxa.pl to work. These were all included with this software distribution. After cloning the github repository, these files will be in the PATH/TO/VEC/PLUS/info-files directory. 1. taxonomy_tree_wlevels.txt A compact form of NCBI's taxonomy tree with added fields to indicate for each taxid, its level and whether it is a descendant of the node Bacteria. Descendants of Bacteria are treated specially because many sequences are now assigned to the generic taxid "Uncultured bacteria" 2. UniVec10_vs_amr_distinct_intervals.txt Intervals of vectors from UniVec version 10 that overlap with known antimicrobial resistance (AMR) regions. 3. artificial_intervals.txt, artificial_whole_sequences.txt Vector intervals or whole vectors that were generated in a laboratory, not from a biological source. Many of these are known also as "adaptors". 4. biological_exclusions.txt (which lists the following files) superkingdom_level_exclusions.txt kingdom_level_exclusions.txt phylum_level_exclusions.txt class_level_exclusions.txt order_level_exclusions.txt family_level_exclusions.txt tribe_level_exclusions.txt genus_level_exclusions.txt The last listed file genus_level_exclusions.txt describes the biological sources of vector segments at genus level. The other files summarize in silico sequence analysis that shows that some vector segments are conserved at seven taxonomic levels higher than genus. In these files, as well as artificial_intervals.txt: Column 1 is the vector segment using UniVec notation Columns 2 and 3 are the start and end of the interval Column 4 is the taxid Column 5 is either the Latin or English name for the taxid 5. Microsatellite_vectors.txt a list of vectors that contain microsatellites; Sequences that have known microsatellites and match to these vectors are classified specially as TRUE_MICROSAT or TRUE_ARTIFICAL_MICROSAT Addtionally, the files in the 'test-files' directory created when cloning the github repository all used by the scripts/test_vector_plus_taxonomy.pl script for testing that your installation and setup is working properly. Finally, the file epn-options.pm is a perl module authored by Eric Nawrocki to handle command line options. It will be in the top-level PATH/TO/VEC/PLUS directory after cloning the repository. ****************************** **CREATING THE TAXONOMY FILE** ****************************** This section describes how to create the taxonomy input file (passed in to from_vecscreen_to_summary.pl with the --input_taxa <s> option) using the NCBI taxonomy file. NCBI's taxonomy is available from directory ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/. We start from any of the files: taxdmp.tar.Z taxdump.tar.gz taxdmp.zip from which one can extract the file nodes.dmp. Then execute the following commands: > cut -f1,3,5 nodes.dmp > taxonomy_tree.txt > assign_levels_to_taxonomy.pl --input_taxa taxonomy_tree.txt --outfile taxonomy_tree_wlevels.txt taxonomy_tree.txt will have three columns: Column 1: taxid as an an integer Column 2: parent taxid as an an integer Column 3: rank (e.g., phylum) as a word Running assign_levels_to_taxonomy.pl will add -- a fourth column which is the level in the taxonomy tree, where the root has level 1 and each child taxid has a level one greater than the level of its parent taxid; -- a fifth column that is 1 if the node is a descendant of Bacteria and 0 if not. ****************************** **AN ADDITIONAL TAXONOMY HELPER PROGRAM** ****************************** The repository for vecscreen_plus_taxonomy also includes the helper program find_taxonomy_ancestors.pl The purpose of this program is to solve the following taxonomy-related problem. Given as input one or more accessions, what are the taxid ancestors of those accessions at some specified taxonomy level, such as order or class. Usage: find_taxonomy_ancestors.pl \ --input_summary <input file of identifiers> \ --input_taxa <input file with NCBI's taxonomy> \ --input_level <desired taxonomy level> \ --outfile <output file> The input has three columns: 1) accession 2) taxid of accession typically, at species level 3) taxonomy name of the taxid in column 2 The output repeats the input columns and adds a fourth column with the taxid of the ancestor at the specified level. If there is no ancestor at the specified level, or if the taxid is not recognized, then the value 1 is printed instead because 1 is the root of the taxonomy tree. find_taxonomy_ancestors.pl is not currently used within VecScreen_plus_taxonomy, but is used in a related project. ********************************************************* **COMPILING vecscreen AND srcchk ON LINUX, FROM SCRATCH** ********************************************************* We provide gzipped executables of vecscreen and srcchk for 64-bit Linux. These have been tested on two non-NCBI computers, but it is not possible to prove that they work on all 64-bit Linux computers that are sufficiently up to date. In this context, sufficiently up to date means that all libraries associated with gcc version 4.8 or higher are installed. The executables provided were compiled with gcc version 4.91 because that was the least recent version of gcc available among the versions >= 4.8. srcchk and vecscreen are part of the NCBI C++ toolkit, which can be downloaded at the time of writing from ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/ncbi_cxx--18_0_0.tar.gz The current version is 18.0.0, but the version may increase in the future. Modify the above address as necessary to get the current version. At this time, the retrieved file will be named ncbi_cxx--18_0_0.tar.gz. If the version retrieved is higher, replace all occurrences of 18_0_0 below accordingly. Run gunzip ncbi_cxx--18_0_0.tar.gz tar xvf ncbi_cxx--18_0_0.tar cd ncbi_cxx--18_0_0 In case the user wants to compile vecscreen and srcchk from scratch, we found that the following sets of steps works on Linux, but we caution that compiling the NCBI C++ toolkit is a moving target and instructions may need to change in the future. Let the token <GCC version> represent the gcc version number without the decimal points. For example, on the computer we used, the token would be 491. After the following four commands are run, the executables for srcchk and vecscreen should be found in the subirectory (of ncbi_cxx--18_0_0) called GCC<GCC version>-DebugMT64/bin. The commands below will create files that collectively take about 25Gb of disk space (as of current version). After the build is complete, you can copy only the srcchk and vecscreen binaries from GCC<GCC version>-DebugMT64/bin to PATH/TO/VEC/PLUS, or anywhere else you want, and delete the rest of the files, if desired. 1) ./configure --without-gui --without-internal --without-boost --with-bin-release --with-flat-makefile 2) cd GCC<GCC version>-Debug 3) make -C build -f Makefile.flat all_files 4) make -C build -f Makefile.flat app/ ******************************************* Send any comments or questions to Alejandro Schaffer ([email protected])
About
Run vecscreen and interpret and annotate output matches taking taxonomy into account
Resources
Stars
Watchers
Forks
Packages 0
No packages published