-
Notifications
You must be signed in to change notification settings - Fork 0
aaschaffer/generate_vecscreen_candidates
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
generate_vecscreen_candidates Author: Alejandro Schaffer Documentation help and code review: Eric Nawrocki github: https://github.com/aaschaffer/generate_vecscreen_candidates.git Version: 0.03 August 2017 -------------------------- README This file describes two scripts called generate_run_blast_scripts.pl and filter_vecscreen_candidates.pl whose combined purpose is to identify plausible candidate sequences that may have vector contamination. The general idea is that given a vector, we can first identify any sequence in a database that has a sufficiently strong match to the vector. Second, we can apply a series of filters to exclude sequences that either intrinsically are unlikely to be contaminated or extrinsically have been analyzed previously. Outline of this file: BACKGROUND REQUIREMENTS REQUIREMENTS AND RELEVANT FILES TESTING THAT YOUR LOCAL INSTALLATION IS WORKING USAGE AND OPTIONS METHODS OUTPUT ************** **BACKGROUND** ************** vecscreen is the established NCBI program to identify matches between (query) sequences and (subject) vectors in UniVec. These matches may represent (true) vector contamination, but experience has shown that there can be many false positives. 1) Location: Internal or Terminal A match is Terminal if and only if it includes a nucleotide within 25 positions of either end of the query 2) Strength: Strong, Moderate, or Weak A match is Strong if either: it is terminal with a raw score of at least 24 or it is internal with a raw score of at least 30. A match is Moderate if either: it is terminal with a raw score in the interval [19,23] or internal with a raw score in the interval [25,29]. A match is Weak if it is terminal with a raw score in the interval [16,18] or internal with a raw score in the interval [23,24]. vecscreen also reports internal alignments with raw scores in the range [16, 22] when there is also a reportable match for the same (query, vector) pair. The score range [16,22] is below the Weak range match for internal matches. In the script from_vecscreen_to_summary.pl, these below-Weak Internal matches are assigned the level None. More information about vecscreen can be found at https://www.ncbi.nlm.nih.gov/tools/vecscreen/about/ In this document 'vecscreen' refers to the command-line version, not the Web page version. At the stage of making alignments, vecscreen uses the same alignment algorithm as blastn. From this algorithmic information and the definition of a Weak match, it follows that a necessary condition for a sequence S to have a reportable vecscreen match with vector V, the two must have a local alignment of raw score at least 16. This is not a sufficient condition because if the alignment is internal in S, and the score is < 23, it will not be reported. Importantly the sufficient condition does not depend on the database size or the sequence lengths, and uses the default blastn scoring system. Therefore, one can find all sequences within a BLASTable database D that meet the necessary condition by using V as a query to D. Testing with nucleotide non-redundant database (called here nr, also known as nt) shows that even though the condition is not sufficient, far fewer than 1% of all sequences in nr satisfy the necessary condition for any vector V among the over 5000 vectors in UniVec. Because vectors share segements, it is often the case that if S satisfies the alignment condition with one vector V1, it will also satisfy the condition with other vectors V2, V3,... having the same or overlapping alignments. Using the reverse query technique and some filtering based on expert (NCBI colleagues Richard McVeigh and Ilene Karsch-Mizrachi) knowledge it is possible to reach a manageable size set of sequences that have a non-negligible probability of having true positive vecscreen matches. Such sequences outoutput by this pipeline are called "candidate sequences" or "candidates". *********************************** **REQUIREMENTS AND RELEVANT FILES** *********************************** **THE generate_run_blast_scripts.pl AND filter_vecscreen_candidates.pl **SCRIPTS CAN ONLY BE RUN FROM THE DIRECTORY IN WHICH THEY **EXIST. ALSO, THE CURRENT DIRECTORY (".") MUST BE IN YOUR PATH Do not move the scripts out of this directory. There are two programs that are *NOT* included in this distribution that are required to be in your execution path. Those are: 1. blastn Nucleotide BLAST At NCBI, this is probably here: /usr/bin/blastn Make sure that it is in your execution PATH 2. srcchk Determines the taxonomy of input sequences, with respect the NCBI taxonomy tree. At NCBI, this is probably here: /netopt/ncbi_tools64/bin/srcchk Make sure that it is in your execution PATH To add a symlink to this program in your current directory, execute this command: > ln -s /netopt/ncbi_tools64/bin/srcchk . 3. qsub Submits blast jobs to the compute farm at the end of 'generate_run_blast_scripts.pl'. If you're at NCBI, this should just work. If not, then your 'qsub' commands may fail due to specific syntax required for your system. If you have problems/questions please email Alejandro Schaffer ([email protected]). Additional executable files that are included in this distribution are required for the process of generating candidates to work within your environment. The full list of files is: 1. generate_run_blast_scripts.pl This script takes as inputs a list of vectors and a BLASTable database. It prepares scripts to run blastn with each vector as a query and submits those scripts to the SGE compute farm. 2. filter_vecscreen_candidates.pl This script collects the matching sequences found by generate_run_blast_scripts.pl and reduces them to a shorter list by filtering in a sequence of steps. As its last main step, this script generates a FASTA-formatted file of all sequences that survived filtering. 3. extract_acc_from_blast_outputs.pl This script extracts the accessions of matching sequences from the outputs of generate_run_blast_scripts.pl 4. select_shorten_accessions.pl This script selects the accessions that come from GenBank (as opposed to DDBJ or EMBL) and shortens the identifiers. 5. make_filter_script.pl This script helps apply Entrez filters to exclude sequences that meet any of a list of criteria specified as Entrez queries (see examples below) 6. filter_by_taxon.pl This script filters out sequences originating from any of a list of pre-specified taxids from NCBI's taxonomy. 7. eliminate_analyzed_acc.pl This script eliminates accessions that of sequences that were previously analyzed (list supplied by the user), to avoid duplication of effort. 8. get_fasta_from_acc.sh This script takes a lists of accessions as its first argument and retrieves the FASTA for those accessions into the file given by the second argument. 9. run_script_tests.pl This program takes zero arguments and tests the other programs in the directory using files in the testfiles subdirectory. 10. vecscreen_candidate_generation.pm This is a perl module of procedures shared by some of the .pl perl programs listed above. 11. epn-options.pm A perl used module authored by Eric Nawrocki to handle command line options. *************************************************** **TESTING THAT YOUR LOCAL INSTALLATION IS WORKING** *************************************************** The file 'run_script_tests.pl' included in this distribution should be used to make sure that your scripts and local installation is working as expected. To do this test, go into the directory in which you unpacked the distribution and do: > perl run_script_tests.pl The script will run tests of both generate_run_blast_scripts.pl and filter_vecscreen_candidates.pl, and compare the output with expected output. The final line of output should be: SUCCESS: all output files were as expected. If that is true, then you should be able to run the scripts successfully on your system. If this is not the case, please email Alejandro Schaffer ([email protected]) One aspect of the scripts that is *NOT* tested by run_script_tests.pl is that the 'qsub' command generated in generate_run_blast_scripts.pl will correctly submit to your compute farm. To test this aspect, you will actually have to run generate_run_blast_scripts.pl without using the --wait option. ********************* **USAGE AND OPTIONS** ********************* The options that can be provided to generate_run_blast_scripts.pl are: --input <s> : Input list <s> of fasta files --outprefix <s> : Prefix <s> of names of output files --db <s> : database <s> for blastn (default nr) [nr] --verbose : verbose mode: output commands as they are executed --wait : do not submit jobs, just create qsub script and exit The first two options are required. The other three options are not required. By default --verbose and --wait are off. --input is used to specify the vectors; there should be one vector per FASTA file in the list; the order in which the vectors are listed is unimportant; there is currently no check for duplicates. A snippet of the --input file could look like this: ../AB009864.2:1386-1506.na ../AB009864.2:304-865.na ../AB009864.2:3344-3392-49.na ../AB009864.2:888-986.na ../AB013921.2:1-113.na ../AB013921.2:1852-1948.na ../AB013921.2:198-295.na Natural values for --outprefix are results blastresults blastmatches One should use a unique prefix for those output files, so that they can be easily collected by an ls command later. The options that can be provided to filter_vecscreen_candidates.pl are in two sets: basic options (required): --input_match_files <s> : File name <s> with list of files that contain BLAST matches between vectors and a database --input_filters_file <s> : File name <s> with Entrez queries to use as filters --input_tax_exclusion_file <s> : File name <s> with taxids to exclude --output <s> : Output file name <s> with candidate sequences in FASTA format other options (not required): --input_exclude_accessions <s> : input name <s> of accessions to exclude --verbose : be verbose in output --keep : keep all intermediate files (e.g. vecscreen output) Some sample filters for --input_filters_file 1:10000 [SLEN] NOT "gbdiv syn" [PROP] NOT "transposon" [TITL] NOT (retroviridae[ORGN] AND ("LTR"[TITL] OR "long terminal repeat"[TITL])) The taxids in --input_tax_exclusion_file are integer taxids from NCBI's taxonomy, one per row. The option --verbose determines how much diagnostic output is produced, while commands are being run. The option --keep determines whether the output file from running vecscreen within the script is kept (--keep on) or deleted (--keep off, default). *********** **METHODS** *********** generate_run_blast_scripts.pl finds all sequences in the specified db (-db) that may have a reportable vecscreen match, by characteristics of an alignment alone. The names of the output files from generate_run_blast_scripts.pl should be collected into a single file, by a UNIX command such as ls blast_results*.out > blast_results_files.txt The file of names, such as blast_results_files.txt is then used as the value of the argument --input_match_files to filter_vecscreen_candidates.pl The script filter_vecscreen_candidates.pl carries out the following main steps: 1) Extract the accession of the candidate sequences from all the results files. 2) Select those accessions that are from GenBank (not EMBL or DDBJ) and keep only the accession and version, not the GenBank identifier. 3) Apply all the Entrez filters specified in the --filters_file argument to get a reduced list of accessions. The full set of Entrez filters we are using is supplied as testfiles/filter_list_combined.txt and for testing purposes we provide a subset of that is in testfiles/input.filter_list.txt 4) Run the NCBI program srcchk to get taxonomy sources of all accessions surviving step 3. 5) Filter out any accessions coming from taxids excluded by the --input_tax_exclusion_file argument. 6) (Optional) If the argument --input_exclude_accessions is used, then exclude the accessions in that file. The purpose of this step is to avoid reanalyzing accessions that were already judged to be false positives. 7) Generate the FASTA file of all sequences whose accessions survive to be the input of this step. ********** **OUTPUT** ********** generate_run_blast_scripts.pl produces N output files, if there are N vector queries; each output file is in blastn format type 6 with the following columns: Column 1: Vector query accession Column 2: Matching sequence identifier including GenBank identifier (this is the candidate sequence that may be contaminated) Column 3: Percent identity of the alignment Column 4: Length of the alignment Columns 5,6: Mismatches and gaps Columns 7,8: Start and end positions in the vector query Columns 9,10: Start and end positions in the candidate sequence Column 11: E-value Column 12: Bitscore filter_vecscreen_candidates.pl produces one FASTA file containing all the sequences identified as candidates to be contaminated. This is the expected format for the next pipeline from_vecscreen_to_summary.pl (see https://github.com/aaschaffer/vecscreen_plus_taxonomy) ******************************************* Send any comments or questions to Alejandro Schaffer ([email protected])
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Packages 0
No packages published