-
Notifications
You must be signed in to change notification settings - Fork 24
Mpox virus annotation
The steps below explain how to use VADR for MPXV validation and
annotation. These are only recommendations. Testing of VADR for MPXV
is still ongoing and these recommendations are subject to change,
possibly significantly. Many 'good' MPXV sequences that do not
contain any sequencing or assembly artifacts may 'fail'
v-annotate.pl
using the steps below.
GenBank is not currently using VADR to automatically process MPXV sequence submissions like it does for SARS-CoV-2. GenBank MPXV sequence submissions are still all manually reviewed, but indexers might use VADR as part of their analysis, using the command-line options listed in step 4 below.
WARNING: VADR is slow for long sequences like MPXV and in rare cases it can take 30 minutes or more to annotate some MPXV sequences that differ from the RefSeq NC_063383 in certain ways. The vast majority of sequences should take less than one minute.
Steps for using VADR for MPXV annotation:
-
Download and install the latest version of VADR (v1.5), following the instructions on this page. Alternatively, you can use the StaPH-B VADR 1.5 docker image created by Curtis Kapsak (docker image names:
staphb/vadr:1.5
andstaphb/vadr:latest
), available on dockerhub and quay. A brief README for the docker image is here. -
Download the latest MPXV VADR models (version 1.4.2-1, gzipped tarball) from here, unpack them (e.g.
tar xfz <tarball.gz>
). Note the path to the directory name created (<mpxv-models-dir-path>
) for step 3. (If you are using the docker image you can skip this step as the MPXV VADR models are included.) -
WARNING: the
fasta-trim-terminal-ambigs.pl
script will not exactly reproduce the trimming that the GenBank pipeline does in some rare cases, but should fix the large majority of the discrepancies you might see between local VADR results and GenBank results.To remove terminal ambiguous nucleotides from your sequence file
<input-fasta-file>
and to remove short and long sequences to create a new trimmed file<trimmed-fasta-file>
, execute:
$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 50 --maxlen 210000 <input-fasta-file> > <trimmed-fasta-file>
-
Run the
v-annotate.pl
program on an input trimmed fasta file with MPXV sequences using the recommended command and options below (the command is long so you will likely have to scroll to the right to view the entire command).NOTE: The following command runs multithreaded on up to 4 CPUs, and so is only recommended if you have at least 4 CPUs and 16Gb RAM available. To run on
<n>
CPUs instead, replace--cpu 4
with--cpu <n>
. To run single threaded on a single CPU remove the--cpu 4
option. The--split
and--cpu
options are incompatible with-p
.
v-annotate.pl --split --cpu 4 --glsearch --minimap2 -s -r --nomisc --mkey mpxv --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin --s_overhang 150 --mdir <mpxv-models-dir-path> <fasta-file-to-annotate> <output-directory-to-create>
As of July 15, 2022, the VADR model library for Mpox annotation
(vadr-models-mpxv-1.4.2-1
model
file
includes a single MPXV model based on the
NC_063383
RefSeq sequence of length 197,209 nt.
This model was created using the following command with VADR v1.4.2 on July 13, 2022:
v-build.pl --forcelong -f --skipbuild NC_063383 NC_063383
Many of the output NC_063383 files were then renamed to use 'mpxv'
instead of 'NC_063383'. Additionally, the first line of mpxv.minfo
file was
changed from
MODEL NC_063383 blastdb:"NC_063383.vadr.protein.fa" length:"197209"
to
MODEL NC_063383 blastdb:"NC_063383.vadr.protein.fa" length:"197209" group:"Orthopoxvirus" subgroup:"Monkeypox_virus" indfstrn_exc:"1..6422:+;190788..197209:+;"
The indfstrn_exc
part of the addition makes is that an indfstrn
(indefinite strand) alert is not reported due to the repeated sequence
at the beginning and end of the mpox genome
(positions 1 to 6422 and the complement of positions 190788 to 197209).
This section shows output from an example v-annotate.pl
annotation of three
MPXV sequences from GenBank using the above command and options.
The fasta file of those three sequences can be downloaded here.
(A similar example for norovirus sequences, which may contain more details on certain aspects, is here.)
For this example, the MPXV model directory is in /usr/local/vadr-models-mpxv-1.4.2-1
and the pretrim.mpxv.3.fa
sequence file is in the current directory. We will create a new file
mpxv.3.fa
with trimmed sequences in the next step.
As explained above, remove terminal ambiguous nucleotides and sequences that are shorter than 50nt or longer than 210,000nt with the command:
$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 50 --maxlen 210000 pretrim.mpxv.3.fa > mpxv.3.fa
Next, to annotate the trimmed sequences using the above
v-annotate.pl
options for MPXV, run the following command (scroll to the
right to see full command), which will create a new directory called
my3
into which VADR output files will be written.
v-annotate.pl --split --cpu 4 --glsearch --minimap2 -s -r --nomisc --mkey mpxv --r_lowsimok --r_lowsimxd 100 --r_lowsimxl 2000 --alt_pass discontn,dupregin --s_overhang 150 --mdir /usr/local/vadr-models-mpxv-1.4.2-1 mpxv.3.fa my3
The options used are explained further below.
When you execute the above command, you should see output similar to the following block that lists relevant environment variable values, and input arguments and options:
# v-annotate.pl :: classify and annotate sequences using a model library
# VADR 1.5 (Sep 2022)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date: Thu Sep 29 11:26:10 2022
# $VADRBIOEASELDIR: /usr/local/vadr-install-1.5/Bio-Easel
# $VADRBLASTDIR: /usr/local/vadr-install-1.5/ncbi-blast/bin
# $VADREASELDIR: /usr/local/vadr-install-1.5/infernal/binaries
# $VADRINFERNALDIR: /usr/local/vadr-install-1.5/infernal/binaries
# $VADRMINIMAP2DIR: /usr/local/vadr-install-1.5/minimap2
# $VADRMODELDIR: /usr/local/vadr-install-1.5/vadr-models-calici
# $VADRSCRIPTSDIR: /usr/local/vadr-install-1.5/vadr
#
# sequence file: mpxv.3.fa
# output directory: my3
# specify that alert codes in <s> do not cause FAILure: discontn,dupregin [--alt_pass]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr': mpxv [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR: /usr/local/vadr-models-mpxv-1.4.2-1 [--mdir]
# in feature table for failed seqs, never change feature type to misc_feature: yes [--nomisc]
# align with glsearch from the FASTA package, not to a cm with cmalign: yes [--glsearch]
# use top-scoring HSP from blastn to seed the alignment: yes [-s]
# for -s, set length of nt overhang for subseqs to align to <n>: 150 [--s_overhang]
# use minimap2 to derive seed and use it if >= length of blastn seed: yes [--minimap2]
# replace stretches of Ns with expected nts, where possible: yes [-r]
# with -r, do not report lowsim{5,3,i}s within identified N-rich regions: yes [--r_lowsimok]
# w/--r_lowsimok, maximum length of N-rich region is <n>: 2000 [--r_lowsimxl]
# w/--r_lowsimok, max diff from expected length for N-rich region is <n>: 100 [--r_lowsimxd]
# split input file into chunks, run each chunk separately: yes [--split]
# parallelize across <n> CPU workers (requires --split or --glsearch): 4 [--cpu]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Next, v-annotate.pl
will output information as it proceeds through different steps of the analysis:
# Validating input ... done. [ 0.4 seconds]
# Splitting sequence file into chunks to run independently in parallel on 4 processors ... done. [ 0.5 seconds]
# Executing 3 scripts in parallel on 4 processors to process 3 partition(s) of all 3 sequence(s) ...
# 3 of 3 jobs finished (0.1 minutes spent waiting)
# done. [ 46.9 seconds]
# Merging and finalizing output ... done. [ 3.1 seconds]
With the --split --cpu 4
options, the input fasta script is split up
into chunks, in this case placing one sequence per chunk but typically
about 2 sequences per chunk for larger files, and runs
v-annotate.pl
separately on those chunks on 4 different CPUs in
parallel. When all sequences are finished processing the main script
merges the output together.
The v-annotate.pl
output then includes a summary of the classification of sequences, and the alerts reported:
# Summary of classified sequences:
#
# num num num
#idx model group subgroup seqs pass fail
#--- --------- ------------- --------------- ---- ---- ----
1 NC_063383 Orthopoxvirus Monkeypox_virus 3 2 1
#--- --------- ------------- --------------- ---- ---- ----
- *all* - - 3 2 1
- *none* - - 0 0 0
#--- --------- ------------- --------------- ---- ---- ----
#
# Summary of reported alerts:
#
# alert causes short per num num long
#idx code failure description type cases seqs description
#--- -------- ------- -------------------------- -------- ----- ---- -----------
1 dupregin no DUPLICATE_REGIONS sequence 73 3 similarity to a model region occurs more than once
2 discontn no DISCONTINUOUS_SIMILARITY sequence 3 3 not all hits are in the same order in the sequence and the homology model
3 ambgntrp no N_RICH_REGION_NOT_REPLACED sequence 4 3 N-rich region of unexpected length not replaced
#--- -------- ------- -------------------------- -------- ----- ---- -----------
4 indf3pst yes INDEFINITE_ANNOTATION_END feature 1 1 protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint
#--- -------- ------- -------------------------- -------- ----- ---- -----------
And finally a list of the output files created:
# Output printed to screen saved in: my3.vadr.log
# List of executed commands saved in: my3.vadr.cmd
# List and description of all output files saved in: my3.vadr.filelist
# esl-seqstat -a output for input fasta file saved in: my3.vadr.seqstat
# 5 column feature table output for passing sequences saved in: my3.vadr.pass.tbl
# 5 column feature table output for failing sequences saved in: my3.vadr.fail.tbl
# list of passing sequences saved in: my3.vadr.pass.list
# list of failing sequences saved in: my3.vadr.fail.list
# list of alerts in the feature tables saved in: my3.vadr.alt.list
# fasta file with passing sequences saved in: my3.vadr.pass.fa
# fasta file with failing sequences saved in: my3.vadr.fail.fa
# per-sequence tabular annotation summary file saved in: my3.vadr.sqa
# per-sequence tabular classification summary file saved in: my3.vadr.sqc
# per-feature tabular summary file saved in: my3.vadr.ftr
# per-model-segment tabular summary file saved in: my3.vadr.sgm
# per-alert tabular summary file saved in: my3.vadr.alt
# alert count tabular summary file saved in: my3.vadr.alc
# per-model tabular summary file saved in: my3.vadr.mdl
# alignment doctoring tabular summary file saved in: my3.vadr.dcr
# seed alignment summary file (-s) saved in: my3.vadr.sda
# replaced stretches of Ns summary file (-r) saved in: my3.vadr.rpn
#
# All output files created in directory ./my3/
#
# Elapsed time: 00:00:49.90
# hh:mm:ss
#
[ok]
Note that all the output files will be in the newly created my3
directory.
The Summary of classified sequences
listed that two sequences passed and one failed.
The file my3.vadr.pass.list
, lists the two sequences that passed:
ON853669.1
ON843168.1
and my3.vadr.fail.list
lists the sequence that failed:
ON803440.1
Also, FASTA-formatted sequence files for each the passing and failing
sequences are my3.vadr.pass.fa
and my3.vadr.fail.fa
.
For the two sequences that passed, the annotation is available in the
output my3.vadr.pass.tbl
file and for the two sequences that failed the
annotation is in the my3.vadr.fail.tbl
file.
Here is the output for the first sequence in the my3.vadr.pass.tbl
file: (with the middle of the table for each sequence removed for
brevity)
>Feature ON853669.1
1572 832 gene
gene OPG001
1572 832 CDS
product Chemokine binding protein
protein_id ON853669.1_1
2748 1699 gene
gene OPG002
2748 1699 CDS
product Crm-B secreted TNF-alpha-receptor-like protein
protein_id ON853669.1_2
4604 2838 gene
gene OPG003
4604 2838 CDS
product Ankyrin repeat protein (25)
protein_id ON853669.1_3
...snip...
191058 192371 CDS
product Ankyrin repeat protein (39)
protein_id ON843168.1-like_176
192607 194373 gene
gene OPG003
192607 194373 CDS
product Ankyrin repeat protein (25)
protein_id ON843168.1-like_177
194463 195512 gene
gene OPG002
194463 195512 CDS
product Crm-B secreted TNF-alpha-receptor-like protein
protein_id ON843168.1-like_178
195639 196379 gene
gene OPG001
195639 196379 CDS
product Chemokine binding protein
protein_id ON843168.1-like_179
And the sequence in the my3.vadr.fail.tbl
(with the middle removed for brevity):
>Feature ON803440.1
1572 832 gene
gene OPG001
1572 832 CDS
product Chemokine binding protein
protein_id ON803440.1_1
2748 1699 gene
gene OPG002
2748 1699 CDS
product Crm-B secreted TNF-alpha-receptor-like protein
protein_id ON803440.1_2
4604 2838 gene
gene OPG003
4604 2838 CDS
product Ankyrin repeat protein (25)
protein_id ON803440.1_3
...snip...
192596 194362 CDS
product Ankyrin repeat protein (25)
protein_id ON803440.1_177
194452 195501 gene
gene OPG002
194452 195501 CDS
product Crm-B secreted TNF-alpha-receptor-like protein
protein_id ON803440.1_178
195628 196368 gene
gene OPG001
195628 196368 CDS
product Chemokine binding protein
protein_id ON803440.1_179
Additional note(s) to submitter:
ERROR: INDEFINITE_ANNOTATION_END: (CDS:CPXV166 protein) protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [30>8, valid stop codon in nucleotide-based prediction]; seq-coords:140165..140194:+; mdl-coords:140205..140205:+; mdl:NC_063383;
Feature table format is described at https://www.ncbi.nlm.nih.gov/WebSub/html/help/feature-table.html.
Note that the end of the fail.tbl
file lists ERRORs for
ON803440.1
. In this case, the 3' end of the CPXV166 protein is not
as similar as expected with the NC_063383 RefSeq protein. This
doesn't mean that there is a problem with a sequence, just that this
is one possible difference with the RefSeq that might indicate a
problem. It also might represent normal sequence divergence in the
virus. Many of the VADR ERRORs are meant to highlight potential
problems for manual review or regions of interest to the user.
The annotation information is also available in other files with
different formats, such as the my3/my3.vadr.ftr
file, which may be
easier to parse for some applications.
For examples of alerts/errors and more information on how to interpret the
VADR output related to those alerts in the output feature tables, .alt.list
file and the GenBank submission portal detailed error report .tsv
files, see this vadr documentation page.
See the vadr README documentation section for more information on how to interpret VADR results and output, including information on file formats.
You can find information on the VADR 1.0 paper and a submitted manuscript that describes some of the acceleration strategies that VADR uses for viruses with relatively long genomes (below).
The options used in the above command are the recommended set of options for MPXV annotation. These options are currently used by GenBank indexers when they run VADR for validating and annotating incoming MPXV sequence submissions. (All submissions are currently manually reviewed, no automated processing/approval using VADR is in place for MPXV sequences at this time.) The options are each briefly explained in the table below. More information can be found here.
option | explanation |
---|---|
--split |
split input file into chunks of about 300Kb and run each chunk separately (300Kb can be changed to <n> by adding option --nkb <n>
|
--cpu 4 |
for input sequence files > 300Kb, run multi-threaded by parallelizing across up to 4 CPU workers (4 can be changed to <n1> with --cpu <n1> ), requires --split
|
--glsearch |
use the glsearch program from the FASTA package for the alignment stage, required for MPXV annotation |
-s |
turn on the seed acceleration heuristic: use the max length ungapped region from blastn to seed the alignment, required for MPXV annotation |
-r |
turn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible |
--minimap2 |
use the minimap2 program to derive seeds and use them instead of blastn-based seeds if they are longer |
--nomisc |
specify that features for failing sequences not be changed to misc_features in the output .tbl file |
--mkey mpxv |
use the model files with prefix mpxv in the directory from --mdir
|
--r_lowsimok |
do not report LOW_SIMILARITY* errors due to N-rich regions |
--r_lowsimxl 2000 |
for --r_lowsimok , LOW_SIMILARITY* errors will not be reported only for low similarity regions of at most 2000nt |
--r_lowsimxd 100 |
for --r_lowsimok , LOW_SIMILARITY* errors will not be reported only for low similarity regions in which the difference in expected length compared to the model RefSeq is at most 100nt |
--alt_pass discont,dupregin |
specify that discontn and dupregion alerts that occur due to repetitive regions, which are common in MPXV sequences, do NOT cause a sequence to fail |
--s_overhang 150 |
specify that the number of nucleotides overlap between the seed and flanking regions aligned with glsearch be 150 nt |
--mdir /usr/local/vadr-models-mpxv-1.4.2-1 |
specify that the models to use are in directory /usr/local/vadr-models-mpxv-1.4.2-1 |
The -r
option is also used for SARS-CoV-2 annotation with VADR, for more information on the option in that context, see
this page
The -s
option is also used for SARS-CoV-2 annotation with VADR, for more information on the option in that context, see
this page
- VADR README
- VADR installation instructions
-
v-build.pl
example usage and command-line options -
v-annotate.pl
example usage, command-line options and alert information -
Explanations and examples of
v-annotate.pl
detailed alert and error messages- Output fields with detailed alert and error messages
- Explanation of sequence and model coordinate fields in
.alt
files toy50
toy model used in examples of alert messages- Examples of different alert types and corresponding
.alt
output - Posterior probability annotation in VADR output Stockholm alignments
- VADR output file formats
- Using VADR for SARS-CoV-2 annotation
-
The recommended citation for using VADR is: Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki; VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211 (2020). https://doi.org/10.1186/s12859-020-3537-3
-
A submitted manuscript describing changes made to VADR for faster SARS-CoV-2 annotation which are also applied to MPXV sequences in the sequences above: Eric P Nawrocki; Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR. bioRxiv (2022). https://doi.org/10.1101/2022.04.25.489427