The goal is to encode AMRrules for the following types of AMR variants:
- Gene presence detected
- Amino acid substitution or insertion
- Nucleotide substitution or insertion
- Gene truncated (loss of function)
- Mutation in promoter region (substitution, deletion or insertion, including IS)
- Gene copy number changes
- Mutations in multi-copy genes (e.g. 23S rRNA)
- Low frequency variants (i.e. heterozygosity)
Where possible, we aim to encode the mutations in an HGVS-compliant way.
All submitted examples could be adequately addressed using a combination of ‘gene’, ‘mutation’ (based on HGVS syntax, with some modifications) and ‘variation type’ (based on the hAMRonization field 'Genetic Variation Type', with some additions).
Specific examples of each AMR variant are shown below, with the proposed mutation syntax and variation type for each. Other fields required for rule definition (e.g. organism, refseq accession, context, PMID) are omitted here for simplicity, as they are not needed to illustrate how a specific kind of variation is defined:
ID | gene | mutation | variation type | drug | category |
---|---|---|---|---|---|
KPN0001 | blaSHV | - | Gene presence detected | ampicillin | wt R |
KPN0002 | gyrA | p.Ser83Tyr | Protein variant detected | ciprofloxacin | nwt I |
KPN0003 | parC | p.Ser80Ile | Protein variant detected | ciprofloxacin | nwt I |
KPN0004 | ompK36 | c.25C>T | Nucleotide variant detected | meropenem | nwt S |
KPN0005 | ompK36 | p.114_115insGlyAsp | Protein variant detected | meropenem | nwt I |
KPN0006 | mgrB | p.(1_100) | Gene truncation detected | colistin | nwt R |
KPN0007 | qnr | - | Gene presence detected | ciprofloxacin | nwt I |
NGO0001 | mtrR | - | Inactivating mutation detected | macrolides | nwt R |
KPN0008 | mgrB | p.Glu30* | Protein variant detected | colistin | nwt R |
ECO0001 | ampC | c.-11C>T | Promoter variant detected | ceftriaxone | nwt R |
ECO0002 | ampC | c.-14_-13insGT | Promoter variant detected | ceftriaxone | nwt R |
ACI0001 | blaOXA-58 | c.(-35_1)ins[ISAba125:inv] | Promoter variant detected | ceftriaxone | nwt R |
NGO0002 | 23S rDNA | c.[2045A>G][3] | Nucleotide variant detected in multi-copy gene | azithromycin | nwt R |
ECO0003 | blaTEM | c.[3] | Gene copy number variant detected | piperacillin+tazobactam | nwt R |
MTC0001 | gyrA | p.[Ala94Gly][0.13] | Low frequency variant detected | ciprofloxacin | nwt R |
Syntax for ‘mutation’ column - follows HGVS, including:
- Gene and protein start sites are position 1 (there is no position 0)
- Ranges are specified using `x_y`; for insertions the coordinates are specified as inclusive_exclusive, otherwise ranges are inclusive_inclusive
- Unknown ranges are specified with parentheses, `(x_y)`. E.g. `p.(1_100)insGlyAsp` means an insertion of 2 amino acids (Gly and Asp) anywhere between codons 1 and 100 inclusive (as opposed to a replacement of amino acids 1 through 100 with GlyAsp, which would be expressed as `p.1_100delinsGlyAsp`).
- Coordinates are specified relative to the reference sequence of a protein (p) or coding sequence (c)
- Coordinates upstream of coding sequence are specified relative to the start site, with a hyphen, e.g. `c.-35` indicates 35 bp upstream
- Mutations in protein and DNA are specified differently, e.g.
  - `p.Ser83Tyr`: change to protein sequence from Ser to Tyr at codon 83
  - `c.25C>T`: change to nucleotide coding region from C to T at nucleotide position 25
- Stop codons are specified (in both DNA and protein variants) as `*`
- Following IUPAC, `X` signifies any amino acid, `N` signifies any DNA base
- `^` (caret) is used as "or", e.g. `p.(Gly719Ala^Ser)`
- The letters `inv` indicate the inverse (i.e. reverse complement) of a sequence
- Repeat sequences are specified as `sequence[N]` where `N` is the number of copies of the repeat
- AMRrules requires amino acids be specified as three-letter codes (whereas HGVS allows single-letter or three-letter codes)
- In HGVS you must specify the reference sequence explicitly using a sequence accession, followed by `:` and then the mutation, e.g. `NF000285.3:p.Gly238Ser`. In AMRrules the gene is specified in separate column/s (‘gene’, ‘refseq accession’, ‘ARO accession’) and should not be repeated in the mutation column. So the above rule should be coded as:
  - gene = `blaSHV`
  - refseq accession = `NF000285.3`
  - ARO accession = `ARO:3000015`
  - mutation = `p.Gly238Ser`
- In AMRrules, insertion sequences (IS) should be labelled with their IS name as per ISfinder, as many do not have their own sequence accessions in refseq. E.g. insertion of ISAba125 should be specified as `ins[ISAba125]`, and insertion in reverse orientation to the gene to which the rule applies should be specified as `ins[ISAba125:inv]`.
- In AMRrules, rules intended to apply when a gene is present in a minimum of N copies can be specified using the `[N]` syntax to indicate the minimum repeat/copy number of the whole coding sequence, as `c.[N]`.
  - Note this syntax does not convey any information about the location of the copies, e.g. `c.[2]` simply indicates that there are at least 2 copies of the gene detected in the genome, whether they are tandem repeats or in different replicons such as one in the chromosome and one in a plasmid.
- In HGVS, the presence of multiple alleles (i.e. heterozygous) is specified as a semicolon-separated list of allelic variants, e.g. `[allele1];[allele2]`.
- In AMRrules, rules that apply to variation in a multi-copy gene can be specified in this way, with each allele explicitly stated.
  - Alternatively, if the rule applies when a minimum of N copies of the gene carry the mutation (e.g. mutation in ≥3 copies of 23S rRNA resulting in resistance to azithromycin), this can be abbreviated using the `[N]` syntax to indicate the minimum repeat/copy number, as `c.[allele][N]` or `p.[allele][N]`, e.g. `c.[2045A>G][3]`.
- In AMRrules, for rules that apply to ‘low frequency variants’ (i.e. when a minimum fraction of reads, X, supports presence of the allelic variant in a sequenced population), the minimum fraction can be specified by extension of the syntax for copy number, as `[X]`. E.g. `p.[Ala94Gly][0.13]` (example from the Mycobacterium tuberculosis gyrA gene).
  - To put it another way, in AMRrules the repeat syntax `[X]` is interpreted as a minimum copy number if `X` is an integer, and as a minimum read fraction if `X` is a double/float between 0 and 1.
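The integer-versus-fraction interpretation of the bracketed `[X]` suffix can be sketched in Python as follows. This is only an illustration, not part of the AMRrules specification: the names `Threshold` and `parse_threshold`, the regular expression, and the choice to treat the read fraction as strictly between 0 and 1 are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern for "<p|c>.[<allele>][<X>]" or "<p|c>.[<X>]".
# The allele group is optional so that "c.[3]" (whole-gene copy number) parses.
_BRACKETED = re.compile(
    r"^(?P<level>[pc])\.(?:\[(?P<allele>[^\[\]]+)\])?\[(?P<x>[0-9]*\.?[0-9]+)\]$"
)


@dataclass
class Threshold:
    level: str             # "p" (protein) or "c" (coding sequence)
    allele: Optional[str]  # None when the rule counts copies of the whole gene
    kind: str              # "min_copies" or "min_read_fraction"
    value: float           # an int for copy numbers, a float for read fractions


def parse_threshold(mutation: str) -> Threshold:
    """Interpret the trailing [X] as a copy number (integer) or read fraction (float)."""
    m = _BRACKETED.match(mutation.strip())
    if m is None:
        raise ValueError(f"not a bracketed threshold expression: {mutation!r}")
    raw = m.group("x")
    if raw.isdigit():
        # Integer X: minimum number of copies
        return Threshold(m.group("level"), m.group("allele"), "min_copies", int(raw))
    value = float(raw)
    # Assumption: a read fraction must lie strictly between 0 and 1
    if not 0 < value < 1:
        raise ValueError(f"read fraction must be between 0 and 1: {raw}")
    return Threshold(m.group("level"), m.group("allele"), "min_read_fraction", value)
```

For instance, `parse_threshold("c.[2045A>G][3]")` would yield a minimum copy number of 3 for the allele `2045A>G`, while `parse_threshold("p.[Ala94Gly][0.13]")` would yield a minimum read fraction of 0.13.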
- `p.Ser83Tyr`: change to protein sequence from Ser to Tyr at codon 83
- `c.25C>T`: change to nucleotide coding region from C to T at nucleotide position 25
- `p.114_115insGlyAsp`: change to protein sequence, with an insertion of amino acids Gly and Asp between codons 114 and 115
- `p.(1_100)`: truncation (of any kind) anywhere in the first 100 amino acids of the protein sequence
- `c.-11C>T`: change to nucleotide sequence from C to T, 11 bases upstream of the start site of the gene
- `c.-14_-13insGT`: insertion of nucleotides GT between positions -14 and -13, upstream of the start site of the gene
- `c.(-35_1)ins[ISAba125:inv]`: insertion of ISAba125, in reverse orientation (`:inv`), anywhere between 35 bases upstream of the start site and the start of the gene coding sequence
- `c.[2045A>G][3]`: substitution of A to G at position 2045 of the gene. This mutation must occur in a minimum of 3 copies
- `c.[3]`: gene needs to be present with a minimum of 3 copies
- `p.[Ala94Gly][0.13]`: protein variant is present in at least 13% of reads
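The coordinate conventions used in these examples (positions start at 1, there is no position 0, and upstream positions are negative) can be made concrete with a small Python sketch. The function names `cds_offset` and `parse_nucleotide_substitution` are hypothetical, and the regular expression only covers simple single-base substitutions such as `c.25C>T`; it is not a general HGVS parser.

```python
import re

# Assumption: only simple coding-sequence substitutions of the form c.<pos><ref>><alt>
_SUBSTITUTION = re.compile(r"^c\.(-?\d+)([ACGT])>([ACGT])$")


def cds_offset(position: int) -> int:
    """Convert a 1-based coordinate (no position 0) to a 0-based offset
    from the first base of the coding sequence; upstream offsets are negative."""
    if position == 0:
        raise ValueError("there is no position 0 in this coordinate system")
    return position - 1 if position > 0 else position


def parse_nucleotide_substitution(mutation: str):
    """Return (0-based offset, reference base, alternate base)."""
    m = _SUBSTITUTION.match(mutation)
    if m is None:
        raise ValueError(f"not a simple nucleotide substitution: {mutation!r}")
    return cds_offset(int(m.group(1))), m.group(2), m.group(3)
```

Under these assumptions, `c.25C>T` maps to offset 24 within the coding sequence, and `c.-11C>T` maps to offset -11, i.e. 11 bases before the first base of the gene.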
Combinatorial rules are defined using logical expressions in the ‘gene’ column, where the operands of the expression are rule identifiers (‘ruleID’) that serve as shorthand labels for the variants defined by ‘gene’:‘mutation’ (‘variation type’) in the corresponding rules. The ‘variation type’ should be specified as ‘Combination’.
- Each rule must have a unique ‘ruleID’, assigned by the curating subgroup and prefixed with a 3-letter code that identifies the subgroup.
- E.g. in the table below, `KPN0008` can be used in a logical expression in the ‘gene’ column to denote `gyrA:p.Ser83Tyr`, and `KPN0013` can be used to denote `qnr (Gene presence detected)`.
- So, the combination of these two variants can be specified as `KPN0008 & KPN0013`, which expands to `gyrA:p.Ser83Tyr & qnr (Gene presence detected)`.
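The expansion of rule identifiers into their variant labels can be sketched in Python as below. The `RULES` dictionary, the helper names `label` and `expand`, and the assumed ruleID pattern (three capital letters followed by four digits, as in the examples here) are illustrative assumptions, not part of the specification.

```python
import re

# Hypothetical lookup: ruleID -> (gene, mutation, variation type)
RULES = {
    "KPN0008": ("gyrA", "p.Ser83Tyr", "Protein variant detected"),
    "KPN0013": ("qnr", "-", "Gene presence detected"),
}

# Assumption: ruleIDs look like a 3-letter prefix plus 4 digits
RULE_ID = re.compile(r"[A-Z]{3}\d{4}")


def label(gene: str, mutation: str, variation_type: str) -> str:
    """Format a rule as gene:mutation, or gene (variation type) when no mutation is given."""
    if mutation == "-":
        return f"{gene} ({variation_type})"
    return f"{gene}:{mutation}"


def expand(expression: str, rules: dict) -> str:
    """Replace every ruleID in a logical expression with its variant label."""
    return RULE_ID.sub(lambda m: label(*rules[m.group(0)]), expression)


print(expand("KPN0008 & KPN0013", RULES))
# gyrA:p.Ser83Tyr & qnr (Gene presence detected)
```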
Rules must be specified explicitly if the effect of the combination is NOT the same as the ‘most resistant’ (in terms of exceeding breakpoints, R > I > S; or deviation from wt, nwt > wt) predicted category of the component rules. E.g. in the table below:
- The individual rules `KPN0008` and `KPN0009` each have expected category ‘nwt I’ on their own, but in combination we expect ‘nwt R’, so we need to specify the rule for the combination `KPN0008 & KPN0009`.
- The expected category for genomes meeting rule `KPN0002` (i.e. carrying core gene oqxA, => wt S) in addition to rule `KPN0008` (i.e. with an acquired gyrA mutation, => nwt I) is nwt I. This is the same as the category of the most resistant component rule (`KPN0008`), so we do not need to specify the combination explicitly.
Note this means the combination must also be specified explicitly if the combined effect is LESS resistant than the ‘most resistant’ component. E.g. in one example from TB, a deletion in one gene renders the resistance mutation in another gene irrelevant, so the combination must be specified.
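The "most resistant" default described above can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, categories are assumed to be strings of the form `"<wt|nwt> <S|I|R>"`, and the two orderings (nwt > wt and R > I > S) are assumed to be applied independently to each part of the category.

```python
# Orderings taken from the text: R > I > S and nwt > wt
PHENOTYPE_RANK = {"S": 0, "I": 1, "R": 2}
WT_RANK = {"wt": 0, "nwt": 1}


def default_category(components):
    """Category implied for a combination when no explicit rule exists."""
    # Split each "wt S"-style category into its two parts
    statuses, phenotypes = zip(*(c.split() for c in components))
    status = max(statuses, key=WT_RANK.__getitem__)
    phenotype = max(phenotypes, key=PHENOTYPE_RANK.__getitem__)
    return f"{status} {phenotype}"


def needs_explicit_rule(components, combined):
    """True if the expected combined category differs from the default,
    whether it is more or less resistant."""
    return default_category(components) != combined
```

With these definitions, two ‘nwt I’ components with an expected combined category of ‘nwt R’ need an explicit combination rule, whereas ‘wt S’ plus ‘nwt I’ giving ‘nwt I’ does not.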
ID | gene | mutation | variation type | drug | category |
---|---|---|---|---|---|
KPN0002 | oqxA | - | Gene presence detected | ciprofloxacin | wt S |
KPN0008 | gyrA | p.Ser83Tyr | Protein variant detected | ciprofloxacin | nwt I |
KPN0009 | parC | p.Ser80Ile | Protein variant detected | ciprofloxacin | nwt I |
KPN0013 | qnr | - | Gene presence detected | ciprofloxacin | nwt I |
KPN0051 | KPN0008 & KPN0009 | - | Combination | ciprofloxacin | nwt R |
KPN0052 | (KPN0008 \| KPN0009) & KPN0013 | - | Combination | ciprofloxacin | nwt R |
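To illustrate how a combination expression from the ‘gene’ column might be checked against the set of rules a genome satisfies, here is a minimal Python sketch of a recursive-descent evaluator. The function names, the ruleID pattern, and the operator precedence (`&` binding tighter than `|`, with parentheses overriding) are assumptions made for this example; the source does not define precedence.

```python
import re

# Assumption: tokens are ruleIDs (3 letters + 4 digits), parentheses, & and |
TOKEN = re.compile(r"\s*([A-Z]{3}\d{4}|[()&|])")


def tokenize(expr):
    """Split an expression into ruleIDs, operators and parentheses."""
    expr = expr.rstrip()
    tokens, pos = [], 0
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if m is None:
            raise ValueError(f"unexpected text at position {pos}: {expr[pos:]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def evaluate(expr, matched):
    """Return True if the set of matched ruleIDs satisfies the expression."""
    tokens = tokenize(expr)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if tok in ")&|":
            raise ValueError(f"unexpected token {tok!r}")
        # A ruleID is true when the genome matched that rule
        return tok in matched

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result
```

For example, a genome matching `KPN0009` and `KPN0013` would satisfy `(KPN0008 | KPN0009) & KPN0013`, while a genome matching only `KPN0013` would not.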