Purpose

The goal is to encode AMRrules for the following types of AMR variants:

  • Gene presence detected
  • Amino acid substitution or insertion
  • Nucleotide substitution or insertion
  • Gene truncated (loss of function)
  • Mutation in promoter region (substitution, deletion or insertion, including IS)
  • Gene copy number changes
  • Mutations in multi-copy genes (e.g. 23S rRNA)
  • Low frequency variants (i.e. heterozygosity)

Where possible, we aim to encode mutations in an HGVS-compliant way.

‘mutation’ syntax and ‘variation type’

All submitted examples could be adequately addressed using a combination of three fields: ‘gene’, ‘mutation’ (based on HGVS syntax, with some modifications) and ‘variation type’ (based on the hAMRonization field 'Genetic Variation Type', with some additions).

Specific examples of each AMR variant are shown below, with the proposed mutation syntax and variation type for each. Other fields required for rule definition (such as organism, refseq accession, context and PMID) are omitted here for simplicity, as they are not needed to illustrate how a specific kind of variation is defined:

| ID | gene | mutation | variation type | drug | category |
|---|---|---|---|---|---|
| KPN0001 | blaSHV | - | Gene presence detected | ampicillin | wt R |
| KPN0002 | gyrA | p.Ser83Tyr | Protein variant detected | ciprofloxacin | nwt I |
| KPN0003 | parC | p.Ser80Ile | Protein variant detected | ciprofloxacin | nwt I |
| KPN0004 | ompK36 | c.25C>T | Nucleotide variant detected | meropenem | nwt S |
| KPN0005 | ompK36 | p.114_115insGlyAsp | Protein variant detected | meropenem | nwt I |
| KPN0006 | mgrB | p.(1_100) | Gene truncation detected | colistin | nwt R |
| KPN0007 | qnr | - | Gene presence detected | ciprofloxacin | nwt I |
| NGO0001 | mtrR | - | Inactivating mutation detected | macrolides | nwt R |
| KPN0008 | mgrB | p.Glu30* | Protein variant detected | colistin | nwt R |
| ECO0001 | ampC | c.-11C>T | Promoter variant detected | ceftriaxone | nwt R |
| ECO0002 | ampC | c.-14_-13insGT | Promoter variant detected | ceftriaxone | nwt R |
| ACI0001 | blaOXA-58 | c.(-35_1)ins[ISAba125:inv] | Promoter variant detected | ceftriaxone | nwt R |
| NGO0002 | 23S rDNA | c.[2045A>G][3] | Nucleotide variant detected in multi-copy gene | azithromycin | nwt R |
| ECO0003 | blaTEM | c.[3] | Gene copy number variant detected | piperacillin+tazobactam | nwt R |
| MTC0001 | gyrA | p.[Ala94Gly][0.13] | Low frequency variant detected | ciprofloxacin | nwt R |
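A rule row like those above could be held in a small record type. The following is a minimal sketch only: the `Rule` class, its attribute names, the `parse_row` helper and the tab-separated input format are all assumptions made for illustration and are not part of the AMRrules specification. It also assumes the category column always has the form "phenotype, space, clinical category" (e.g. `nwt I`), as in the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for one row of the example table."""
    rule_id: str
    gene: str
    mutation: str          # "-" when the rule has no mutation (e.g. gene presence)
    variation_type: str
    drug: str
    category: str          # e.g. "nwt I"

    @property
    def phenotype(self) -> str:
        # First word of the category: "wt" or "nwt"
        return self.category.split()[0]

    @property
    def clinical_category(self) -> str:
        # Second word of the category: "S", "I" or "R"
        return self.category.split()[1]


def parse_row(row: str) -> Rule:
    """Build a Rule from one tab-separated line (assumed input format)."""
    fields = [field.strip() for field in row.split("\t")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return Rule(*fields)


rule = parse_row("KPN0002\tgyrA\tp.Ser83Tyr\tProtein variant detected\tciprofloxacin\tnwt I")
print(rule.gene, rule.mutation, rule.phenotype, rule.clinical_category)
```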

Syntax for ‘mutation’ column - follows HGVS, including:

  • Gene and protein start sites are position 1 (there is no position 0)
  • Ranges are specified using x_y; for insertions the coordinates are specified as inclusive_exclusive, otherwise ranges are inclusive_inclusive
  • Unknown ranges are specified with parentheses, (x_y). E.g. p.(1_100)insGlyAsp means an insertion of 2 amino acids (Gly and Asp) anywhere between codons 1 and 100 inclusive (as opposed to a replacement of amino acids 1 through 100 with GlyAsp, which would be expressed as p.1_100delinsGlyAsp).
  • Coordinates are specified relative to the reference sequence of a protein (p) or coding sequence (c)
  • Coordinates upstream of coding sequence are specified relative to the start site, with a hyphen, e.g. c.-35 indicates 35 bp upstream
  • Mutations in protein and DNA are specified differently, e.g.
    • p.Ser83Tyr: change to protein sequence from Ser to Tyr at codon 83
    • c.25C>T: change to nucleotide coding region from C to T at nucleotide position 25
  • Stop codons are specified (in both DNA and protein variants) as *
  • Following IUPAC, X signifies any amino acid, N signifies any DNA base
  • ^ (caret) is used as "or", e.g. p.(Gly719Ala^Ser)
  • The letters inv indicate the inverse (i.e. reverse complement) of a sequence
  • Repeat sequences are specified as sequence[N] where N is the number of copies of the repeat
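To make the substitution conventions above concrete, here is a minimal sketch of a parser for simple substitutions only (`p.Ser83Tyr`, `c.25C>T`, `c.-11C>T`, `p.Glu30*`). The function name `parse_substitution`, the returned dictionary keys and the regular expressions are illustrative assumptions, not part of HGVS or AMRrules; ranges, insertions, `^` alternatives and bracketed forms are deliberately not handled.

```python
import re

# Three-letter amino acid code, position, then three-letter code or "*" (stop).
PROTEIN_SUB = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2}|\*)$")
# Optional leading hyphen for upstream positions, then REF>ALT bases.
NUCLEOTIDE_SUB = re.compile(r"^c\.(-?\d+)([ACGTN])>([ACGTN])$")


def parse_substitution(mutation: str) -> dict:
    """Parse a simple protein or nucleotide substitution into its parts."""
    match = PROTEIN_SUB.match(mutation)
    if match:
        ref, pos, alt = match.groups()
        level, position = "protein", int(pos)
    else:
        match = NUCLEOTIDE_SUB.match(mutation)
        if not match:
            raise ValueError(f"not a simple substitution: {mutation!r}")
        pos, ref, alt = match.groups()
        level, position = "nucleotide", int(pos)
    if position == 0:
        # Start sites are position 1; there is no position 0.
        raise ValueError("position 0 does not exist")
    return {
        "level": level,
        "position": position,
        "ref": ref,
        "alt": alt,
        "upstream": position < 0,  # negative coordinates lie upstream of the start site
    }


for example in ("p.Ser83Tyr", "c.25C>T", "c.-11C>T", "p.Glu30*"):
    print(example, parse_substitution(example))
```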

Syntax for ‘mutation’ column - specific to AMRrules:

  • AMRrules requires amino acids be specified as three-letter codes (whereas HGVS allows single-letter or three-letter codes)
  • In HGVS you must specify the reference sequence explicitly using a sequence accession, followed by : and then the mutation, e.g. NF000285.3:p.Gly238Ser. In AMRrules the gene is specified in separate column/s (‘gene’, ‘refseq accession’, ‘ARO accession’) and should not be repeated in the mutation column. So the above rule should be coded as:
    • gene = blaSHV
    • refseq accession = NF000285.3
    • ARO accession = ARO:3000015
    • mutation = p.Gly238Ser
  • In AMRrules, insertion sequences (IS) should be labelled with their IS name as per ISfinder, as many do not have their own sequence accessions in refseq. E.g. insertion of ISAba125 should be specified as ins[ISAba125], and insertion in reverse orientation to the gene to which the rule applies should be specified as ins[ISAba125:inv].
  • In AMRrules, rules intended to apply when a gene is present in a minimum of N copies can be specified using the [N] syntax to indicate the minimum repeat/copy number of the whole coding sequence, as c.[N].
    • Note this syntax does not convey any information about the location of the copies, i.e. c.[2] simply indicates that there are at least 2 copies of the gene detected in the genome, whether they are tandem repeats or in different replicons such as one in the chromosome and one in a plasmid.
  • In HGVS, the presence of multiple alleles (i.e. heterozygous) is specified as a semicolon-separated list of allelic variants e.g. [allele1];[allele2].
  • In AMRrules, rules that apply to variation in a multi-copy gene can be specified in this way, with each allele explicitly stated.
    • Alternatively if the rule applies when a minimum of N copies of the gene carry the mutation (e.g. mutation in ≥3 copies of 23S rRNA resulting in resistance to azithromycin), this can be abbreviated using the [N] syntax to indicate the minimum repeat/copy number, as c.[allele][N] or p.[allele][N], e.g. c.[2045A>G][3].
  • In AMRrules, for rules that apply to ‘low frequency variants’, i.e. when a minimum fraction of reads, X, supports presence of the allelic variant in a sequenced population, the minimum fraction can be specified by extension of the syntax for copy number, as [X]. E.g. p.[Ala94Gly][0.13] (example from the Mycobacterium tuberculosis gyrA gene).
    • Put another way, in AMRrules the repeat syntax [X] is interpreted as a minimum copy number if X is an integer, and as a minimum read fraction if X is a double/float between 0 and 1.
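The integer-versus-fraction interpretation of `[X]` described above can be sketched as follows. This is an illustration under stated assumptions: the function name `parse_threshold`, the returned keys, and the choice to detect a fraction by the presence of a decimal point are hypothetical. Treating the bounds 0 and 1 as exclusive is also an assumption, since the text only says "between 0 and 1".

```python
import re

# c.[N], c.[allele][X] or p.[allele][X]; the allele bracket is optional.
BRACKET_SUFFIX = re.compile(
    r"^(?P<level>[cp])\."
    r"(?:\[(?P<allele>[^\]]+)\])?"
    r"\[(?P<x>\d+(?:\.\d+)?)\]$"
)


def parse_threshold(mutation: str) -> dict:
    """Interpret the trailing [X] as a minimum copy number or read fraction."""
    match = BRACKET_SUFFIX.match(mutation)
    if not match:
        raise ValueError(f"no [X] suffix found in {mutation!r}")
    result = {"level": match.group("level"), "allele": match.group("allele")}
    x = match.group("x")
    if "." in x:
        fraction = float(x)
        if not 0 < fraction < 1:  # assumption: bounds are exclusive
            raise ValueError(f"read fraction out of range: {x}")
        result["min_read_fraction"] = fraction
    else:
        copies = int(x)
        if copies < 1:
            raise ValueError(f"copy number must be at least 1: {x}")
        result["min_copies"] = copies
    return result


for example in ("c.[3]", "c.[2045A>G][3]", "p.[Ala94Gly][0.13]"):
    print(example, parse_threshold(example))
```

Here `c.[3]` yields an `allele` of `None`, reflecting that the copy-number threshold applies to the whole coding sequence rather than to a specific allele.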

Examples of ‘mutation’ syntax relevant to known AMR variants

p.Ser83Tyr: change to protein sequence from Ser to Tyr at codon 83

c.25C>T: change to nucleotide coding region from C to T at nucleotide position 25

p.114_115insGlyAsp: change to protein sequence, with an insertion of amino acids Gly and Asp between codons 114 and 115

p.(1_100): truncation (of any kind) anywhere in the first 100 amino acids of the protein sequence

c.-11C>T: change to nucleotide sequence from C to T, 11 bases upstream of the start site for the gene.

c.-14_-13insGT: insertion of nucleotides GT between positions -14 and -13, upstream of the start site of the gene

c.(-35_1)ins[ISAba125:inv]: insertion of ISAba125, in reverse orientation (:inv), anywhere between 35 bases upstream of the start site, and the start of the gene coding sequence

c.[2045A>G][3]: substitution of A to G at position 2045 of the gene. This mutation must occur in a minimum of 3 copies of the gene

c.[3]: gene needs to be present with a minimum of 3 copies

p.[Ala94Gly][0.13]: protein variant is present in at least 13% of reads
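The insertion examples above (known position, unknown range in parentheses, and named IS elements with optional `:inv`) can be decomposed with a short parser. This is a sketch only; `parse_insertion`, its regular expressions and its output keys are assumptions introduced here for illustration and do not cover every form HGVS permits.

```python
import re

# Optional parentheses mark an unknown position within the range start_end.
INSERTION = re.compile(
    r"^(?P<level>[cp])\."
    r"(?P<open>\()?(?P<start>-?\d+)_(?P<end>-?\d+)(?P<close>\))?"
    r"ins(?P<inserted>.+)$"
)
# Named insertion sequence in brackets, optionally in reverse orientation.
IS_ELEMENT = re.compile(r"^\[(?P<name>[^:\]]+)(?P<inv>:inv)?\]$")


def parse_insertion(mutation: str) -> dict:
    """Split an insertion into range, inserted sequence and IS details."""
    match = INSERTION.match(mutation)
    if not match or bool(match.group("open")) != bool(match.group("close")):
        raise ValueError(f"not a recognised insertion: {mutation!r}")
    inserted = match.group("inserted")
    is_match = IS_ELEMENT.match(inserted)
    return {
        "level": match.group("level"),
        "start": int(match.group("start")),
        "end": int(match.group("end")),
        "uncertain_range": match.group("open") is not None,
        "inserted": is_match.group("name") if is_match else inserted,
        "is_element": is_match is not None,
        "inverted": bool(is_match and is_match.group("inv")),
    }


for example in ("p.114_115insGlyAsp", "c.-14_-13insGT", "c.(-35_1)ins[ISAba125:inv]"):
    print(example, parse_insertion(example))
```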

Combinatorial rules

Combinatorial rules are defined using logical expressions in the ‘gene’ column, where the objects of the expression are rule identifiers (‘ruleID’) that can be used as shorthand labels for the variants defined by ‘gene’:‘mutation’ (‘variation type’) specified in the corresponding rules. The ‘variation type’ should be specified as ‘Combination’.

  • Each rule must have a unique ‘ruleID’, assigned by the curating subgroup and prefixed with a 3-letter code that identifies the subgroup.
  • E.g. in the table below, KPN0008 can be used in a logical expression in the ‘gene’ column to denote gyrA:p.Ser83Tyr, and KPN0013 can be used to denote qnr (Gene presence detected).
  • So, the combination of these two variants can be specified as KPN0008 & KPN0013, which expands to gyrA:p.Ser83Tyr & qnr (Gene presence detected).
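As an illustration of how such expressions might be evaluated, the sketch below checks a ‘gene’ column expression against the set of rule IDs already matched in a genome. The function names, the rule ID pattern (three uppercase letters followed by digits) and the choice that `&` binds more tightly than `|` are assumptions made here; the specification text above does not define operator precedence, so parentheses are the safer way to write mixed expressions.

```python
import re

# A rule ID (assumed: 3 uppercase letters + digits), an operator or a parenthesis.
TOKEN = re.compile(r"\s*([A-Z]{3}\d+|[&|()])")


def tokenize(expression: str) -> list:
    tokens, pos = [], 0
    expression = expression.strip()
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"unexpected text: {expression[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expression: str, matched: set) -> bool:
    """Return True if the expression holds for the matched rule IDs."""
    tokens = tokenize(expression)
    pos = 0

    def parse_or() -> bool:
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and() -> bool:
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in {"&", "|", ")"}:
            raise ValueError(f"unexpected token: {token!r}")
        return token in matched  # a rule ID is true if that rule matched

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


print(evaluate("KPN0008 & KPN0013", {"KPN0008", "KPN0013"}))
print(evaluate("(KPN0008 | KPN0009) & KPN0013", {"KPN0009"}))
```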

Rules must be specified explicitly if the effect of the combination is NOT the same as the ‘most resistant’ (in terms of exceeding breakpoints, R > I > S; or deviation from wt, nwt > wt) predicted category of the component rules. E.g. in the table below:

  • The individual rules KPN0008 and KPN0009 each have expected category ‘nwt I’ on their own, but in combination we expect ‘nwt R’, so we need to specify the rule for the combination KPN0008 & KPN0009.
  • The expected category for genomes meeting rule KPN0002 (i.e. carrying core gene oqxA, => wt S) in addition to rule KPN0008 (i.e. with an acquired gyrA mutation, => nwt I) is nwt I. This is the same as the most resistant component rule (KPN0008), so we do not need to specify the combination explicitly.

Note that this also means the combination must be specified explicitly if the combined effect is LESS resistant than the ‘most resistant’ component. E.g. in an example from TB, a deletion in one gene renders the resistance mutation in another gene irrelevant, so the combination must be specified.

| ID | gene | mutation | variation type | drug | category |
|---|---|---|---|---|---|
| KPN0002 | oqxA | - | Gene presence detected | ciprofloxacin | wt S |
| KPN0008 | gyrA | p.Ser83Tyr | Protein variant detected | ciprofloxacin | nwt I |
| KPN0009 | parC | p.Ser80Ile | Protein variant detected | ciprofloxacin | nwt I |
| KPN0013 | qnr | - | Gene presence detected | ciprofloxacin | nwt I |
| KPN0051 | KPN0008 & KPN0009 | - | Combination | ciprofloxacin | nwt R |
| KPN0052 | (KPN0008 \| KPN0009) & KPN0013 | - | Combination | ciprofloxacin | nwt R |
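The ‘most resistant’ default and the role of explicit combination rules can be sketched as below. Several choices here are assumptions rather than part of the specification: the function names, taking the maximum independently on the two scales (nwt > wt and R > I > S), and letting a matched combination rule supersede the categories of its component rules, which is one way to allow a combination to be LESS resistant than its components.

```python
PHENOTYPE_RANK = {"wt": 0, "nwt": 1}      # deviation from wild type: nwt > wt
CLINICAL_RANK = {"S": 0, "I": 1, "R": 2}  # exceeding breakpoints: R > I > S


def most_resistant(categories: list) -> str:
    """Combine categories such as 'nwt I' by taking the maximum on each scale."""
    if not categories:
        raise ValueError("no categories to combine")
    pairs = [category.split() for category in categories]
    phenotype = max((p for p, _ in pairs), key=PHENOTYPE_RANK.__getitem__)
    clinical = max((c for _, c in pairs), key=CLINICAL_RANK.__getitem__)
    return f"{phenotype} {clinical}"


def predict(matched_single: dict, matched_combos: list) -> str:
    """Predict a category from matched rules.

    matched_single: rule ID -> category for each single rule that matched.
    matched_combos: (component rule IDs, category) for each combination rule
                    whose logical expression is satisfied.
    """
    superseded = set()
    categories = []
    for components, category in matched_combos:
        # Assumption: an explicit combination replaces its components' categories.
        superseded |= set(components) & set(matched_single)
        categories.append(category)
    categories += [
        category
        for rule_id, category in matched_single.items()
        if rule_id not in superseded
    ]
    return most_resistant(categories)


# KPN0008 + KPN0009 with the explicit combination rule KPN0051
print(predict({"KPN0008": "nwt I", "KPN0009": "nwt I"},
              [({"KPN0008", "KPN0009"}, "nwt R")]))
# KPN0002 + KPN0008 with no combination rule: falls back to the most resistant component
print(predict({"KPN0002": "wt S", "KPN0008": "nwt I"}, []))
```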