Module: Alphabets

Hannes Hauswedell edited this page Mar 14, 2017 · 10 revisions

Possible Layout

alph/alphabet.hpp               // Alphabet concept, generic ostream, serialization
alph/alphabet_container.hpp     // generic alphabet traits for basic_string adaption and string->ostream adapters; container-serialization

alph/nucl/dna4.hpp                   // dna alphabet definition; alias from dna4 to dna
alph/nucl/dna4_container.hpp         // dna traits specialization; aliases for vector; literal

alph/nucl/dna5.hpp                   // plus N
alph/nucl/dna5_container.hpp

alph/nucl/nucl16.hpp                  // full IUPAC code (without gaps), U and T as distinct characters
alph/nucl/nucl16_container.hpp

alph/nucl/rna4.hpp                   // rna4 alphabet definition; alias from rna4 to rna; inherits dna4
alph/nucl/rna4_container.hpp         // rna traits specialization; aliases for vector; literal

alph/nucl/rna5.hpp                   // ...
alph/nucl/rna5_container.hpp

alph/nucl/conversion.hpp             // code for converting between different nucl alphabets and containers
alph/nucl/conversion_container.hpp   // code for converting containers; view implementation

alph/aminoacid.hpp
alph/aminoacid/aa27.hpp                   // amino acid (27 letter code)
alph/aminoacid/aa27_container.hpp

alph/aminoacid/aa10murphy.hpp             // murphy reduction (10 letter code)
alph/aminoacid/aa10murphy_container.hpp

alph/aminoacid/conversion.hpp             // code for converting between different amino acid alphabets and containers
alph/aminoacid/conversion_container.hpp   // code for converting containers; view implementations

alph/quality.hpp
alph/quality/phred.hpp              // phred quality scores

alph/gaps.hpp
alph/gaps/gapped_alphabet.hpp       // an alphabet that wraps another alphabet and adds a gap character
alph/gaps/gaps.hpp                  // a stand-alone 0-or-1 alphabet that can be included in a compound alphabet

alph/translation.hpp            // code for translating nucl -> amino acid

General notes

  • the rna* alphabets inherit from the corresponding dna* alphabets and only override the value_to_char static member
  • there will be an alias from dna4 to dna and rna4 to rna
  • quality is an independent alphabet (likely won't be implemented during the retreat)
  • there will be a compound_alphabet concept, where a character can consist of multiple characters, e.g. dna5 and phred; by default this will use two bytes, but bit-compressing containers may/shall compress this to less than a byte
  • we support general containers like std::vector. std::string will work, but only in a limited fashion (and isn't recommended)
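
The inheritance and alias notes above could look roughly like the following sketch. It is only an illustration under assumptions: the `rank` data member and the free function `to_char` are hypothetical names, and `value_to_char` is modelled here as a static member function, although the notes do not say whether it is a function or a table.

```cpp
#include <cstdint>

// Hypothetical sketch of the dna4 alphabet using the rank representation.
struct dna4
{
    std::uint8_t rank; // assumed member name; holds 0..3

    // Static member that maps a rank to its printable character.
    static char value_to_char(std::uint8_t const r)
    {
        static char const table[4] = {'A', 'C', 'G', 'T'};
        return table[r];
    }
};

// rna4 inherits everything from dna4 and only replaces value_to_char.
struct rna4 : dna4
{
    static char value_to_char(std::uint8_t const r)
    {
        static char const table[4] = {'A', 'C', 'G', 'U'};
        return table[r];
    }
};

// The aliases mentioned in the notes.
using dna = dna4;
using rna = rna4;

// Hypothetical generic helper: picks the value_to_char of the static type,
// so rna4's version hides the one inherited from dna4.
template <typename alphabet_t>
char to_char(alphabet_t const letter)
{
    return alphabet_t::value_to_char(letter.rank);
}
```

With this layout, a letter of rank 3 prints as T when it is a dna and as U when it is an rna, while the stored value is identical.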

Open Questions

  • should the default dna be dna4 or dna16?

Discussion and design decisions

There are two general designs to choose from. The "rank approach" stores the numeric rank of each alphabet letter internally, i.e. 0, 1, 2, 3 for dna (or an enum with these values). The "char-like approach" stores the actual char values internally, i.e. 65, 67, ... (the ASCII codes of 'A', 'C', ...).

In any case we would want the type to be POD and be somewhat usable in both general containers (e.g. std::vector) and in std::basic_string.

rank approach

pro:

  • no conversion when working with rank which is what you do in indexing and alignment and all serious work-loads
  • default initialized alphabet character has valid alphabet value, because 0 is a valid alphabet value

con:

  • no initialization from char, because no user defined constructors (user defined assignment works, but would be confusing to have one and not the other)
  • implicit conversion to char possible, but implicit conversion to numeric value makes more sense; explicit conversion to char ok, this implies table lookup
  • can be used in basic_strings, but some things are broken, e.g. the basic_string's char_traits' strlen function and possibly some other things [because they expect a 0 terminator which is a member of our alphabet]
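
To make these trade-offs concrete, here is a minimal sketch of a rank-based letter type. The names `dna4_rank`, `rank` and `zero_terminated_length` are hypothetical, and mapping unknown characters to rank 0 is an assumption made only for this example. The helper function imitates what a zero-terminator scan (as in a `strlen`-style length function) would do; it is not the actual `char_traits` code.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical rank-based letter: stores 0..3 instead of the character.
struct dna4_rank
{
    std::uint8_t rank;

    // Assignment from char is possible without giving up POD-ness,
    // because this is not a copy assignment operator.
    dna4_rank & operator=(char const c)
    {
        switch (c)
        {
            case 'C': rank = 1; break;
            case 'G': rank = 2; break;
            case 'T': rank = 3; break;
            default:  rank = 0; break; // 'A' and (assumption) every unknown char
        }
        return *this;
    }

    // Explicit conversion to char costs a table lookup.
    explicit operator char() const
    {
        static char const table[4] = {'A', 'C', 'G', 'T'};
        return table[rank];
    }
};

// No user-defined constructors, so the type stays trivial and standard-layout.
static_assert(std::is_trivial<dna4_rank>::value &&
              std::is_standard_layout<dna4_rank>::value,
              "dna4_rank is expected to be a POD type");

// Imitates a zero-terminator scan: stops at the first letter with rank 0.
inline std::size_t zero_terminated_length(dna4_rank const * s)
{
    std::size_t n = 0;
    while (s[n].rank != 0)
        ++n;
    return n;
}
```

A value-initialized `dna4_rank` has rank 0 and therefore already represents A. The same property is what breaks terminator-based length functions: for the sequence C, G, A, T the scan stops at the A and reports a length of 2.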

char-like approach

pro:

  • reading as char is free, because the internal char needs no conversion
  • can be assigned and constructed from char, although this implies a narrowing conversion to the actual alphabet
  • easier conversion between alphabets, needs only n tables for n alphabets, not n²
  • can be used in basic_strings without problems; better compatibility with c-strings

con:

  • getting the rank (or ord-value) implies a table lookup which may be expensive in the long run, especially since this is used very often
  • default initialized character has value 0 which is outside the alphabet
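
For comparison, a char-like letter might be sketched as below. Again this is only an illustration: `dna4_char`, `value`, `to_rank` and `dna4_char_to_rank` are hypothetical names, and sending every character outside the alphabet to rank 0 is an assumption.

```cpp
#include <array>
#include <cstdint>

// Hypothetical lookup from character to rank, backed by a 256-entry table
// that is built once on first use.
inline std::uint8_t dna4_char_to_rank(char const c)
{
    static std::array<std::uint8_t, 256> const table = []
    {
        std::array<std::uint8_t, 256> t{}; // all entries start at rank 0
        t['C'] = 1;
        t['G'] = 2;
        t['T'] = 3;
        return t;
    }();
    return table[static_cast<unsigned char>(c)];
}

// Hypothetical char-like letter: stores the character itself.
struct dna4_char
{
    char value;

    // Reading as char is free.
    char to_char() const { return value; }

    // Getting the rank needs the table lookup on every call.
    std::uint8_t to_rank() const { return dna4_char_to_rank(value); }
};
```

Here `dna4_char{'G'}` returns its character without any lookup, but each `to_rank()` call goes through the table, and a value-initialized `dna4_char` holds the character 0, which is not a letter of the alphabet.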

result

We decided on the rank approach for performance reasons.

Prototype implementation