
Module: Alphabets

Hannes Hauswedell edited this page Mar 23, 2017 · 10 revisions

Possible Layout

alph/alphabet.hpp               // Alphabet concept, generic ostream, serialization
alph/alphabet_container.hpp     // generic alphabet traits for basic_string adaption and string->ostream adapters; container-serialization

alph/nucl/dna4.hpp                   // dna alphabet definition; alias from dna4 to dna
alph/nucl/dna4_container.hpp         // dna traits specialization; aliases for vector; literal

alph/nucl/dna5.hpp                   // plus N
alph/nucl/dna5_container.hpp

alph/nucl/nucl16.hpp                  // full IUPAC code (without gaps), U and T as distinct characters
alph/nucl/nucl16_container.hpp

alph/nucl/rna4.hpp                   // rna4 alphabet definition; alias from rna4 to rna; inherits dna4
alph/nucl/rna4_container.hpp         // n; aliases for vector; literal

alph/nucl/rna5.hpp                   // ...
alph/nucl/rna5_container.hpp

alph/nucl/conversion.hpp             // code for converting between different nucl alphabets and containers
alph/nucl/conversion_container.hpp   // code for converting containers; view implementation

alph/aminoacid.hpp
alph/aminoacid/aa27.hpp                   // amino acid (27 letter code)
alph/aminoacid/aa27_container.hpp

alph/aminoacid/aa10murphy.hpp             // murphy reduction (10 letter code)
alph/aminoacid/aa10murphy_container.hpp

alph/aminoacid/conversion.hpp             // code for converting between different amino acid alphabets and containers
alph/aminoacid/conversion_container.hpp   // code for converting containers; view implementations

alph/quality.hpp
alph/quality/phred.hpp              // phred quality scores

alph/gaps.hpp
alph/gaps/gapped_alphabet.hpp       // an alphabet that wraps another alphabet and adds a gap character
alph/gaps/gaps.hpp                  // a stand-alone 0-or-1 alphabet that can be included in a compound alphabet

alph/translation.hpp            // code for translating nucl -> amino acid

General notes

  • the rna* alphabets inherit from the corresponding dna* alphabets and only redefine the value_to_char static member
  • there will be an alias from dna4 to dna and rna4 to rna
  • quality is an independent alphabet (likely won't be implemented during the retreat)
  • there will be a compound_alphabet concept, where a character can consist of multiple characters, e.g. dna5 and phred; by default this will use two bytes, but bit-compressing containers may/shall compress this to less than a byte
  • Behavior of IUPAC wildcards (e.g., whether N matches or mismatches {A,C,G,T}) will be controlled by a comparison functor that can optionally be passed to algorithms (for alignment or pattern matching).
  • we support general containers like std::vector. std::string will work, but only in a limited fashion (and isn't recommended)
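
Two of the notes above can be sketched in a few lines of C++17. This is only an illustration under assumptions: the member names rank, value_size and to_char, the rank order A, C, G, T, N for dna5, and the functor and function names strict_equal, wildcard_equal and count_mismatches are all hypothetical and not taken from an existing implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical rank-based dna4; the stored value is the rank 0..3.
struct dna4
{
    std::uint8_t rank;

    static constexpr std::size_t value_size = 4;
    static constexpr char value_to_char[value_size] = {'A', 'C', 'G', 'T'};

    constexpr char to_char() const noexcept { return value_to_char[rank]; }
};

// rna4 reuses the storage and ranks of dna4 and only swaps the printing table.
// Static members are not virtual, so to_char() is redeclared to pick up the new table.
struct rna4 : dna4
{
    static constexpr char value_to_char[value_size] = {'A', 'C', 'G', 'U'};

    constexpr char to_char() const noexcept { return value_to_char[rank]; }
};

// Hypothetical dna5 with an assumed rank order A, C, G, T, N.
struct dna5
{
    std::uint8_t rank;

    static constexpr std::uint8_t n_rank = 4;
};

// N is treated as an ordinary letter: it only equals itself.
struct strict_equal
{
    constexpr bool operator()(dna5 a, dna5 b) const noexcept { return a.rank == b.rank; }
};

// N is treated as a wildcard: it matches every letter.
struct wildcard_equal
{
    constexpr bool operator()(dna5 a, dna5 b) const noexcept
    {
        return a.rank == b.rank || a.rank == dna5::n_rank || b.rank == dna5::n_rank;
    }
};

// Toy algorithm that takes the comparison functor as an optional argument.
template <typename equal_type = strict_equal>
std::size_t count_mismatches(std::vector<dna5> const & lhs,
                             std::vector<dna5> const & rhs,
                             equal_type equal = {})
{
    assert(lhs.size() == rhs.size());

    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < lhs.size(); ++i)
        if (!equal(lhs[i], rhs[i]))
            ++mismatches;

    return mismatches;
}
```

With this sketch, the same rank prints as T through dna4 and as U through rna4, and a caller opts into wildcard semantics by passing wildcard_equal{} as the third argument of count_mismatches.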

Open Questions

  • should the default dna be dna4 or dna16?

Discussion and design decisions

There are two general designs one can pick: either store the numeric rank of each alphabet letter internally, i.e. 0, 1, 2, 3 for dna (or an enum with these values), the "rank approach"; or store the actual char values internally, i.e. 65, 67, ... (the ASCII codes of A, C, ...), the "char-like approach".

In any case we would want the type to be POD and be somewhat usable in both general containers (e.g. std::vector) and in std::basic_string.

rank approach

pro:

  • no conversion when working with rank which is what you do in indexing and alignment and all serious work-loads
  • default initialized alphabet character has valid alphabet value, because 0 is a valid alphabet value

con:

  • no initialization from char, because no user defined constructors (user defined assignment works, but would be confusing to have one and not the other)
  • implicit conversion to char possible, but implicit conversion to numeric value makes more sense; explicit conversion to char ok, this implies table lookup
  • can be used in basic_strings, but some things are broken, e.g. the char_traits length function (the strlen equivalent) and possibly some other things [because they expect a 0 terminator, and 0 is a member of our alphabet]
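
A minimal sketch of the rank approach, assuming C++17, may make these trade-offs more concrete. All names (rank, rank_to_char, to_rank, to_char, assign_char, zero_terminated_length) are hypothetical. The helper zero_terminated_length only mimics how a generic, strlen-like length function scans for a zero terminator; it is not the standard library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical rank-based dna4: only the rank (0..3) is stored.
struct dna4
{
    std::uint8_t rank;

    static constexpr char rank_to_char[4] = {'A', 'C', 'G', 'T'};

    // Reading the rank is free; no lookup is involved.
    constexpr std::uint8_t to_rank() const noexcept { return rank; }

    // Converting to char costs one table lookup.
    constexpr char to_char() const noexcept { return rank_to_char[rank]; }

    // Without user-defined constructors, char input goes through a named function.
    // Characters outside the alphabet are mapped to 'A' (an arbitrary choice for this sketch).
    dna4 & assign_char(char c) noexcept
    {
        switch (c)
        {
            case 'C': rank = 1; break;
            case 'G': rank = 2; break;
            case 'T': rank = 3; break;
            default:  rank = 0; break; // 'A' and every unknown character
        }
        return *this;
    }
};

// The type stays trivial and standard-layout, i.e. POD-like.
static_assert(std::is_trivial<dna4>::value && std::is_standard_layout<dna4>::value,
              "dna4 should stay POD-like");

// Mimics a strlen-like scan: stop at the first value-initialized element.
inline std::size_t zero_terminated_length(dna4 const * s) noexcept
{
    assert(s != nullptr);

    std::size_t n = 0;
    while (s[n].rank != dna4{}.rank)
        ++n;

    return n;
}
```

A value-initialized dna4 is a valid A here, which is exactly why a buffer holding C, A, G followed by a zero terminator is reported as having length 1 by the scan: the A in the second position is indistinguishable from the terminator.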

char-like approach

pro:

  • reading as char is free, because the internal char needs no conversion
  • can be assigned and constructed from char, although this implies a narrowing conversion to the actual alphabet
  • easier conversion between alphabets, needs only n tables for n alphabets, not n²
  • can be used in basic_strings with no problem; better compatibility with c-strings

con:

  • getting the rank (or ord-value) implies a table lookup which may be expensive in the long run, especially since this is used very often
  • default initialized character has value 0 which is outside the alphabet
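
For comparison, the char-like approach could look roughly like the following sketch. The names dna4_char, rank_of and invalid_rank are hypothetical, only upper-case letters are handled, and mapping unknown characters to 'A' is an arbitrary choice made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Marker returned for characters that are not part of the alphabet (an assumption of this sketch).
constexpr std::uint8_t invalid_rank = 255;

// Hypothetical char-like dna4: the stored value is the character itself.
struct dna4_char
{
    char value;

    // Reading as char is free; the stored value is returned unchanged.
    constexpr char to_char() const noexcept { return value; }

    // Rank lookup goes through a 256-entry table that is built once on first use.
    static std::uint8_t rank_of(char c) noexcept
    {
        static std::array<std::uint8_t, 256> const table = []
        {
            std::array<std::uint8_t, 256> t{};
            t.fill(invalid_rank);
            t['A'] = 0; t['C'] = 1; t['G'] = 2; t['T'] = 3;
            return t;
        }();

        return table[static_cast<unsigned char>(c)];
    }

    // Getting the rank implies a table lookup on every call.
    std::uint8_t to_rank() const noexcept { return rank_of(value); }

    // Assignment from char narrows to the alphabet: unknown characters become 'A'.
    dna4_char & assign_char(char c) noexcept
    {
        value = (rank_of(c) == invalid_rank) ? 'A' : c;
        assert(rank_of(value) != invalid_rank);
        return *this;
    }
};
```

A value-initialized dna4_char holds the character 0, for which rank_of returns invalid_rank; this is the "outside the alphabet" problem listed above.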

result

We decided on the rank approach for performance reasons. If required, char-like-approach alphabets can be added later on.

Prototype implementation

translation

enum struct translation_frames : uint8_t
{


};

enum struct genetic_code : uint8_t
{


};

void translate(output, input);

template <typename execution_policy_type,
          typename output_type,
          typename input_type>
    requires sequence_concept<output_type> && std::is_same_v<typename output_type::value_type, aa27>
void translate(execution_policy_type && policy, output_type & output, input_type const & input, genetic_code code, translation_frames frames);

void translate(execution_policy, ...)
{
    translate(...);
}
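
To illustrate what the two-argument translate overload might do, here is a minimal sketch that works on plain std::string instead of the planned alphabet types. It is an assumption-laden stand-in: it hard-codes the standard genetic code (NCBI translation table 1), handles only the first forward frame, and does not model the genetic_code and translation_frames enums, whose values are left open above. The helper base_index and the use of 'X' for codons containing unknown bases are likewise choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Standard genetic code (NCBI translation table 1), indexed with base order T, C, A, G.
constexpr char standard_code[] =
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

static_assert(sizeof(standard_code) == 65, "64 codons plus the terminating zero");

// Index of a base in T, C, A, G order; -1 for anything else. U is accepted as T.
constexpr int base_index(char c) noexcept
{
    return (c == 'T' || c == 'U') ? 0
         : (c == 'C')             ? 1
         : (c == 'A')             ? 2
         : (c == 'G')             ? 3
         :                          -1;
}

// Translates the first forward frame only; trailing bases that do not form
// a complete codon are ignored and codons with unknown bases become 'X'.
inline void translate(std::string & output, std::string const & input)
{
    output.clear();

    for (std::size_t i = 0; i + 2 < input.size(); i += 3)
    {
        int const b0 = base_index(input[i]);
        int const b1 = base_index(input[i + 1]);
        int const b2 = base_index(input[i + 2]);

        if (b0 < 0 || b1 < 0 || b2 < 0)
        {
            output.push_back('X');
            continue;
        }

        int const index = 16 * b0 + 4 * b1 + b2;
        assert(index < 64);
        output.push_back(standard_code[index]);
    }
}
```

In a rank-based design the base_index step would disappear if the nucleotide ranks were used to index the codon table directly, which is one of the performance arguments made for the rank approach.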