Module: Alphabets
Hannes Hauswedell edited this page Mar 23, 2017 · 10 revisions
alph/alphabet.hpp // Alphabet concept, generic ostream, serialization
alph/alphabet_container.hpp // generic alphabet traits for basic_string adaption and string->ostream adapters; container-serialization
alph/nucl/dna4.hpp // dna alphabet definition; alias from dna4 to dna
alph/nucl/dna4_container.hpp // dna traits specialization; aliases for vector; literal
alph/nucl/dna5.hpp // plus N
alph/nucl/dna5_container.hpp
alph/nucl/nucl16.hpp // full IUPAC code (without gaps), U and T as distinct characters
alph/nucl/nucl16_container.hpp
alph/nucl/rna4.hpp // rna4 alphabet definition; alias from rna4 to rna; inherits dna4
alph/nucl/rna4_container.hpp // n; aliases for vector; literal
alph/nucl/rna5.hpp // ...
alph/nucl/rna5_container.hpp
alph/nucl/conversion.hpp // code for converting between different nucl alphabets and containers
alph/nucl/conversion_container.hpp // code for converting containers; view implementation
alph/aminoacid.hpp
alph/aminoacid/aa27.hpp // amino acid (27 letter code)
alph/aminoacid/aa27_container.hpp
alph/aminoacid/aa10murphy.hpp // murphy reduction (10 letter code)
alph/aminoacid/aa10murphy_container.hpp
alph/aminoacid/conversion.hpp // code for converting between different amino acid alphabets and containers
alph/aminoacid/conversion_container.hpp // code for converting containers; view implementations
alph/quality.hpp
alph/quality/phred.hpp // phred quality scores
alph/gaps.hpp
alph/gaps/gapped_alphabet.hpp // an alphabet that wraps another alphabet and adds a gap character
alph/gaps/gaps.hpp // a stand-alone 0-or-1 alphabet that can be included in a compound alphabet
alph/translation.hpp // code for translating nucl -> amino acid
- the rna* alphabets inherit the corresponding dna* alphabets and just overwrite the value_to_char static member
- there will be an alias from dna4 to dna and from rna4 to rna
- quality is an independent alphabet (likely won't be implemented during the retreat)
- there will be a compound_alphabet concept, where a character can consist of multiple characters, e.g. dna5 and phred; by default this will use two bytes, but bit-compressing containers may/shall compress this to less than a byte
- Behavior of IUPAC wildcards (e.g., N matches or mismatches {A,C,G,T}) will be controlled by a comparison functor to be provided (optionally) to algorithms (for alignment or pattern matching).
- we support general containers like std::vector. std::string will work, but only in a limited fashion (and isn't recommended)
- should the default dna be dna4 or dna16?
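The wildcard item in the list above could be sketched roughly as follows. This is only an illustration of the idea, not the planned interface: dna5_sketch, n_matches, n_mismatches and count_mismatches are hypothetical names, and the rank order A=0, C=1, G=2, T=3, N=4 is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical rank-based dna5 letter (assumed order: A=0, C=1, G=2, T=3, N=4).
struct dna5_sketch
{
    std::uint8_t rank;
};

constexpr std::uint8_t rank_n = 4;

// Policy 1: the wildcard N matches every letter.
struct n_matches
{
    constexpr bool operator()(dna5_sketch a, dna5_sketch b) const
    {
        return a.rank == b.rank || a.rank == rank_n || b.rank == rank_n;
    }
};

// Policy 2: the wildcard N mismatches everything, including another N.
struct n_mismatches
{
    constexpr bool operator()(dna5_sketch a, dna5_sketch b) const
    {
        return a.rank == b.rank && a.rank != rank_n;
    }
};

// Toy algorithm: count mismatching positions of two equally long sequences.
// The comparison functor is optional and defaults to n_matches.
template <typename compare_type = n_matches>
std::size_t count_mismatches(std::vector<dna5_sketch> const & lhs,
                             std::vector<dna5_sketch> const & rhs,
                             compare_type equal = compare_type{})
{
    assert(lhs.size() == rhs.size());
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < lhs.size(); ++i)
        if (!equal(lhs[i], rhs[i]))
            ++mismatches;
    return mismatches;
}
```

The alphabet itself stays free of any matching policy; callers that care pass a functor, everyone else gets the default.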
There are two general designs one can pick: either save the numeric ranks of each alphabet letter internally, i.e. 0, 1, 2, 3 for dna (or an enum with these values) = the "rank approach",
OR save the actual char values internally, i.e. 65 ('A'), 67 ('C'), ... = the "char like" approach.
In any case we would want the type to be POD and be somewhat usable in both general containers (e.g. std::vector) and in std::basic_string.
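To make the two designs above concrete, here is a minimal side-by-side sketch. The type names dna4_rank and dna4_char and their member functions are made up for illustration, and the conditional chain in to_rank() merely stands in for the lookup table a real implementation would probably use.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Rank approach: store 0..3 and look the character up on demand.
struct dna4_rank
{
    std::uint8_t value; // 0 = A, 1 = C, 2 = G, 3 = T

    constexpr std::uint8_t to_rank() const { return value; }          // free
    constexpr char to_char() const { return "ACGT"[value]; }          // lookup
};

// Char-like approach: store the character itself.
struct dna4_char
{
    char value; // 'A', 'C', 'G' or 'T'

    constexpr char to_char() const { return value; }                  // free
    constexpr std::uint8_t to_rank() const                            // lookup
    {
        // Conditional chain used here only to keep the sketch short.
        return value == 'A' ? 0 : value == 'C' ? 1 : value == 'G' ? 2 : 3;
    }
};

// Both variants have no user-defined constructors and stay POD-like.
static_assert(std::is_trivial<dna4_rank>::value && std::is_standard_layout<dna4_rank>::value,
              "rank variant should be POD");
static_assert(std::is_trivial<dna4_char>::value && std::is_standard_layout<dna4_char>::value,
              "char variant should be POD");
```

A value-initialized dna4_rank holds rank 0, which is a valid letter; a value-initialized dna4_char holds the byte 0, which is not.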
Rank approach, pro:
- no conversion when working with rank which is what you do in indexing and alignment and all serious work-loads
- default initialized alphabet character has valid alphabet value, because 0 is a valid alphabet value
con:
- no initialization from char, because no user defined constructors (user defined assignment works, but would be confusing to have one and not the other)
- implicit conversion to char possible, but implicit conversion to numeric value makes more sense; explicit conversion to char ok, this implies table lookup
- can be used in basic_strings, but some things are broken, e.g. the basic_string's char_traits' strlen function and possibly some other things [because they expect a 0 terminator which is a member of our alphabet]
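The zero-terminator problem from the last point can be reproduced without any alphabet type at all. The sketch below simply stores ranks in plain chars (assuming A=0, C=1, G=2, T=3) and asks std::char_traits for the length; the names cagt_ranks, length_via_traits and length_via_explicit_size are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Ranks of the sequence "CAGT" under the rank approach (A=0, C=1, G=2, T=3),
// followed by the terminator that C-string style interfaces expect.
constexpr char cagt_ranks[] = {1, 0, 2, 3, 0};

// Length as computed by scanning for the first zero: stops at the 'A'.
inline std::size_t length_via_traits()
{
    return std::char_traits<char>::length(cagt_ranks);
}

// Length when the size is passed explicitly: all four letters are kept.
inline std::size_t length_via_explicit_size()
{
    return std::string(cagt_ranks, 4).size();
}
```

Interfaces that carry an explicit size keep working; anything that searches for a terminator cuts the sequence at the first letter of rank 0.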
Char-like approach, pro:
- reading as char is free, because the internal char needs no conversion
- can be assigned and constructed from char, although this implies a narrowing conversion to the actual alphabet
- easier conversion between alphabets, needs only n tables for n alphabets, not n²
- can be used in basic_strings with no problem; better compatibility with c-strings
con:
- getting the rank (or ord-value) implies a table lookup which may be expensive in the long run, especially since this is used very often
- default initialized character has value 0 which is outside the alphabet
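One way to read the "n tables" point: with char-like letters, every alphabet only needs a single table from arbitrary chars into itself, and converting from any other alphabet reuses that table. The sketch below assumes this reading; make_table, convert, the to_* functions and the fallback letters for unrepresentable input are all hypothetical choices.

```cpp
#include <array>
#include <cassert>
#include <initializer_list>
#include <utility>

using char_table = std::array<char, 256>;

// Build a table that maps every possible char onto a letter of one alphabet.
// `fallback` is used for chars without a counterpart (the narrowing case).
inline char_table make_table(char fallback,
                             std::initializer_list<std::pair<char, char>> mapping)
{
    char_table table;
    table.fill(fallback);
    for (auto const & m : mapping)
        table[static_cast<unsigned char>(m.first)] = m.second;
    return table;
}

// One table per target alphabet: three alphabets, three tables.
inline char_table const & to_dna4()
{
    static char_table const t =
        make_table('A', {{'A', 'A'}, {'C', 'C'}, {'G', 'G'}, {'T', 'T'}, {'U', 'T'}});
    return t;
}

inline char_table const & to_rna4()
{
    static char_table const t =
        make_table('A', {{'A', 'A'}, {'C', 'C'}, {'G', 'G'}, {'U', 'U'}, {'T', 'U'}});
    return t;
}

inline char_table const & to_dna5()
{
    static char_table const t =
        make_table('N', {{'A', 'A'}, {'C', 'C'}, {'G', 'G'}, {'T', 'T'}, {'U', 'T'}});
    return t;
}

// Converting from any source alphabet is a single lookup in the target's table.
inline char convert(char_table const & target, char source_letter)
{
    return target[static_cast<unsigned char>(source_letter)];
}
```

With ranks stored internally, the same conversions would need a dedicated rank-to-rank table for each ordered pair of alphabets, or a detour through char.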
We decided on the rank approach for performance reasons. If required, char-like-approach alphabets can be added later on.
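Under the rank approach, the rna*/dna* relationship mentioned in the list further up might look like the sketch below. It is an assumption-laden illustration: the _sketch names are made up, value_to_char is modelled as a static member function rather than a table, and to_char() is repeated in the derived type so that it picks up the shadowing function.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

struct dna4_sketch
{
    std::uint8_t rank; // A=0, C=1, G=2, T=3

    static constexpr char value_to_char(std::uint8_t r) { return "ACGT"[r]; }
    constexpr char to_char() const { return value_to_char(rank); }
};

struct rna4_sketch : dna4_sketch
{
    // Shadows the dna4 mapping; only the letter for rank 3 differs.
    static constexpr char value_to_char(std::uint8_t r) { return "ACGU"[r]; }
    // Repeated so that the call resolves to the shadowing function above.
    constexpr char to_char() const { return value_to_char(rank); }
};

// Aliases as described in the notes: dna4 -> dna, rna4 -> rna.
using dna_sketch = dna4_sketch;
using rna_sketch = rna4_sketch;

// Inheriting adds no data, so both letters occupy the same single byte.
static_assert(sizeof(rna4_sketch) == sizeof(dna4_sketch), "no size overhead expected");
```

Because the stored rank is identical, viewing an rna letter through its dna base prints T where the rna type prints U.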
enum struct translation_frames : uint8_t
{
};
enum struct genetic_code : uint8_t
{
};
void translate(output, input);
template <typename execution_policy_type,
typename output_type,
typename input_type>
requires sequence_concept<output_type> && std::is_same_v<typename output_type::value_type, aa27>
void translate(execution_policy, output, input, genetic_code, frames);
void translate(execution_policy, ...)
{
translate(...);
}
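For orientation, the core of a single-frame translation over rank-encoded dna4 could be as small as the sketch below. It does not implement the interface above: translate_frame is a hypothetical helper, the input is a plain vector of ranks (A=0, C=1, G=2, T=3 assumed), the output is a std::string of one-letter amino acid codes instead of an aa27 container, and only the standard genetic code is covered.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Translate one reading frame, starting at `offset`, into amino acid letters.
// Trailing bases that do not form a complete codon are ignored.
inline std::string translate_frame(std::vector<std::uint8_t> const & dna_ranks,
                                   std::size_t offset = 0)
{
    // Standard genetic code, indexed by 16 * first + 4 * second + third
    // with ranks in A, C, G, T order; '*' marks a stop codon.
    static char const standard_code[] =
        "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF";

    std::string protein;
    for (std::size_t i = offset; i + 3 <= dna_ranks.size(); i += 3)
    {
        assert(dna_ranks[i] < 4 && dna_ranks[i + 1] < 4 && dna_ranks[i + 2] < 4);
        protein += standard_code[16 * dna_ranks[i] + 4 * dna_ranks[i + 1] + dna_ranks[i + 2]];
    }
    return protein;
}
```

This is where the rank approach pays off: the codon index is computed directly from the stored values without any char-to-rank lookup.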