Index Revamp

cpockrandt wrote:

I recently talked to Knut about generalized suffix arrays / FM indices. Indices in SeqAn2 use a sentinel substitute by default (i.e., the alphabet size is not increased). For Strings, the position (unsigned) of the sentinel is stored, for StringSets a bitvector with the sentinel positions is stored (see methods _createBwt() in index_fm_lf_table.h).

When rank queries are performed with the character that is the sentinel substitute (the character of the original alphabet that occurs the least number of times in the text), it is corrected (by looking at the unsigned, resp. performing rank queries on the sentinel bitvector).

I think SeqAn3 at the moment implements indices on StringSets by introducing a sentinel into the text prior to indexing, and computing the locations afterwards. Especially for genomic datasets (only a few long DNA sequences), it is probably better to use implicit sentinels and use a bitvector / sparse bitvector to store the sentinel positions.

This would require to change the implementation of rank queries (bwt.rank()), otherwise one might find pseudo-matches that span multiple sequences since they are separated by a sentinel substitute. This should be fairly easy doable in SeqAn3 since we already have logic moved out into private functions such as backward_search).

The more challenging part is the BWT construction. One would have to build the SA on the string with sentinels and replace the sentinels with its substitutes when building the BWT. Otherwise the SA sorting would go wrong without a custom SA-indexing algorithm as we did in SeqAn2 that handles sentinels implicitly. From my perspective it would be cleaner to leave the SA index construction algorithm as it is (so one can use libdivsufsort or any other algorithm) and change the BWT construction similar to what we do in SeqAn2 (see _createBwt).

Knut suggested: https://www.sciencedirect.com/science/article/pii/S0304397512001211

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Index Revamp

Getting started

Guidelines

General Reading

Modern C++ and design patterns

Others

Other links

Clone this wiki locally