-
Notifications
You must be signed in to change notification settings - Fork 82
Index Revamp
cpockrandt wrote:
I recently talked to Knut about generalized suffix arrays / FM indices. Indices in SeqAn2 use a sentinel substitute by default (i.e., the alphabet size is not increased). For
Strings
, the position (unsigned) of the sentinel is stored, forStringSets
a bitvector with the sentinel positions is stored (see methods_createBwt()
inindex_fm_lf_table.h
).When rank queries are performed with the character that is the sentinel substitute (the character of the original alphabet that occurs the least number of times in the text), it is corrected (by looking at the unsigned, resp. performing rank queries on the sentinel bitvector).
I think SeqAn3 at the moment implements indices on StringSets by introducing a sentinel into the text prior to indexing, and computing the locations afterwards. Especially for genomic datasets (only a few long DNA sequences), it is probably better to use implicit sentinels and use a bitvector / sparse bitvector to store the sentinel positions.
This would require to change the implementation of rank queries (
bwt.rank()
), otherwise one might find pseudo-matches that span multiple sequences since they are separated by a sentinel substitute. This should be fairly easy doable in SeqAn3 since we already have logic moved out into private functions such asbackward_search
).The more challenging part is the BWT construction. One would have to build the SA on the string with sentinels and replace the sentinels with its substitutes when building the BWT. Otherwise the SA sorting would go wrong without a custom SA-indexing algorithm as we did in SeqAn2 that handles sentinels implicitly. From my perspective it would be cleaner to leave the SA index construction algorithm as it is (so one can use libdivsufsort or any other algorithm) and change the BWT construction similar to what we do in SeqAn2 (see
_createBwt
).
Knut suggested: https://www.sciencedirect.com/science/article/pii/S0304397512001211