Locality-sensitive bucketing (LSB) functions generalize the widely used locality-sensitive hashing (LSH) methods that are designed to recognize similar but not necessarily identical sequences. This capability is useful in many bioinformatics applications, including homology detection, overlap graph construction, and phylogenetic tree reconstruction, especially when the sequencing data has a high error rate.
A subset of fixed-length sequences is said to be
git clone https://github.com/Shao-Group/lsbucketing.git
cd lsbucketing
make
- To generate buckets for all length-$n$ sequences, run `./assignBuckets.out n`, where `n` is the length of the sequences. The results are written to a file named `buckets-n.txt`. This program also verifies the correctness of the efficient algorithm that generates buckets for a specific sequence without a global counter. This algorithm is provided as the function `assignBuckets` in `src/assignBuckets.c`.
- To generate a $(1,1)$-guaranteed subset, run `./genSampleD1.out n`, where `n` is the length of the sequences. The results are written to a file named `n01.sample`. The first line of the file is the number of length-$n$ sequences in this subset, which equals $4^{n-1}$ for the default alphabet {A, C, G, T}. Each remaining line contains one sequence.
- `./LSB-statistics.out` can be used to reproduce the experimental results in our paper. It takes three parameters:
  - `n`: integer, the length of the sequences.
  - `r`: integer, the radius of the neighborhood of a sequence.
  - `w|s`: char; the option `w` uses all length-$n$ sequences as the bucketing set, and the option `s` uses a $(1,1)$-guaranteed subset. The program uses the efficient membership query function `isInSampleD1` defined in `lib/util.h`, so the $(1,1)$-guaranteed subset does not need to be generated explicitly.
Results for three LSB functions are given in the paper; the corresponding commands are:

    ./LSB-statistics.out 20 1 w
    ./LSB-statistics.out 20 1 s
    ./LSB-statistics.out 20 2 s

The program writes to standard output, which can be redirected to a file:

    ./LSB-statistics.out 20 1 w > output.txt &
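
The layout of the file written by `./genSampleD1.out` (a count on the first line, then one sequence per line) can be sanity-checked with a few lines of C. The following is a minimal sketch, not code from this repository: the function name `check_sample_file` and its return codes are hypothetical, and it assumes the default alphabet {A, C, G, T} and $n \le 31$ so that $4^{n-1}$ fits in 64 bits. It only checks the file layout and the declared count; it does not verify that the subset is $(1,1)$-guaranteed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the repository).
 * Returns 0 if the file at `path` has the layout described above for
 * sequences of length n, and a nonzero code otherwise:
 *   1 = cannot open, 2 = bad or unexpected count line,
 *   3 = malformed sequence, 4 = number of sequences differs from count. */
int check_sample_file(const char *path, int n)
{
    assert(n >= 1 && n <= 31);            /* keeps 4^(n-1) within 64 bits */

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return 1;

    unsigned long long expected = 1ULL << (2 * (n - 1));   /* 4^(n-1) */
    unsigned long long declared = 0, seen = 0;
    int status = 0;

    if (fscanf(fp, "%llu", &declared) != 1 || declared != expected) {
        status = 2;
    } else {
        char seq[64];
        /* %63s skips whitespace and reads one token per sequence line. */
        while (fscanf(fp, "%63s", seq) == 1) {
            /* Each sequence must have length n and use only A, C, G, T. */
            if (strlen(seq) != (size_t)n || strspn(seq, "ACGT") != (size_t)n) {
                status = 3;
                break;
            }
            seen++;
        }
        if (status == 0 && seen != declared)
            status = 4;
    }

    fclose(fp);
    return status;
}
```

To use it, call `check_sample_file` from a small `main` with the path of the generated sample file and the same `n` that was passed to `./genSampleD1.out`; a return value of 0 means the count line and every sequence line are consistent with the description above.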