Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow queries for human genome #13

Open
simonepignotti opened this issue May 1, 2018 · 16 comments
Open

Slow queries for human genome #13

simonepignotti opened this issue May 1, 2018 · 16 comments
Labels

Comments

@simonepignotti
Copy link
Member

simonepignotti commented May 1, 2018

Querying human reads to the human genome (ref) is very slow (~500 RPM, a 1000x slowdown compared to querying bacterial reads to the same index).

Technical reason

This is due to highly repeted k-mers in the index, causing very large portions of the SA to be checked by the function bwa_sa2pos. Re-assembling with prophyle_assembler before creating the index brings the speed up to 25k RPM.

The number of calls to each function, obtained with gprof while querying 1000 simulated reads with (_compact) and without (_repeat) pre-assembly, are attached.
human_compact.pdf
human_repeat.pdf

Suggested solutions

Since the main use of this would be read filtering, one possible solution is to implement a new command prophex filter which stops querying k-mers as soon as a given threshold of k-mer hits is reached.

Furthermore, we could only check that the SA interval is non-empty instead of retrieving the position, but this would create some false positives due to k-mers on the border of two contigs.
Alternatively, once the interval is computed, we can stop retrieving the positions in the text as soon as one k-mer is verified using bwa2pos.

@salikhov-kamil
Copy link
Member

@simonepignotti so the output of prophex filter should be exactly the same as for prophex query, the only difference is that we can stop querying k-mers if threshold is reached?

@salikhov-kamil
Copy link
Member

concerning second idea: I suggest to stop retrieving positions using sa2pos if number of matched nodes is equal to the total number of nodes. It will cover your case, when there is only one node, but possibly can improve speed in other situations too

@simonepignotti
Copy link
Member Author

simonepignotti commented May 2, 2018

the output of prophex filter should be exactly the same as for prophex query, the only difference is that we can stop querying k-mers if threshold is reached?

The best would be to output the reads passing the filter, and discard those which don't. It would not make much sense to use the normal format I guess, since we won't have the information about all k-mers and the output would be incomplete and useless. The threshold should be a float 0 <= t <= 1, and all reads with #matches/len > t should appear in the output.

I suggest to stop retrieving positions using sa2pos if number of matched nodes is equal to the total number of nodes

I think it's a good idea. @karel-brinda , is there any reason why this could go wrong?

EDIT: even if we implement your suggestion, I think we should keep checking for contigs' borders. I imagine the situation of using prophex to index the output unitigs of BCalm or other DBG tools, which can be very fragmented. That would produce a lot of false positives...

@karel-brinda
Copy link
Member

karel-brinda commented May 2, 2018

The best would be to output the reads passing the filter

We need to keep all reads, even those that don't pass.

It would not make much sense to use the normal format I guess

I think it would be good to keep the normal (Kraken-like) format and just slightly adjust it. For instance, one of the Kraken columns is for classified / unclassified – this would perfectly serve our purposes.

I suggest to stop retrieving positions using sa2pos if number of matched nodes is equal to the total number of nodes

I think it's a good idea. @karel-brinda , is there any reason why this could go wrong?

I don't completely understand why we can do that. Could you please provide some additional explanation?

@simonepignotti
Copy link
Member Author

simonepignotti commented May 2, 2018

I think it would be good to keep the standard format

In this case we also need to provide a script to perform the actual filtering (retrieve the name of "classified" reads in Prophex' output and output the raw reads). What about doing both at the same time, instead? Normal output in one file, filtered reads in another one.

Could you please provide some additional explanation?

If I understood correctly, once every node (or ref) has been matched, it is useless to continue querying the index for the same k-mer (the output list won't change, since it already includes every node). When there is only one node, as in the case of the human genome, this means to stop querying the index for that k-mer after its first occurrence, which is exactly what we want.
In the case of an index with multiple nodes, this would not affect the correctness of the results, and it could eventually (but very rarely, and only when there are few nodes) speed up the query.
@salikhov-kamil is that correct?

@salikhov-kamil
Copy link
Member

yes, you are right, Simone

@karel-brinda
Copy link
Member

@simonepignotti Do you remember what was the k-mer size?

@karel-brinda
Copy link
Member

A quick update: the problem are not repetitive k-mers, but something else.

@karel-brinda
Copy link
Member

@simonepignotti How did you make the gprof plots? I would like to do the same also with bwa fastmap to see what's so different.

@karel-brinda
Copy link
Member

karel-brinda commented Apr 21, 2020

I did extensive testing and profiling.

1) Some reads are much worse than others. Some reads take several seconds to be processed by ProPhex (time is in seconds):

hg38@chr1_9000644_9001048_0:0:0_0:0:0_d3f] 6.230
hg38@chr1_144281574_144282209_2:0:0_0:0:0_3c8] 5.491
hg38@chr1_13983825_13984340_0:0:0_1:0:0_177] 5.460
hg38@chr1_109221543_109222069_1:0:0_0:0:0_b92] 5.379
hg38@chr1_67477732_67478302_3:0:0_1:0:0_1262] 4.180
hg38@chr1_10982750_10983187_0:0:0_5:0:0_e17] 4.055
hg38@chr1_202922392_202922823_1:0:0_1:0:0_c19] 3.927
hg38@chr1_168456983_168457470_1:0:0_0:0:0_ae8] 3.920
hg38@chr1_56701931_56702372_0:0:0_2:0:0_332] 3.811
...

Example:

hg38@chr1_45044284_45044790_0:0:0_0:0:0_5] 2.102

>hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTATATATATATATTTATATATATATATATATATATATATATATTTTTTTTTTTTTTTTTTTTTTTTGAGACGGAGTCTGGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCGATCTCGGCTCACTGCAAGCTCCGCCTTCCGGGTTCAC

2) Within the bad reads some k-mers are worse than others

Timing

39@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.000 sec
40@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.000 sec
41@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.000 sec
42@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.001 sec
43@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.002 sec
44@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.048 sec
45@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.056 sec
46@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.065 sec
47@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.075 sec
48@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.075 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.084 sec
50@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.002 sec
51@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.001 sec
52@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.001 sec
53@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.001 sec
54@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.001 sec
55@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.001 sec

K-mers:

>39@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
ATATATTTTTTTTTTTTTTTTTTTTTTTTGA
>40@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TATATTTTTTTTTTTTTTTTTTTTTTTTGAG
>41@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
ATATTTTTTTTTTTTTTTTTTTTTTTTGAGA
>42@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TATTTTTTTTTTTTTTTTTTTTTTTTGAGAC
>43@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
ATTTTTTTTTTTTTTTTTTTTTTTTGAGACG
>44@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTTTTTTTTGAGACGG
>45@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTTTTTTTGAGACGGA
>46@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTTTTTTGAGACGGAG
>47@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTTTTTGAGACGGAGT
>48@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTTTTGAGACGGAGTC
>49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTTTGAGACGGAGTCT
>50@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTTGAGACGGAGTCTG
>51@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTTGAGACGGAGTCTGG
>52@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTTGAGACGGAGTCTGGC
>53@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTTGAGACGGAGTCTGGCT
>54@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTTGAGACGGAGTCTGGCTC
>55@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5/1
TTTTTTTTTTTTTGAGACGGAGTCTGGCTCT

When queried in isolation:

49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.324 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.088 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.083 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.085 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.082 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.081 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.081 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.080 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.083 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.086 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.086 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.087 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.094 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.098 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.109 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.106 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.104 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.112 sec
49@hg38@chr1_45044284_45044790_0:0:0_0:0:0_5     0.120 sec

3) ProPhex vs BWA fastmap profiling with this k-mer (queried 100x)

ProPhex

 time   seconds   seconds    calls  ms/call  ms/call  name    
 86.58      6.90     6.90  2164600     0.00     0.00  bwt_sa
 12.92      7.93     1.03 67140800     0.00     0.00  bwt_occ
  0.25      7.95     0.02  2164600     0.00     0.00  bns_pos2rid
  0.13      7.96     0.01  2164600     0.00     0.00  bwa_sa2pos
  0.13      7.97     0.01  2164600     0.00     0.00  get_node_from_contig
  0.00      7.97     0.00     3100     0.00     0.00  bwt_2occ
  0.00      7.97     0.00      455     0.00     0.00  add_contig
  0.00      7.97     0.00      294     0.00     0.00  err_fread_noeof
  0.00      7.97     0.00      204     0.00     0.00  realtime
  0.00      7.97     0.00      202     0.00     0.00  ks_getuntil2
  0.00      7.97     0.00      200     0.00     0.00  strncat_with_check
  0.00      7.97     0.00      200     0.00     0.00  wrap_strdup
  0.00      7.97     0.00      116     0.00     0.00  wrap_malloc
  0.00      7.97     0.00      102     0.00     0.00  kseq_read

BWA fastmap

 time   seconds   seconds    calls  Ts/call  Ts/call  name    
  0.00      0.00     0.00     6000     0.00     0.00  bwt_occ4
  0.00      0.00     0.00     3000     0.00     0.00  bwt_2occ4
  0.00      0.00     0.00     3000     0.00     0.00  bwt_extend
  0.00      0.00     0.00      910     0.00     0.00  wrap_strdup
  0.00      0.00     0.00      294     0.00     0.00  err_fread_noeof
  0.00      0.00     0.00      202     0.00     0.00  fread_fix
  0.00      0.00     0.00      201     0.00     0.00  ks_getuntil2
  0.00      0.00     0.00      200     0.00     0.00  err_fputc
  0.00      0.00     0.00      200     0.00     0.00  err_printf
  0.00      0.00     0.00      200     0.00     0.00  err_puts
  0.00      0.00     0.00      200     0.00     0.00  smem_next
  0.00      0.00     0.00      101     0.00     0.00  kseq_read
  0.00      0.00     0.00      100     0.00     0.00  bwt_smem1a
  0.00      0.00     0.00      100     0.00     0.00  smem_set_query
  0.00      0.00     0.00       15     0.00     0.00  wrap_calloc

Graph for ProPhex:

image

@karel-brinda
Copy link
Member

Profiling results for ProPhex vs. BWA fastmap for the worst k-mer (queried 100x):
profile_fastmap_difficult_read_1_worst_kmer.complete.pdf
profile_prophex_difficult_read_1_worst_kmer.complete.pdf

@karel-brinda
Copy link
Member

karel-brinda commented Apr 21, 2020

Suffix array for the k-mer from above (TTTTTTTTTTTTTTTTTTTGAGACGGAGTCT):

[prophex:calculate_sa_interval .1] i=0 c=3 k=4539333982 l=6418572210 l-k=1879238228
[prophex:calculate_sa_interval .1] i=1 c=1 k=2761983427 l=3209286105 l-k=447302678
[prophex:calculate_sa_interval .1] i=2 c=3 k=5198118155 l=5339020178 l-k=140902023
[prophex:calculate_sa_interval .1] i=3 c=2 k=4323085135 l=4347238055 l-k=24152920
[prophex:calculate_sa_interval .1] i=4 c=0 k=1324382888 l=1331473074 l-k=7090186
[prophex:calculate_sa_interval .1] i=5 c=2 k=3491824866 l=3493578234 l-k=1753368
[prophex:calculate_sa_interval .1] i=6 c=2 k=3947466886 l=3948077429 l-k=610543
[prophex:calculate_sa_interval .1] i=7 c=1 k=2722352611 l=2722484491 l-k=131880
[prophex:calculate_sa_interval .1] i=8 c=0 k=830500363 l=830614701 l-k=114338
[prophex:calculate_sa_interval .1] i=9 c=2 k=3377186124 l=3377290508 l-k=104384
[prophex:calculate_sa_interval .1] i=10 c=0 k=1009703143 l=1009802683 l-k=99540
[prophex:calculate_sa_interval .1] i=11 c=2 k=3415289782 l=3415380212 l-k=90430
[prophex:calculate_sa_interval .1] i=12 c=3 k=5404768974 l=5404853733 l-k=84759
[prophex:calculate_sa_interval .1] i=13 c=3 k=6076954737 l=6077036093 l-k=81356
[prophex:calculate_sa_interval .1] i=14 c=3 k=6284397221 l=6284475869 l-k=78648
[prophex:calculate_sa_interval .1] i=15 c=3 k=6363431038 l=6363506407 l-k=75369
[prophex:calculate_sa_interval .1] i=16 c=3 k=6392481913 l=6392551992 l-k=70079
[prophex:calculate_sa_interval .1] i=17 c=3 k=6403378710 l=6403442947 l-k=64237
[prophex:calculate_sa_interval .1] i=18 c=3 k=6408038539 l=6408097103 l-k=58564
[prophex:calculate_sa_interval .1] i=19 c=3 k=6410415243 l=6410469074 l-k=53831
[prophex:calculate_sa_interval .1] i=20 c=3 k=6411930558 l=6411981473 l-k=50915
[prophex:calculate_sa_interval .1] i=21 c=3 k=6413051658 l=6413100140 l-k=48482
[prophex:calculate_sa_interval .1] i=22 c=3 k=6413932009 l=6413978095 l-k=46086
[prophex:calculate_sa_interval .1] i=23 c=3 k=6414668720 l=6414712340 l-k=43620
[prophex:calculate_sa_interval .1] i=24 c=3 k=6415299831 l=6415340952 l-k=41121
[prophex:calculate_sa_interval .1] i=25 c=3 k=6415840952 l=6415879019 l-k=38067
[prophex:calculate_sa_interval .1] i=26 c=3 k=6416302381 l=6416337123 l-k=34742
[prophex:calculate_sa_interval .1] i=27 c=3 k=6416692460 l=6416723634 l-k=31174
[prophex:calculate_sa_interval .1] i=28 c=3 k=6417019522 l=6417047105 l-k=27583
[prophex:calculate_sa_interval .1] i=29 c=3 k=6417293252 l=6417317753 l-k=24501
[prophex:calculate_sa_interval .1] i=30 c=3 k=6417523807 l=6417545452 l-k=21645

This corresponds to the true number of occurrences of this k-mer:

Karels-iMac:hg_oneline karel$ cat one_line_hg38.fna | grep -o TTTTTTTTTTTTTTTTTTTGAGACGGAGTCT | wcl
   10789
Karels-iMac:hg_oneline karel$ cat one_line_hg38.fna | grep -o AGACTCCGTCTCAAAAAAAAAAAAAAAAAAA | wcl
   10857

Sum = 21646

This means that bwa fastmap must be using some tricks to highly optimize the sa -> pos translation.

@simonepignotti
Copy link
Member Author

simonepignotti commented Apr 23, 2020

@simonepignotti Do you remember what was the k-mer size?

I am pretty sure it was k=25, but the tests for other values of k were comparable

@simonepignotti How did you make the gprof plots? I would like to do the same also with bwa fastmap to see what's so different.

I used https://github.com/jrfonseca/gprof2dot but I see you figured it out, sorry for the late reply!

This means that bwa fastmap must be using some tricks to highly optimize the sa -> pos translation.

I am sure we can implement a similar trick, but if I remember correctly the filter command implemented by @salikhov-kamil in https://github.com/prophyle/prophex/tree/kamil/prophex_filter should skip the sa -> pos translation when the only information needed is whether the index contains a k-mer or not. It's a different problem, but may be useful for some specific use cases like filtering human reads.
@karel-brinda, have you tried to compare prophex filter to bwa fastmap?

EDIT: see #16 and #22 for more details about the filter command

@karel-brinda
Copy link
Member

karel-brinda commented May 1, 2020

Now I've finished a profiling experiment with simplitigs. Unfortunately, I couldn't reproduce the results from the plot, where all 3 modes had almost the same times (here, the rolling window is the fastest as it should be).

I took hg38-simplitigs31.fa.gz from https://zenodo.org/record/3770419, prepended hg38@ to sequence names, took 100,000 Illumina 150 bp read, and profiled fastmap, prophex-rolling, and prophex-restarted.

Fastmap:

[main] Version: 0.7.15-r1142-dirty
[main] CMD: ../prophex/src/bwa/bwa fastmap -l 31 ../hg38-simplitigs31.fa.gz ../reads100x.31mers.fa
[main] Real time: 371.220 sec; CPU: 352.387 sec

Prophex-rolling

[prophex:query] match time: 28.87 sec
[prophex::query] Processed 100000 reads in 28.522 CPU sec, 28.871 real sec

Prophex-rolling

[prophex:query] match time: 15.44 sec
[prophex::query] Processed 100000 reads in 15.227 CPU sec, 15.440 real sec

I'm attaching the profile outputs:

profile_fastmap_reads100x.complete.pdf
profile_fastmap_reads100x.partial.pdf
profile_prophex-res_reads100x.complete.pdf
profile_prophex-res_reads100x.partial.pdf
profile_prophex-rol_reads100x.complete.pdf
profile_prophex-rol_reads100x.partial.pdf

@karel-brinda
Copy link
Member

karel-brinda commented May 2, 2020

The most important observation from today:

Even with simplitigs, there are k-mers for which querying is super slow. Again, it's due to very slow SA->master string coordinate conversion, due to a too large SA interval, no idea why. For instance, the k-mer TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT has a suffix array interval of length 12878 (see the attachment).

I have absolutely no idea how this can be possible. The experiment was done with the simplitig index, i.e., this k-mer is present only once in the FASTA. Moreover, even if I merge all the simplitigs into a single one, this k-mer still appears only once (i.e., it's not due to fake k-mers on the borders).

@salikhov-kamil Kamil, do you have any idea about what could be going on? I'm attaching a detailed ProPhex log.

hardest_kmer_SAs.txt

@karel-brinda
Copy link
Member

Interestingly, 32mers with all possible extension are matched in the k=32 mode exactly as they should be (31T_ACTG.txt). The only explanation I can come up with is that BWA uses some weird stretches of sequences between contigs. @salikhov-kamil What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants