-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slow queries for human genome #13
Comments
@simonepignotti so the output of prophex filter should be exactly the same as for prophex query, the only difference is that we can stop querying k-mers if threshold is reached? |
concerning second idea: I suggest to stop retrieving positions using sa2pos if number of matched nodes is equal to the total number of nodes. It will cover your case, when there is only one node, but possibly can improve speed in other situations too |
The best would be to output the reads passing the filter, and discard those which don't. It would not make much sense to use the normal format I guess, since we won't have the information about all k-mers and the output would be incomplete and useless. The threshold should be a float
I think it's a good idea. @karel-brinda , is there any reason why this could go wrong? EDIT: even if we implement your suggestion, I think we should keep checking for contigs' borders. I imagine the situation of using prophex to index the output unitigs of BCalm or other DBG tools, which can be very fragmented. That would produce a lot of false positives... |
We need to keep all reads, even those that don't pass.
I think it would be good to keep the normal (Kraken-like) format and just slightly adjust it. For instance, one of the Kraken columns is for classified / unclassified – this would perfectly serve our purposes.
I don't completely understand why we can do that. Could you please provide some additional explanation? |
In this case we also need to provide a script to perform the actual filtering (retrieve the name of "classified" reads in Prophex' output and output the raw reads). What about doing both at the same time, instead? Normal output in one file, filtered reads in another one.
If I understood correctly, once every node (or ref) has been matched, it is useless to continue querying the index for the same k-mer (the output list won't change, since it already includes every node). When there is only one node, as in the case of the human genome, this means to stop querying the index for that k-mer after its first occurrence, which is exactly what we want. |
yes, you are right, Simone |
@simonepignotti Do you remember what was the k-mer size? |
A quick update: the problem are not repetitive k-mers, but something else. |
@simonepignotti How did you make the gprof plots? I would like to do the same also with bwa fastmap to see what's so different. |
I did extensive testing and profiling. 1) Some reads are much worse than others. Some reads take several seconds to be processed by ProPhex (time is in seconds):
Example:
2) Within the bad reads some k-mers are worse than others Timing
K-mers:
When queried in isolation:
3) ProPhex vs BWA fastmap profiling with this k-mer (queried 100x) ProPhex
BWA fastmap
Graph for ProPhex: |
Profiling results for ProPhex vs. BWA fastmap for the worst k-mer (queried 100x): |
Suffix array for the k-mer from above (
This corresponds to the true number of occurrences of this k-mer:
This means that |
I am pretty sure it was k=25, but the tests for other values of k were comparable
I used https://github.com/jrfonseca/gprof2dot but I see you figured it out, sorry for the late reply!
I am sure we can implement a similar trick, but if I remember correctly the EDIT: see #16 and #22 for more details about the |
Now I've finished a profiling experiment with simplitigs. Unfortunately, I couldn't reproduce the results from the plot, where all 3 modes had almost the same times (here, the rolling window is the fastest as it should be). I took Fastmap:
Prophex-rolling
Prophex-rolling
I'm attaching the profile outputs: profile_fastmap_reads100x.complete.pdf |
The most important observation from today: Even with simplitigs, there are k-mers for which querying is super slow. Again, it's due to very slow SA->master string coordinate conversion, due to a too large SA interval, no idea why. For instance, the k-mer TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT has a suffix array interval of length 12878 (see the attachment). I have absolutely no idea how this can be possible. The experiment was done with the simplitig index, i.e., this k-mer is present only once in the FASTA. Moreover, even if I merge all the simplitigs into a single one, this k-mer still appears only once (i.e., it's not due to fake k-mers on the borders). @salikhov-kamil Kamil, do you have any idea about what could be going on? I'm attaching a detailed ProPhex log. |
Interestingly, 32mers with all possible extension are matched in the k=32 mode exactly as they should be (31T_ACTG.txt). The only explanation I can come up with is that BWA uses some weird stretches of sequences between contigs. @salikhov-kamil What do you think? |
Querying human reads to the human genome (ref) is very slow (~500 RPM, a 1000x slowdown compared to querying bacterial reads to the same index).
Technical reason
This is due to highly repeted k-mers in the index, causing very large portions of the SA to be checked by the function
bwa_sa2pos
. Re-assembling withprophyle_assembler
before creating the index brings the speed up to 25k RPM.The number of calls to each function, obtained with
gprof
while querying 1000 simulated reads with (_compact) and without (_repeat) pre-assembly, are attached.human_compact.pdf
human_repeat.pdf
Suggested solutions
Since the main use of this would be read filtering, one possible solution is to implement a new command
prophex filter
which stops querying k-mers as soon as a given threshold of k-mer hits is reached.Furthermore, we could only check that the SA interval is non-empty instead of retrieving the position, but this would create some false positives due to k-mers on the border of two contigs.
Alternatively, once the interval is computed, we can stop retrieving the positions in the text as soon as one k-mer is verified using
bwa2pos
.The text was updated successfully, but these errors were encountered: