-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE REQUEST] Profiling of secondary mapping and read with no sequence in bam files #2369
Comments
Well, the solution cannot simply be to remove the above condition about query_sequence. Rather we should use pysam to retrieve the read's sequence from the primary alignment. I made a test case with a short contig that I duplicated and introduced some minor changes (to get SNVs and INDEL). Reads should map multiple times and we will have secondary mapping in the resulting bam. The test case consists of a CONTIGS.db and the mapping.bam (and bam.bai). Mapping done with minimap2 default parameters. |
Just for posterity, this is what does this: diff --git a/anvio/bamops.py b/anvio/bamops.py
index d70d046ae..39de3ac64 100644
--- a/anvio/bamops.py
+++ b/anvio/bamops.py
@@ -225,6 +225,13 @@ class BAMFileObject(pysam.AlignmentFile):
pysam.AlignmentFile.__init__(self)
+ # Fetch sequences for all primary alignments in the BAM file
+ self.primary_alignments = {}
+ for read in self.fetch():
+ # If this read is a primary alignment, store it in the dictionary
+ if not read.is_secondary:
+ self.primary_alignments[read.query_name] = read.query_sequence
+
def fetch_only(self, contig_name, start=None, end=None, *args, **kwargs):
"""A wrapper function for `bam.fetch()`.
@@ -249,6 +256,9 @@ class BAMFileObject(pysam.AlignmentFile):
"""
for read in self.fetch_only(contig_name, start, end, *args, **kwargs):
+ if read.is_secondary:
+ read.query_sequence = self.primary_alignments[read.query_name]
+
if read.is_unmapped or read.cigartuples is None or read.query_sequence is None:
# This read either has no associated cigar string or no query sequence. If cigar
# string is None, this means it did not align but is in the BAM file anyways, or the |
Wild guess: should we modify fetch_only and not fetch_and_trim? |
Also, functions likes get_short_reads_dict will contains duplicated reads if secondary mapping was present during the profiling. Will impact anvi-get-short-reads* type of command line. |
We can add a new flag to |
There is a memory happy way to solve this, and I will do that along with a flag. but now I want to commit these changes and see HOW MUCH LONGER do they take as you try to re-run the profiler for your figures, @FlorianTrigodet? |
Because the memory happy way will take even longer than that. |
I don't see the INDEL in contig 2 ~13kbp, any idea why? |
I've been trying to figure that out, but I think it will require me much more time than what I have at the moment. At least COVs / SNVs / SCVs / SAAVs will work. I will start a branch and will ping you. |
OK. https://github.com/merenlab/anvio/tree/secondary-alignment-tracking is ready for you to play :) |
There is an issue when I run it with a large bam file:
And here is the memory usage for that job:
While I COULD increase the memory, I don't think that's the solution here. |
I didn't have a chance to work on my solution to make it memory friendly yet. It will be there hopefully today perhaps. |
OK. A new PR at #2371 solves the memory issues. But it comes with the following note: In its current state, the PR is partially complete as there is a known bug in it that in some cases leads to the following error when profiling SNVs:
Solution for this requires insights into whether we should carry over more information into the (...)
for read in self.fetch(contig_name, start, end):
if self.secondary_alignments_tracked and read.is_secondary:
read.query_sequence = self.primary_alignments[read.query_name]
(...) We update query sequence, but probably we need to update one or more things to address the SNVs error. Until then, using the flag |
The need
I was profiling a bam file when I noticed that a massive amount of my reads were not used:
That number of reads corresponded to reads with the secondary mapping flag and minimap2 does not report the sequence for secondary mapping by default (to save some space I assume). BUT I WANT MY SECONDARY MAPPING TO BE IN MY ANVI'O PROFILE.
The solution
First, I found this in bamops.py:
And removing the check for read.query_sequence was not a problem for the profiler. But I don't know if there are unforeseen consequences for removing that condition: does snv calling need the read sequence? (apparently not since it just worked, but some expert can comment on it).
The solution could be to profile everything in the bam that is given and does take a decision on behalf of the user, and then we would have a fetch filter for only primary to skip the secondary mapping.
Beneficiaries
Often, you probably don't want to keep secondary mapping, i.e. reads mapping a second (or more) time with a alignment quality as good or within a threshold of the primary mapping. They are going to inflate coverage with potentially non-specific read recruitment in the case where the secondary mapping has a lower alignment score to the primary. BUT, the user should be able to choose to keep them or not. They may not realize that there bam contains such secondary mapping of various quality to the primary alignment (looking at you minimap2 - default threshold is 80% of primary alignment score), but they should be able to keep them or not.
The text was updated successfully, but these errors were encountered: