-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flexible exon numbering #4471
Flexible exon numbering #4471
Conversation
Validate exon order by base-pair position instead of number attribute.
src/transcriptome.hpp
Outdated
/// Checks whether any adjacent exons are out of (strictly increasing) order | ||
bool has_incorrect_order_exons(const vector<Exon> & exons) const; | ||
|
||
/// Checks whether any adjacent exons overlap. | ||
bool has_overlapping_exons(const vector<Exon> & exons) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These both assume that the exons have already been through reorder_exons()
and are now supposed to be in increasing coordinate order, right? Maybe the comments here should specify that so that people know how to call them properly?
I guess "out of (strictly increasing) order" is meant to convey that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put more comments anyways.
autoindex has difficulty with transcript ID, maybe this will fix it.
Now the one passing transcript is the second one. The first breaks because the exons are simply mixed up (even though their attribute numbers are correct), and the last breaks because forward strand genes can't be reversed.
I changed a unit test because it was enforcing the old logic. |
transcript3 strand
Unit test changed as so:
|
Changelog Entry
To be copied to the draft changelog by merger:
.gff3
file withvg rna
, validate exon ordering by base-pair position instead of number attribute. This allows reverse-strand exons to be numbered either by base-pair order or transcription order.Description
When parsing a
.gff3
file several validation checks are done. One check is for whether the exons are out of order. This check is inflexible on exon numbering: all exons must be numbered in the order they appear in the file. This pull request relaxes the ordering requirement to only require base-pair order. Reverse-strand exons may appear in transcription order or base-pair order, as before.This change is backwards compatible. All transcripts successfully parsed previously will still be parsed.
Currently, within a transcript, exons numbers (as determined by attribute) must increase as the parser reads down the file. However, genes on the reverse complement strand may have their exons numbered by order of transcription instead of position. Given a file where exons are sorted by position within a transcript (as is common), this means the parser will encounter the exons in reverse order, e.g. seeing exon 4, 3, 2, and 1. Such a transcript will be excluded as having "incorrect exon order". Therefore, all reverse-strand genes using transcription-order numbering are excluded.
A secondary, related bug affects forward-strand genes. Currently, to determine if a given exon is in order, its number attribute is compared to the total number of exons parsed so far for its transcript. This comparison must consider whether exons use 0-based or 1-based numbering. The parser decides on the numbering system by checking if the first parsed exon is number 0 or not. If not, then the file is assumed to use 1-based numbering. However, if the file uses 0-based numbering, but the first exon is from a reverse-strand transcript with multiple exons, then its exon number will be greater than 0. With the wrong numbering system, even forward-strand exon numbers will be off by exactly one. In this case, all forward-strand genes may be excluded on the basis of a faulty assumption.
This pull request bypasses both bugs by simply ignoring exon number. Correct order is still validated, but via base-pair position. Forward-strand exons must be in increasing base-pair order, whereas reverse-strand exons may be in either base-pair or transcription order. The pre-existing function
reorder_exons()
flips a list of reverse-strand exons if they are in reverse order. A new function,has_incorrect_order_exons()
, performs the exon-order validation. Following the precedent ofhas_overlapping_exons()
, this validation is performed after all transcripts have been parsed and thus all exons are known. Finally, because by the point of validation all exons may be assumed to be in increasing order (reorder_exons()
is used before validation), the code inhas_overlapping_exons()
is simplified to only deal with the case of increasing order.