Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flexible exon numbering #4471

Merged
merged 8 commits into from
Dec 9, 2024
Merged

Flexible exon numbering #4471

merged 8 commits into from
Dec 9, 2024

Conversation

faithokamoto
Copy link
Contributor

Changelog Entry

To be copied to the draft changelog by merger:

  • When reading a .gff3 file with vg rna, validate exon ordering by base-pair position instead of number attribute. This allows reverse-strand exons to be numbered either by base-pair order or transcription order.

Description

When parsing a .gff3 file several validation checks are done. One check is for whether the exons are out of order. This check is inflexible on exon numbering: all exons must be numbered in the order they appear in the file. This pull request relaxes the ordering requirement to only require base-pair order. Reverse-strand exons may appear in transcription order or base-pair order, as before.

This change is backwards compatible. All transcripts successfully parsed previously will still be parsed.

Currently, within a transcript, exons numbers (as determined by attribute) must increase as the parser reads down the file. However, genes on the reverse complement strand may have their exons numbered by order of transcription instead of position. Given a file where exons are sorted by position within a transcript (as is common), this means the parser will encounter the exons in reverse order, e.g. seeing exon 4, 3, 2, and 1. Such a transcript will be excluded as having "incorrect exon order". Therefore, all reverse-strand genes using transcription-order numbering are excluded.

A secondary, related bug affects forward-strand genes. Currently, to determine if a given exon is in order, its number attribute is compared to the total number of exons parsed so far for its transcript. This comparison must consider whether exons use 0-based or 1-based numbering. The parser decides on the numbering system by checking if the first parsed exon is number 0 or not. If not, then the file is assumed to use 1-based numbering. However, if the file uses 0-based numbering, but the first exon is from a reverse-strand transcript with multiple exons, then its exon number will be greater than 0. With the wrong numbering system, even forward-strand exon numbers will be off by exactly one. In this case, all forward-strand genes may be excluded on the basis of a faulty assumption.

This pull request bypasses both bugs by simply ignoring exon number. Correct order is still validated, but via base-pair position. Forward-strand exons must be in increasing base-pair order, whereas reverse-strand exons may be in either base-pair or transcription order. The pre-existing function reorder_exons() flips a list of reverse-strand exons if they are in reverse order. A new function, has_incorrect_order_exons(), performs the exon-order validation. Following the precedent of has_overlapping_exons(), this validation is performed after all transcripts have been parsed and thus all exons are known. Finally, because by the point of validation all exons may be assumed to be in increasing order (reorder_exons() is used before validation), the code in has_overlapping_exons() is simplified to only deal with the case of increasing order.

Validate exon order by base-pair position instead of number attribute.
Comment on lines 298 to 302
/// Checks whether any adjacent exons are out of (strictly increasing) order
bool has_incorrect_order_exons(const vector<Exon> & exons) const;

/// Checks whether any adjacent exons overlap.
bool has_overlapping_exons(const vector<Exon> & exons) const;
Copy link
Member

@adamnovak adamnovak Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These both assume that the exons have already been through reorder_exons() and are now supposed to be in increasing coordinate order, right? Maybe the comments here should specify that so that people know how to call them properly?

I guess "out of (strictly increasing) order" is meant to convey that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put more comments anyways.

autoindex has difficulty with transcript ID, maybe this will fix it.
Now the one passing transcript is the second one. The first breaks because the exons are simply mixed up (even though their attribute numbers are correct), and the last breaks because forward strand genes can't be reversed.
@faithokamoto
Copy link
Contributor Author

I changed a unit test because it was enforcing the old logic.

actually break transcript2
transcript3 strand
@faithokamoto
Copy link
Contributor Author

Unit test changed as so:

  • transcript1 now fails not because the exon numbers are out of order, but because the exon positions are out of order. This is consistent with my new logic and also makes more biological sense. What we actually care about is the order in which exons are located in order to draw edges between them, not what humans numbered them afterwards. RNA pol II moves in base pair position order.
  • transcript2 now fails because its exon positions are out of order. This test is distinct from transcript1 because it is on the reverse stand.
  • transcript3 succeeds as before
  • transcript4 fails because its exons are in reverse order and it is a forward strand gene. This behavior stays the same as before.

@adamnovak adamnovak merged commit d5859e1 into master Dec 9, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants