Flexible exon numbering #4471

faithokamoto · 2024-12-06T22:36:13Z

Changelog Entry

To be copied to the draft changelog by merger:

When reading a .gff3 file with vg rna, validate exon ordering by base-pair position instead of number attribute. This allows reverse-strand exons to be numbered either by base-pair order or transcription order.

Description

When parsing a .gff3 file, several validation checks are done. One of them checks whether the exons are out of order. This check is inflexible on exon numbering: all exons must be numbered in the order they appear in the file. This pull request relaxes the ordering requirement to only require base-pair order. Reverse-strand exons may appear in transcription order or base-pair order, as before.

This change is backwards compatible. All transcripts successfully parsed previously will still be parsed.

Currently, within a transcript, exon numbers (as determined by attribute) must increase as the parser reads down the file. However, genes on the reverse complement strand may have their exons numbered by order of transcription instead of position. Given a file where exons are sorted by position within a transcript (as is common), this means the parser will encounter the exons in reverse numeric order, e.g. seeing exon 4, 3, 2, and 1. Such a transcript will be excluded as having "incorrect exon order". Therefore, all reverse-strand genes using transcription-order numbering are excluded.
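The failure described above can be reproduced with a minimal sketch. `ExonRecord`, `numbers_increase_in_file_order()` and `reverse_strand_transcription_numbered()` are hypothetical names made up for illustration, and the rule is a simplification of the old check, not vg's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one exon line of a .gff3 file; not vg's own type.
struct ExonRecord {
    int start;   // first base-pair position of the exon
    int end;     // last base-pair position of the exon
    int number;  // value of the exon number attribute
};

// Simplified version of the old rule: within one transcript, exon numbers
// must strictly increase in the order the exons are read from the file.
bool numbers_increase_in_file_order(const std::vector<ExonRecord>& exons) {
    for (std::size_t i = 1; i < exons.size(); ++i) {
        if (exons[i].number <= exons[i - 1].number) {
            return false;
        }
    }
    return true;
}

// A reverse-strand transcript sorted by position but numbered in
// transcription order, so the parser sees exons 4, 3, 2, 1.
std::vector<ExonRecord> reverse_strand_transcription_numbered() {
    return {{100, 200, 4}, {300, 400, 3}, {500, 600, 2}, {700, 800, 1}};
}
```

Passing the example transcript to `numbers_increase_in_file_order()` returns false even though its exons are in perfectly good base-pair order, which is the case this pull request stops rejecting.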

A secondary, related bug affects forward-strand genes. Currently, to determine if a given exon is in order, its number attribute is compared to the total number of exons parsed so far for its transcript. This comparison must consider whether exons use 0-based or 1-based numbering. The parser decides on the numbering system by checking if the first parsed exon is number 0 or not. If not, then the file is assumed to use 1-based numbering. However, if the file uses 0-based numbering, but the first exon is from a reverse-strand transcript with multiple exons, then its exon number will be greater than 0. With the wrong numbering system, even forward-strand exon numbers will be off by exactly one. In this case, all forward-strand genes may be excluded on the basis of a faulty assumption.
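A small sketch of how the wrong numbering guess propagates. The function names are hypothetical, and the equality comparison is an assumption about the shape of the old check based only on the description above.

```cpp
#include <cassert>

// Guess the numbering base from the first exon number seen in the file:
// 0 means 0-based, anything else is taken to mean 1-based.
int infer_numbering_base(int first_exon_number_in_file) {
    return first_exon_number_in_file == 0 ? 0 : 1;
}

// Assumed form of the old check: an exon is in order when its number equals
// the count of exons already parsed for its transcript plus the numbering base.
bool matches_parsed_count(int exon_number, int exons_parsed_so_far, int numbering_base) {
    return exon_number == exons_parsed_so_far + numbering_base;
}
```

Suppose a 0-based file opens with a reverse-strand exon numbered 2. `infer_numbering_base(2)` yields 1, so the first exon of a later forward-strand transcript (number 0, with 0 exons parsed so far) fails `matches_parsed_count(0, 0, 1)`, and every following exon is off by one in the same way.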

This pull request bypasses both bugs by simply ignoring exon number. Correct order is still validated, but via base-pair position. Forward-strand exons must be in increasing base-pair order, whereas reverse-strand exons may be in either base-pair or transcription order. The pre-existing function reorder_exons() flips a list of reverse-strand exons if they are in reverse order. A new function, has_incorrect_order_exons(), performs the exon-order validation. Following the precedent of has_overlapping_exons(), this validation is performed after all transcripts have been parsed and thus all exons are known. Finally, because by the point of validation all exons may be assumed to be in increasing order (reorder_exons() is used before validation), the code in has_overlapping_exons() is simplified to only deal with the case of increasing order.
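The position-based flow can be sketched roughly as follows. `Span` and the `_sketch` functions are hypothetical simplifications that only mirror the roles of `reorder_exons()`, `has_incorrect_order_exons()` and `has_overlapping_exons()`; in particular, the condition used to decide when to flip a reverse-strand list is an assumption, not taken from the vg source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical exon coordinates (inclusive); not vg's Exon type.
struct Span {
    int start;
    int end;
};

// Flip a reverse-strand exon list that appears to be in transcription
// (decreasing) order. Assumption: comparing the first and last exon is
// enough to detect that ordering.
void reorder_exons_sketch(std::vector<Span>& exons, bool is_reverse) {
    if (is_reverse && exons.size() > 1 &&
        exons.front().start > exons.back().start) {
        std::reverse(exons.begin(), exons.end());
    }
}

// True if any adjacent pair is not in strictly increasing start order.
bool has_incorrect_order_exons_sketch(const std::vector<Span>& exons) {
    for (std::size_t i = 1; i < exons.size(); ++i) {
        if (exons[i].start <= exons[i - 1].start) {
            return true;
        }
    }
    return false;
}

// True if any adjacent pair overlaps; only the increasing-order case
// needs handling once reordering has already run.
bool has_overlapping_exons_sketch(const std::vector<Span>& exons) {
    for (std::size_t i = 1; i < exons.size(); ++i) {
        if (exons[i].start <= exons[i - 1].end) {
            return true;
        }
    }
    return false;
}
```

Under these assumptions, a reverse-strand list given in transcription order is flipped and then passes both checks, a forward-strand list in decreasing order is left alone and flagged, and a shuffled list is flagged on either strand.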

adamnovak · 2024-12-06T22:44:29Z

src/transcriptome.hpp

+	/// Checks whether any adjacent exons are out of (strictly increasing) order
+	bool has_incorrect_order_exons(const vector<Exon> & exons) const;
+
        /// Checks whether any adjacent exons overlap.
        bool has_overlapping_exons(const vector<Exon> & exons) const;


These both assume that the exons have already been through reorder_exons() and are now supposed to be in increasing coordinate order, right? Maybe the comments here should specify that so that people know how to call them properly?

I guess "out of (strictly increasing) order" is meant to convey that?

I put more comments anyways.

faithokamoto · 2024-12-07T00:50:32Z

I changed a unit test because it was enforcing the old logic.

faithokamoto · 2024-12-07T04:22:10Z

Unit test changed as follows:

transcript1 now fails not because the exon numbers are out of order, but because the exon positions are out of order. This is consistent with my new logic and also makes more biological sense. What we actually care about is the order in which exons are located in order to draw edges between them, not what humans numbered them afterwards. RNA pol II moves in base pair position order.
transcript2 now fails because its exon positions are out of order. This test is distinct from transcript1 because it is on the reverse strand.
transcript3 succeeds as before.
transcript4 fails because its exons are in reverse order and it is a forward strand gene. This behavior stays the same as before.

faithokamoto added 2 commits December 6, 2024 14:06

Make exon numbering more flexible

7ab810e

Validate exon order by base-pair position instead of number attribute.

new function for order validation

38095a5

adamnovak reviewed Dec 6, 2024

faithokamoto added 3 commits December 6, 2024 15:04

add assumption

e07d9b7

Restore original transcript ID parsing

49a537a

autoindex has difficulty with transcript ID, maybe this will fix it.

Update unit test to new logic

553cf20

Now the one passing transcript is the second one. The first breaks because the exons are simply mixed up (even though their attribute numbers are correct), and the last breaks because forward strand genes can't be reversed.

faithokamoto added 3 commits December 6, 2024 17:11

use original passing transcript

11904da

oops

105e586

actually break transcript2

oops take 2

9711c5a

transcript3 strand

adamnovak merged commit d5859e1 into master Dec 9, 2024
2 checks passed
