-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ancestral): Add optional Nextclade GFF3 compatibility mode #1664
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1664 +/- ##
=======================================
Coverage 72.25% 72.25%
=======================================
Files 79 79
Lines 8276 8320 +44
Branches 1691 1707 +16
=======================================
+ Hits 5980 6012 +32
- Misses 2011 2014 +3
- Partials 285 294 +9 ☔ View full report in Codecov by Sentry. |
Sorry for the noise closing, I typed the wrong |
Until this PR, it was hard to use augur ancestral with a genome annotation gff3 from a Nextclade dataset for multiple reasons: - The GFF reader in augur looks for "gene" features only, while Nextclade 3 uses primarily CDS (gene only for backwards compatibility) - Augur's GFF reader doesn't support compound features - It takes the "gene"/"feature" name in the order "gene" > "gene_name" > "locus_tag", which differs from Nextclade's order This meant that users had to create a Genbank file, just for the purpose of running augur ancestral. That was doable but tedious. This PR solves the problem by adding a new GFF reader mode that is designed to be as compatible with Nextclade as possible. That means: - It looks for CDS features (for simplicity it ignores Nextclade's gene backwards compatibility) - It supports compound features - It takes the feature name in the exact same order as Nextclade The new mode is activated by passing the `--nextclade-compatibility` flag to augur ancestral. It does no harm to anyone who does not want to use it. It results in some code duplication, but I think it's totally worth it for the sake of usability. I have already started using it in some of my private builds and it works like a charm.
ec82215
to
566150e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be useful to have. reduces friction using nextclade datasets as annotation.
alternatively, we could use the nextclade conversion of the gff to json and parse the json, but this here should cover the majority of use cases.
One nitpick, just based on the description of the work: if the flag is going to be |
Good point, this is not 100% compatible, so maybe best to not mention Nextclade so explicitly in the arg name, i.e. use a different name. Proposals welcome! Nextclade only looks at the genes if there are no cdses present at all. If someone tries to use a gff like that, we could error if the flag is present. |
Maybe something like |
If there are multiple records, raise AugurError. | ||
Optionally, limit the GFF types to parse. | ||
""" | ||
from BCBio import GFF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any sort of "all imports at top of file" convention or rule?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no code style enforcement in this codebase. Historically, each subcommand has different authors with varying styles, so there is also no convention.
For this line in particular, I think it's fine to scope the import to the function if no other functions in utils.py
need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much do top-of-file imports contribute to the painful loading times @victorlin? We're essentially importing every dependency in order to show --help
aren't we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jameshadfield good point and you're right. I wrote more in #472 (comment)
Description of proposed changes
Until this PR, it was hard to use augur ancestral with a genome annotation gff3 from a Nextclade dataset for multiple reasons:
This meant that users had to create a Genbank file, just for the purpose of running augur ancestral.
That was doable but tedious.
This PR solves the problem by adding a new GFF reader mode that is designed to be as compatible with Nextclade as possible. That means:
The new mode is activated by passing the
--use-nextclade-gff-parsing
flag to augur ancestral.It does no harm to anyone who does not want to use it.
It results in some code duplication, but I think it's totally worth it for the sake of usability.
I have already started using it in some of my private builds and it works like a charm.
Feedback requested
I'm not sure about the best name for the flag. Even I can't remember whether I chose
--nextclade-compatible
or--nextclade-compatibility
. Maybe I should shorten to--nextclade-compat
, or use--nextclade-gff-mode
instead?For now I've chosen
--use-nextclade-gff-parsing
but there's scope for bike shedding.Alternatives considered
Work around with other tooling to generate gb file from gff
I've tried various things to work around the issue with Nextclade GFF not working with augur ancestral. Around a year ago I made a gff-to-genbank CLI pip installable (https://github.com/corneliusroemer/gff-to-genbank) - I considered reusing it for my current project but ended up aborting that approach as it ended up being complicated: while I could write compound locations correctly, I still needed to do
sed
replacements of the qualifiers, so that spurious locus_tags in the GFF wouldn't end up getting priority over the feature names that Nextclade would use. I had to do yet anothersed
replacement on the produced genbank file to replace the "region" feature with "source".The solution proposed here is better in that it doesn't require users to install another tool, doesn't require any extra sed replacements.
Yes it does cause a bit of code duplication, and users need to know about the special
--nextclade-compatible
flag, but I just can't think of a better solution - this problem has existed for more than a year and 3 weeks we didn't seem any closer to a solution.Related issue(s)
addresses the point raised in #1655
Future work
There's a small follow-up PR that further increases usability of ancestral for translations: #1668 - it drops the requirement to pass
--genes
and by default uses all genes.Checklist