feat(ancestral): Add optional Nextclade GFF3 compatibility mode #1664

corneliusroemer · 2024-11-09T02:45:49Z

Description of proposed changes

Until this PR, it was hard to use augur ancestral with a genome annotation gff3 from a Nextclade dataset for multiple reasons:

The GFF reader in augur looks for "gene" features only, while Nextclade 3 uses primarily CDS (gene only for backwards compatibility)
Augur's GFF reader doesn't support compound features
It takes the "gene"/"feature" name in the order "gene" > "gene_name" > "locus_tag", which differs from Nextclade's order

This meant that users had to create a Genbank file, just for the purpose of running augur ancestral.
That was doable but tedious.

This PR solves the problem by adding a new GFF reader mode that is designed to be as compatible with Nextclade as possible. That means:

It looks for CDS features (for simplicity it ignores Nextclade's fallback support of gene features)
It supports compound features
It takes the feature name in the exact same order as Nextclade

The new mode is activated by passing the --use-nextclade-gff-parsing flag to augur ancestral.

It does no harm to anyone who does not want to use it.

It results in some code duplication, but I think it's totally worth it for the sake of usability.
I have already started using it in some of my private builds and it works like a charm.

Feedback requested

I'm not sure about the best name for the flag. Even I can't remember whether I chose --nextclade-compatible or --nextclade-compatibility. Maybe I should shorten to --nextclade-compat, or use --nextclade-gff-mode instead?

For now I've chosen --use-nextclade-gff-parsing but there's scope for bike shedding.

Alternatives considered

Work around with other tooling to generate gb file from gff

I've tried various things to work around the issue with Nextclade GFF not working with augur ancestral. Around a year ago I made a gff-to-genbank CLI pip installable (https://github.com/corneliusroemer/gff-to-genbank) - I considered reusing it for my current project but ended up aborting that approach as it ended up being complicated: while I could write compound locations correctly, I still needed to do sed replacements of the qualifiers, so that spurious locus_tags in the GFF wouldn't end up getting priority over the feature names that Nextclade would use. I had to do yet another sed replacement on the produced genbank file to replace the "region" feature with "source".

The solution proposed here is better in that it doesn't require users to install another tool, doesn't require any extra sed replacements.

Yes it does cause a bit of code duplication, and users need to know about the special --nextclade-compatible flag, but I just can't think of a better solution - this problem has existed for more than a year and 3 weeks we didn't seem any closer to a solution.

Related issue(s)

addresses the point raised in #1655

Future work

There's a small follow-up PR that further increases usability of ancestral for translations: #1668 - it drops the requirement to pass --genes and by default uses all genes.

Checklist

Explain how it works in the docs: some annotation section
Automated checks pass
Check if you need to add a changelog message
Check if you need to add tests
Check if you need to update docs

codecov · 2024-11-09T23:20:12Z

Codecov Report

Attention: Patch coverage is 74.71264% with 22 lines in your changes missing coverage. Please review.

Project coverage is 72.25%. Comparing base (2e993df) to head (566150e).

Files with missing lines	Patch %	Lines
augur/utils.py	73.17%	9 Missing and 13 partials ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1664   +/-   ##
=======================================
  Coverage   72.25%   72.25%           
=======================================
  Files          79       79           
  Lines        8276     8320   +44     
  Branches     1691     1707   +16     
=======================================
+ Hits         5980     6012   +32     
- Misses       2011     2014    +3     
- Partials      285      294    +9

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

augur/utils.py

corneliusroemer · 2024-11-10T19:45:09Z

Sorry for the noise closing, I typed the wrong gh cli command, which closed instead of checked out the PR 🤦

Until this PR, it was hard to use augur ancestral with a genome annotation gff3 from a Nextclade dataset for multiple reasons: - The GFF reader in augur looks for "gene" features only, while Nextclade 3 uses primarily CDS (gene only for backwards compatibility) - Augur's GFF reader doesn't support compound features - It takes the "gene"/"feature" name in the order "gene" > "gene_name" > "locus_tag", which differs from Nextclade's order This meant that users had to create a Genbank file, just for the purpose of running augur ancestral. That was doable but tedious. This PR solves the problem by adding a new GFF reader mode that is designed to be as compatible with Nextclade as possible. That means: - It looks for CDS features (for simplicity it ignores Nextclade's gene backwards compatibility) - It supports compound features - It takes the feature name in the exact same order as Nextclade The new mode is activated by passing the `--nextclade-compatibility` flag to augur ancestral. It does no harm to anyone who does not want to use it. It results in some code duplication, but I think it's totally worth it for the sake of usability. I have already started using it in some of my private builds and it works like a charm.

rneher

I think this would be useful to have. reduces friction using nextclade datasets as annotation.

alternatively, we could use the nextclade conversion of the gff to json and parse the json, but this here should cover the majority of use cases.

genehack · 2024-11-12T18:53:22Z

One nitpick, just based on the description of the work: if the flag is going to be --use-nextclade-gff-parsing, it strikes me as potentially deeply confusing and hard to untangle that the fallback of looking for gene features is present in Nextclade, but not here.

corneliusroemer · 2024-11-12T20:31:44Z

One nitpick, just based on the description of the work: if the flag is going to be --use-nextclade-gff-parsing, it strikes me as potentially deeply confusing and hard to untangle that the fallback of looking for gene features is present in Nextclade, but not here.

Good point, this is not 100% compatible, so maybe best to not mention Nextclade so explicitly in the arg name, i.e. use a different name. Proposals welcome!

Nextclade only looks at the genes if there are no cdses present at all.

If someone tries to use a gff like that, we could error if the flag is present.

https://github.com/nextstrain/nextclade/blob/c95471b4faa9121feea6102a333672ffcca74a06/packages/nextclade/src/gene/gene.rs#L72-L75

genehack · 2024-11-13T18:45:43Z

Good point, this is not 100% compatible, so maybe best to not mention Nextclade so explicitly in the arg name, i.e. use a different name. Proposals welcome!

Maybe something like --use-improved-gff-parsing or --use-v2-gff-parsing?

genehack · 2024-11-14T20:48:40Z

augur/utils.py

+    If there are multiple records, raise AugurError.
+    Optionally, limit the GFF types to parse.
+    """
+    from BCBio import GFF


Do we have any sort of "all imports at top of file" convention or rule?

There is no code style enforcement in this codebase. Historically, each subcommand has different authors with varying styles, so there is also no convention.

For this line in particular, I think it's fine to scope the import to the function if no other functions in utils.py need it.

How much do top-of-file imports contribute to the painful loading times @victorlin? We're essentially importing every dependency in order to show --help aren't we?

@jameshadfield good point and you're right. I wrote more in #472 (comment)

corneliusroemer mentioned this pull request Nov 9, 2024

feat: treat CDS as a valid feature for GFFs in load_features #1332

Closed

corneliusroemer requested review from rneher and jameshadfield November 9, 2024 02:56

This was referenced Nov 9, 2024

Missing escape of square brackets causes diff_jsons.py in some cram tests to be false negative #1665

Open

Gff3 file: differences between nextclade datasets and augur ancestral #1655

Open

corneliusroemer requested a review from huddlej November 9, 2024 22:15

jameshadfield mentioned this pull request Nov 10, 2024

feat(ancestral): make --genes optional when reconstructing translations, if omitted, translate all genes in annotation #1668

Draft

4 tasks

corneliusroemer commented Nov 10, 2024

View reviewed changes

augur/utils.py Outdated Show resolved Hide resolved

corneliusroemer closed this Nov 10, 2024

corneliusroemer reopened this Nov 10, 2024

corneliusroemer closed this Nov 10, 2024

corneliusroemer reopened this Nov 10, 2024

corneliusroemer added 7 commits November 10, 2024 20:45

fix test

46d2f81

Propagate valid types

4571e59

Add cram test for Nextclade compat mode

5d8fe73

Fix test, rename argument to --use-nextclade-gff-parsing

6e96773

Fix mypy errors by reducing typing information :/

1280a95

Update augur/utils.py

566150e

corneliusroemer force-pushed the nextclade-compat branch from ec82215 to 566150e Compare November 10, 2024 19:45

rneher approved these changes Nov 11, 2024

View reviewed changes

corneliusroemer requested review from victorlin and joverlee521 November 12, 2024 13:54

genehack reviewed Nov 14, 2024

View reviewed changes

victorlin mentioned this pull request Nov 14, 2024

Improve startup time #472

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ancestral): Add optional Nextclade GFF3 compatibility mode #1664

feat(ancestral): Add optional Nextclade GFF3 compatibility mode #1664

corneliusroemer commented Nov 9, 2024 •

edited

Loading

codecov bot commented Nov 9, 2024 •

edited

Loading

corneliusroemer commented Nov 10, 2024

rneher left a comment

genehack commented Nov 12, 2024

corneliusroemer commented Nov 12, 2024

genehack commented Nov 13, 2024

genehack Nov 14, 2024

victorlin Nov 14, 2024

jameshadfield Nov 14, 2024

victorlin Nov 14, 2024

feat(ancestral): Add optional Nextclade GFF3 compatibility mode #1664

Are you sure you want to change the base?

feat(ancestral): Add optional Nextclade GFF3 compatibility mode #1664

Conversation

corneliusroemer commented Nov 9, 2024 • edited Loading

Description of proposed changes

Feedback requested

Alternatives considered

Work around with other tooling to generate gb file from gff

Related issue(s)

Future work

Checklist

codecov bot commented Nov 9, 2024 • edited Loading

Codecov Report

corneliusroemer commented Nov 10, 2024

rneher left a comment

Choose a reason for hiding this comment

genehack commented Nov 12, 2024

corneliusroemer commented Nov 12, 2024

genehack commented Nov 13, 2024

genehack Nov 14, 2024

Choose a reason for hiding this comment

victorlin Nov 14, 2024

Choose a reason for hiding this comment

jameshadfield Nov 14, 2024

Choose a reason for hiding this comment

victorlin Nov 14, 2024

Choose a reason for hiding this comment

corneliusroemer commented Nov 9, 2024 •

edited

Loading

codecov bot commented Nov 9, 2024 •

edited

Loading