Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ancestral): Add optional Nextclade GFF3 compatibility mode #1664

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

corneliusroemer
Copy link
Member

@corneliusroemer corneliusroemer commented Nov 9, 2024

Description of proposed changes

Until this PR, it was hard to use augur ancestral with a genome annotation gff3 from a Nextclade dataset for multiple reasons:

  • The GFF reader in augur looks for "gene" features only, while Nextclade 3 uses primarily CDS (gene only for backwards compatibility)
  • Augur's GFF reader doesn't support compound features
  • It takes the "gene"/"feature" name in the order "gene" > "gene_name" > "locus_tag", which differs from Nextclade's order

This meant that users had to create a Genbank file, just for the purpose of running augur ancestral.
That was doable but tedious.

This PR solves the problem by adding a new GFF reader mode that is designed to be as compatible with Nextclade as possible. That means:

  • It looks for CDS features (for simplicity it ignores Nextclade's fallback support of gene features)
  • It supports compound features
  • It takes the feature name in the exact same order as Nextclade

The new mode is activated by passing the --use-nextclade-gff-parsing flag to augur ancestral.

It does no harm to anyone who does not want to use it.

It results in some code duplication, but I think it's totally worth it for the sake of usability.
I have already started using it in some of my private builds and it works like a charm.

Feedback requested

I'm not sure about the best name for the flag. Even I can't remember whether I chose --nextclade-compatible or --nextclade-compatibility. Maybe I should shorten to --nextclade-compat, or use --nextclade-gff-mode instead?

For now I've chosen --use-nextclade-gff-parsing but there's scope for bike shedding.

Alternatives considered

Work around with other tooling to generate gb file from gff

I've tried various things to work around the issue with Nextclade GFF not working with augur ancestral. Around a year ago I made a gff-to-genbank CLI pip installable (https://github.com/corneliusroemer/gff-to-genbank) - I considered reusing it for my current project but ended up aborting that approach as it ended up being complicated: while I could write compound locations correctly, I still needed to do sed replacements of the qualifiers, so that spurious locus_tags in the GFF wouldn't end up getting priority over the feature names that Nextclade would use. I had to do yet another sed replacement on the produced genbank file to replace the "region" feature with "source".

The solution proposed here is better in that it doesn't require users to install another tool, doesn't require any extra sed replacements.

Yes it does cause a bit of code duplication, and users need to know about the special --nextclade-compatible flag, but I just can't think of a better solution - this problem has existed for more than a year and 3 weeks we didn't seem any closer to a solution.

Related issue(s)

addresses the point raised in #1655

Future work

There's a small follow-up PR that further increases usability of ancestral for translations: #1668 - it drops the requirement to pass --genes and by default uses all genes.

Checklist

  • Explain how it works in the docs: some annotation section
  • Automated checks pass
  • Check if you need to add a changelog message
  • Check if you need to add tests
  • Check if you need to update docs

Copy link

codecov bot commented Nov 9, 2024

Codecov Report

Attention: Patch coverage is 74.71264% with 22 lines in your changes missing coverage. Please review.

Project coverage is 72.25%. Comparing base (2e993df) to head (566150e).

Files with missing lines Patch % Lines
augur/utils.py 73.17% 9 Missing and 13 partials ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1664   +/-   ##
=======================================
  Coverage   72.25%   72.25%           
=======================================
  Files          79       79           
  Lines        8276     8320   +44     
  Branches     1691     1707   +16     
=======================================
+ Hits         5980     6012   +32     
- Misses       2011     2014    +3     
- Partials      285      294    +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

augur/utils.py Outdated Show resolved Hide resolved
@corneliusroemer
Copy link
Member Author

Sorry for the noise closing, I typed the wrong gh cli command, which closed instead of checked out the PR 🤦

Until this PR, it was hard to use augur ancestral with a genome annotation gff3 from a Nextclade dataset for multiple reasons:
- The GFF reader in augur looks for "gene" features only, while Nextclade 3 uses primarily CDS (gene only for backwards compatibility)
- Augur's GFF reader doesn't support compound features
- It takes the "gene"/"feature" name in the order "gene" > "gene_name" > "locus_tag", which differs from Nextclade's order

This meant that users had to create a Genbank file, just for the purpose of running augur ancestral.
That was doable but tedious.

This PR solves the problem by adding a new GFF reader mode that is designed to be as compatible with Nextclade as possible. That means:
- It looks for CDS features (for simplicity it ignores Nextclade's gene backwards compatibility)
- It supports compound features
- It takes the feature name in the exact same order as Nextclade

The new mode is activated by passing the `--nextclade-compatibility` flag to augur ancestral.

It does no harm to anyone who does not want to use it.

It results in some code duplication, but I think it's totally worth it for the sake of usability.
I have already started using it in some of my private builds and it works like a charm.
Copy link
Member

@rneher rneher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be useful to have. reduces friction using nextclade datasets as annotation.

alternatively, we could use the nextclade conversion of the gff to json and parse the json, but this here should cover the majority of use cases.

@genehack
Copy link
Contributor

One nitpick, just based on the description of the work: if the flag is going to be --use-nextclade-gff-parsing, it strikes me as potentially deeply confusing and hard to untangle that the fallback of looking for gene features is present in Nextclade, but not here.

@corneliusroemer
Copy link
Member Author

One nitpick, just based on the description of the work: if the flag is going to be --use-nextclade-gff-parsing, it strikes me as potentially deeply confusing and hard to untangle that the fallback of looking for gene features is present in Nextclade, but not here.

Good point, this is not 100% compatible, so maybe best to not mention Nextclade so explicitly in the arg name, i.e. use a different name. Proposals welcome!

Nextclade only looks at the genes if there are no cdses present at all.

If someone tries to use a gff like that, we could error if the flag is present.

https://github.com/nextstrain/nextclade/blob/c95471b4faa9121feea6102a333672ffcca74a06/packages/nextclade/src/gene/gene.rs#L72-L75

@genehack
Copy link
Contributor

Good point, this is not 100% compatible, so maybe best to not mention Nextclade so explicitly in the arg name, i.e. use a different name. Proposals welcome!

Maybe something like --use-improved-gff-parsing or --use-v2-gff-parsing?

If there are multiple records, raise AugurError.
Optionally, limit the GFF types to parse.
"""
from BCBio import GFF
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any sort of "all imports at top of file" convention or rule?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no code style enforcement in this codebase. Historically, each subcommand has different authors with varying styles, so there is also no convention.

For this line in particular, I think it's fine to scope the import to the function if no other functions in utils.py need it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much do top-of-file imports contribute to the painful loading times @victorlin? We're essentially importing every dependency in order to show --help aren't we?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameshadfield good point and you're right. I wrote more in #472 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants