[FEATURE REQUEST] Different gene caller option(s) for anvi-gen-contigs-database #2298

Open
ivagljiva opened this issue Jul 9, 2024 · 5 comments

@ivagljiva
Contributor

A small project to improve anvi'o, based upon feedback/ideas @FlorianTrigodet and I heard from our colleagues at the QIB in Norwich.

The need

There is interest in being able to use alternative gene calling software in addition to prodigal within anvi'o (i.e., instead of having to run gene calling outside of anvi'o and import the results as external gene calls). We've heard specifically about prodigal-gv, a fork of prodigal with additions that improve gene calling for viruses, and pyrodigal/pyrodigal-gv, the respective Python modules for using these tools directly from code. However, there could be other gene callers of interest to the community.

The solution

This small project is flexible in scope depending on which gene calling software we want to support and how far you (the developer) want to go with the refactor. Here are some possibilities:

  • implementing prodigal-gv could be as simple as adding a variable to store either prodigal or prodigal-gv according to user input, and replacing all instances of calling prodigal with this variable. It would use the same driver/parser modules as prodigal uses, and in theory no further changes would be necessary
  • incorporating one (or both) of the pyrodigal options would require changes to how we actually run the gene calling step. We would no longer use a driver program that runs the prodigal binary, but would switch that to using the pyrodigal classes directly. Multi-threading and parsing of the results would also have to change to be compatible with those classes (they are thread-safe but it looks like we would still manage the multi-threading on our own).
  • This could be a good opportunity to refactor the way we store gene call information in the anvi'o databases, to incorporate additional data as suggested in [FEATURE REQUEST] Preserve prodigal metadata for anvi-export-gene-calls #2181 and [FEATURE REQUEST] Refactoring Anvio to be more eukaryote friendly/account for different genetic architectures #2297. That would require much more extensive code changes.
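A minimal sketch of the first option above, assuming the caller is chosen through a hypothetical `gene_caller` argument. The function name and the tuple of supported callers are illustrative and are not anvi'o's actual driver code. The flags used (`-i`, `-o`, `-a`, `-p meta`) are standard Prodigal options, which prodigal-gv, being a fork, is expected to accept as well.

```python
import shlex

# Hypothetical set of supported executables; anvi'o's real driver may differ.
SUPPORTED_GENE_CALLERS = ("prodigal", "prodigal-gv")


def build_gene_caller_command(gene_caller, fasta_path, genes_path, proteins_path, meta=False):
    """Return the argv list for a Prodigal-compatible gene caller."""
    if gene_caller not in SUPPORTED_GENE_CALLERS:
        raise ValueError(
            f"Unknown gene caller {gene_caller!r}; expected one of {SUPPORTED_GENE_CALLERS}"
        )
    # The executable name is the only part that changes between the two callers.
    command = [gene_caller, "-i", fasta_path, "-o", genes_path, "-a", proteins_path]
    if meta:
        command += ["-p", "meta"]  # metagenomic mode
    return command


cmd = build_gene_caller_command(
    "prodigal-gv", "contigs.fa", "genes.out", "proteins.faa", meta=True
)
print(shlex.join(cmd))
```

The list returned here could be handed to `subprocess.run` unchanged, so the rest of a driver would not need to know which executable was selected.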

Beneficiaries

All users of anvi'o, but (in the case of prodigal-gv) especially those who work on viruses.

@xvazquezc
Contributor

Using pyrodigal as the default, or having it as an option, would be great. There are known bugs in prodigal that will never be addressed (it is no longer being developed) but have been fixed in pyrodigal, e.g. problems with gene calls on the reverse strand.

@FlorianTrigodet
Contributor

Thanks @xvazquezc, I just found out about all the unfixed bugs in prodigal that were fixed in prodigal-gv and pyrodigal/pyrodigal-gv!

Just for documentation, here are some known issues:

We should default to pyrodigal/pyrodigal-gv.

@apcamargo

apcamargo commented Jul 11, 2024

pyrodigal-gv is just a tiny layer on top of pyrodigal, so it would be trivial to have a flag that allows the user to disable the additional gene models that are included in pyrodigal-gv. prodigal-gv would be a simpler change from Prodigal, and it does include the fixes, but pyrodigal-gv is faster and makes it much easier to get gene data, as you won't have to parse Prodigal/prodigal-gv outputs.
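To illustrate the point about not parsing outputs, here is a rough sketch of reading gene data straight from the objects a finder returns. It assumes pyrodigal 3.x (`pyrodigal.GeneFinder`) and pyrodigal-gv (`pyrodigal_gv.ViralGeneFinder`), and that each gene exposes `begin`, `end`, `strand`, `partial_begin`, `partial_end` and `translation_table`; please check these against the versions you install. The `viral_models` flag, the function names and the row layout are hypothetical, not anything anvi'o defines.

```python
def make_finder(viral_models=True):
    """Build a metagenomic-mode finder. Imports are lazy, so the packages
    are only required when this function is actually called."""
    if viral_models:
        import pyrodigal_gv  # assumed to be installed
        return pyrodigal_gv.ViralGeneFinder(meta=True)
    import pyrodigal  # assumed to be installed
    return pyrodigal.GeneFinder(meta=True)


def gene_call_rows(finder, contig_name, sequence):
    """Turn the genes found in one contig into plain dictionaries.

    `finder` can be any object with a `find_genes(sequence)` method.
    Coordinates are kept exactly as the finder reports them.
    """
    rows = []
    for gene in finder.find_genes(sequence):
        rows.append({
            "contig": contig_name,
            "start": gene.begin,
            "stop": gene.end,
            "strand": gene.strand,
            "partial": gene.partial_begin or gene.partial_end,
            "translation_table": gene.translation_table,
        })
    return rows


# Example call (needs pyrodigal; the sequence would be a real contig):
# rows = gene_call_rows(make_finder(viral_models=False), "contig_1", contig_bytes)
```

Keeping the finder as a parameter means the flag for disabling the extra gene models only has to be handled in one place.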

I don't know how multi-threading is managed in anvi'o, but maybe this will be relevant: althonos/pyrodigal#57
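For reference, managing the threads on the calling side could look roughly like the sketch below. It assumes the finder's `find_genes` method is safe to call from several threads at once, as described earlier in this thread, and that the library releases the GIL while it works (otherwise threads would give little speed-up). The function name and the `(name, sequence)` input layout are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def find_genes_in_parallel(finder, contigs, num_threads=4):
    """Run finder.find_genes over (name, sequence) pairs with a thread pool.

    Returns a dict mapping each contig name to whatever find_genes returned.
    """
    names = [name for name, _ in contigs]
    sequences = [sequence for _, sequence in contigs]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map yields results in input order, so names and results stay aligned.
        results = list(pool.map(finder.find_genes, sequences))
    return dict(zip(names, results))
```

A single shared finder is used here; if that turns out not to be appropriate for a given mode, one finder per worker would be the obvious alternative.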

> implementing prodigal-gv could be as simple as adding a variable to store either prodigal or prodigal-gv according to user input, and replacing all instances of calling prodigal with this variable. It would use the same driver/parser modules as prodigal uses, and in theory no further changes would be necessary

Depending on how you are parsing Prodigal's outputs, you might need to change the parsing code a bit because prodigal-gv includes the genetic code in the outputs (apcamargo/prodigal-gv@120c779). Since having alternative genetic codes was one of the main reasons I developed prodigal-gv in the first place, I decided to make it obvious to users when a model with an alternative code was used.
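One way to make a parser tolerant of such an extra field is to read the `key=value` attributes of a Prodigal FASTA header into a dictionary instead of relying on their position. The sketch below assumes the usual `>id # start # stop # strand # key=value;...` header layout; the `genetic_code` key name in the second example is an assumption for illustration only, so check the linked commit for the exact field prodigal-gv writes. The function name is hypothetical.

```python
def parse_prodigal_header(header):
    """Parse a Prodigal-style FASTA header into a dictionary.

    Unknown attribute keys are kept, so an additional field does not
    break the parser.
    """
    gene_id, start, stop, strand, attributes = header.lstrip(">").strip().split(" # ", 4)
    parsed_attributes = {}
    for item in attributes.split(";"):
        if not item:
            continue  # tolerate a trailing ';'
        key, _, value = item.partition("=")
        parsed_attributes[key] = value
    return {
        "gene_id": gene_id,
        "start": int(start),
        "stop": int(stop),
        "strand": int(strand),
        "attributes": parsed_attributes,
    }


vanilla = parse_prodigal_header(
    ">contig_1_1 # 2 # 181 # 1 # ID=1_1;partial=10;start_type=Edge;"
    "rbs_motif=None;rbs_spacer=None;gc_cont=0.550"
)
# 'genetic_code' below is an assumed key name, used only to show an extra field.
with_code = parse_prodigal_header(
    ">contig_1_2 # 200 # 500 # -1 # ID=1_2;partial=00;start_type=ATG;"
    "rbs_motif=None;rbs_spacer=None;genetic_code=15;gc_cont=0.410"
)
print(vanilla["attributes"].get("genetic_code"))    # None
print(with_code["attributes"].get("genetic_code"))  # 15
```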

@meren meren self-assigned this Jul 11, 2024
@meren
Member

meren commented Jul 11, 2024

Thank you very much for your input, @apcamargo. I will work on this and try to come up with a modular solution.

@apcamargo

Sure! Let me know if there's anything I can help with.

Another (minor) consequence of changing the gene caller that I just remembered, and that is somewhat related to an issue I opened a few months ago (#2195), is that alternative genetic codes are not taken into account in anvi-gen-variability-profile. This is an issue even in vanilla Prodigal (which includes translation table 4 in the metagenome mode), but using prodigal-gv and pyrodigal-gv would increase the number of sequences translated with alternative codes (translation table 15).

In my data, I wrote the code to compute pN/pS from scratch (due to the bug in the potential computation I linked above) and, as far as I remember, the effect of alternative genetic codes on pN/pS was negligible. So, I don't think this is something super important, but it could be good to keep in mind.
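To make the effect concrete, here is a small self-contained sketch of how the same codon change can be classified differently depending on the translation table. It only encodes the amino acid differences between NCBI tables 11, 4 (TGA codes for Trp) and 15 (TAG codes for Gln); the function names are hypothetical and this is not how anvi-gen-variability-profile is implemented.

```python
from itertools import product

BASES = "TCAG"
# Standard amino acid assignments (shared by table 11), codons ordered TTT, TTC, TTA, ...
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Codon reassignments relative to the standard table.
OVERRIDES = {
    11: {},
    4: {"TGA": "W"},
    15: {"TAG": "Q"},
}


def codon_table(table_id):
    """Return a codon -> amino acid dict for one of the supported tables."""
    codons = ("".join(triplet) for triplet in product(BASES, repeat=3))
    table = dict(zip(codons, STANDARD))
    table.update(OVERRIDES[table_id])
    return table


def classify_substitution(ref_codon, alt_codon, table_id):
    """Label a codon change as synonymous or nonsynonymous under a table."""
    table = codon_table(table_id)
    same = table[ref_codon] == table[alt_codon]
    return "synonymous" if same else "nonsynonymous"


for table_id in (11, 4):
    print(table_id, classify_substitution("TGG", "TGA", table_id))
```

Under table 11 the TGG to TGA change introduces a stop codon, while under table 4 both codons encode tryptophan, so counts of synonymous and nonsynonymous sites shift only for the codons a table reassigns.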
