Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] anvi-run-hmms does not accurately detect 5S rRNA #2344

Open
philipwoods opened this issue Sep 20, 2024 · 8 comments
Open

[BUG] anvi-run-hmms does not accurately detect 5S rRNA #2344

philipwoods opened this issue Sep 20, 2024 · 8 comments

Comments

@philipwoods
Copy link

philipwoods commented Sep 20, 2024

Short description of the problem

Even in single-contig circularized genomes, anvi-run-hmms does not reliably detect the 5S rRNA gene.

anvi'o version

Anvi'o .......................................: marie (v8)
Python .......................................: 3.10.13

Profile database .............................: 38
Contigs database .............................: 21
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

Our operating system is Red Hat enterprise Linux. Anvio was installed in a conda environment.

Detailed description of the issue

Looking at the ribosomal RNA HMM readme file, it appears that the ribosomal RNA HMMs all use a GA cutoff of 750 bitscore. However, the 5S rRNA gene is short, and when I run the 5S HMM independently on circularized genomes I see that alignment along the full length of the HMM (about 115nt) only reaches an e-value of about 1e-14 or 1e-15 and a bitscore of 52-55. This suggests that the current bitscore cutoff makes it impossible to detect the 5S rRNA gene with anvi-run-hmms. After more testing on a few other genomes to refine the cutoff, I would suggest changing the GA bitscore cutoff for the 5S rRNA HMM to 45.

Files / commands to reproduce the issue

To test Anvio HMMs on a circularized genome found here:

anvi-get-sequences-for-hmm-hits -c Methanosarcina_acetivorans_GCF_000007345.1.db --hmm-sources Ribosomal_RNA_5S -o 5S-test.fasta

To run the HMM independently of Anvio:

nhmmer 5S-rRNA.hmm Methanosarcina_acetivorans_GCF_000007345.1.fna

To test the HMM independently with the GA threshold:

nhmmer --cut_ga 5S-rRNA.hmm Methanosarcina_acetivorans_GCF_000007345.1.fna

@xvazquezc
Copy link
Contributor

If I'm not wrong, the rRNA HMM models come from Barrnap. Tbh, I've had similar issues recently with other pipelines that rely on Barrnap for rRNA detection, also with a collection of archaeal circularised genomes. Mainly missing the 5S but occasionally also some copies of 16S. For single genome annotations I ended using Infernal with the covariance models from Rfam.

@philipwoods
Copy link
Author

The HMMs do come from Barrnap, but the Anvio readme file about how they imported them says they arbitrarily added a GA cutoff of 750 bitscore that wasn't present in the original HMMs.

@meren
Copy link
Member

meren commented Sep 21, 2024

Thanks for bringing this up, @philipwoods. Do you happen to have any insights whether changing that GA cutoff changes the rate of the recognition of 5S rRNA?

We can also remove the GA cutoff and put a significance threshold there, as well. I'm happy to take a look and see what works well, but it would be much better if it was done by someone who is genuinely interested in identifying 5S rRNAs more effectively, and we always appreciate any help :)

@philipwoods
Copy link
Author

My apologies for the slow response! When I ran the 5S HMM independently of Anvi'o, the lowest bitscore I saw from circularized reference genomes was about 47-48, so I would suggest a GA cutoff slightly below that. I occasionally saw recognition of multiple 5S genes per genome, but rarely. However, I don't have any other special knowledge about the 5S rRNA gene :). For reference, the reason I was looking for it is that the presence of 23S, 16S, and 5S genes are required to call a MAG high-quality according to the standard presented here, so I imagine that this functionality will be useful to wanting to work with high-quality MAGs. If you have suggestions of things to test, I'm happy to help out :).

@xvazquezc
Copy link
Contributor

@philipwoods you usually have as many 5S as SSU/LSU copies, although it can vary a bit.

With the archaea I'm working, the 5S get a maximum score of ~23 (same seqs detected with infernal/Rfam). The bit score is heavily dependent on the size of the profile/gene, and the 5S is very tiny, ~120 nt. Pretty sure an score of 750 is impossible.

I created a 5S hmm profile off the Rfam seed sequences, and the maximum score I got was ~95 and that's searching the source sequences against the profile. My 5S got a bump of nearly 10 points to ~32, prob just bc the profile is way more up to date than the one in barrnap.

Nothing beats using covariance models for ncRNAs, but running the whole Rfam might be undesirable for a whole metagenome, although if anything that it's not a miRNA/tRNA (tRNAscan-SE is still better for tRNA) is removed from screening it should make it way speedier...

@meren
Copy link
Member

meren commented Dec 9, 2024

Can we use your 5S profile in anvi'o @xvazquezc? Would you consider sending a PR? We would love to update those models!

@xvazquezc
Copy link
Contributor

A bit tight on time, but I can do something maybe better, I can give the code to generate new hmm profiles for all the rRNA molecules available in Rfam. This is pretty much what I did yesterday for the 5S profile, but scaled up.
Just need HMMER and the easel toolkit that comes with Infernal:

wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.seed.gz
gunzip Rfam.seed.gz

RfamID=( 5S_rRNA 5_8S_rRNA SSU_rRNA_bacteria SSU_rRNA_archaea SSU_rRNA_eukarya LSU_rRNA_archaea LSU_rRNA_bacteria SSU_rRNA_microsporidia LSU_rRNA_eukarya SSU_trypano_mito LSU_trypano_mito )

for i in ${RfamID[@]}; do
    esl-afetch Rfam.seed ${i} > ${i}.sto
    hmmbuild ${i}.hmm ${i}.sto
done

@meren
Copy link
Member

meren commented Dec 10, 2024

Thanks for this, @xvazquezc. I'll mark this as a small project with the hope that someone will pick it up and work on this :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants