-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] anvi-run-hmms
does not accurately detect 5S rRNA
#2344
Comments
If I'm not wrong, the rRNA HMM models come from Barrnap. Tbh, I've had similar issues recently with other pipelines that rely on Barrnap for rRNA detection, also with a collection of archaeal circularised genomes. Mainly missing the 5S but occasionally also some copies of 16S. For single genome annotations I ended using Infernal with the covariance models from Rfam. |
The HMMs do come from Barrnap, but the Anvio readme file about how they imported them says they arbitrarily added a GA cutoff of 750 bitscore that wasn't present in the original HMMs. |
Thanks for bringing this up, @philipwoods. Do you happen to have any insights whether changing that GA cutoff changes the rate of the recognition of 5S rRNA? We can also remove the GA cutoff and put a significance threshold there, as well. I'm happy to take a look and see what works well, but it would be much better if it was done by someone who is genuinely interested in identifying 5S rRNAs more effectively, and we always appreciate any help :) |
My apologies for the slow response! When I ran the 5S HMM independently of Anvi'o, the lowest bitscore I saw from circularized reference genomes was about 47-48, so I would suggest a GA cutoff slightly below that. I occasionally saw recognition of multiple 5S genes per genome, but rarely. However, I don't have any other special knowledge about the 5S rRNA gene :). For reference, the reason I was looking for it is that the presence of 23S, 16S, and 5S genes are required to call a MAG high-quality according to the standard presented here, so I imagine that this functionality will be useful to wanting to work with high-quality MAGs. If you have suggestions of things to test, I'm happy to help out :). |
@philipwoods you usually have as many 5S as SSU/LSU copies, although it can vary a bit. With the archaea I'm working, the 5S get a maximum score of ~23 (same seqs detected with infernal/Rfam). The bit score is heavily dependent on the size of the profile/gene, and the 5S is very tiny, ~120 nt. Pretty sure an score of 750 is impossible. I created a 5S hmm profile off the Rfam seed sequences, and the maximum score I got was ~95 and that's searching the source sequences against the profile. My 5S got a bump of nearly 10 points to ~32, prob just bc the profile is way more up to date than the one in barrnap. Nothing beats using covariance models for ncRNAs, but running the whole Rfam might be undesirable for a whole metagenome, although if anything that it's not a miRNA/tRNA (tRNAscan-SE is still better for tRNA) is removed from screening it should make it way speedier... |
Can we use your 5S profile in anvi'o @xvazquezc? Would you consider sending a PR? We would love to update those models! |
A bit tight on time, but I can do something maybe better, I can give the code to generate new hmm profiles for all the rRNA molecules available in Rfam. This is pretty much what I did yesterday for the 5S profile, but scaled up. wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.seed.gz
gunzip Rfam.seed.gz
RfamID=( 5S_rRNA 5_8S_rRNA SSU_rRNA_bacteria SSU_rRNA_archaea SSU_rRNA_eukarya LSU_rRNA_archaea LSU_rRNA_bacteria SSU_rRNA_microsporidia LSU_rRNA_eukarya SSU_trypano_mito LSU_trypano_mito )
for i in ${RfamID[@]}; do
esl-afetch Rfam.seed ${i} > ${i}.sto
hmmbuild ${i}.hmm ${i}.sto
done |
Thanks for this, @xvazquezc. I'll mark this as a small project with the hope that someone will pick it up and work on this :) |
Short description of the problem
Even in single-contig circularized genomes,
anvi-run-hmms
does not reliably detect the 5S rRNA gene.anvi'o version
System info
Our operating system is Red Hat enterprise Linux. Anvio was installed in a conda environment.
Detailed description of the issue
Looking at the ribosomal RNA HMM readme file, it appears that the ribosomal RNA HMMs all use a GA cutoff of 750 bitscore. However, the 5S rRNA gene is short, and when I run the 5S HMM independently on circularized genomes I see that alignment along the full length of the HMM (about 115nt) only reaches an e-value of about 1e-14 or 1e-15 and a bitscore of 52-55. This suggests that the current bitscore cutoff makes it impossible to detect the 5S rRNA gene with
anvi-run-hmms
. After more testing on a few other genomes to refine the cutoff, I would suggest changing the GA bitscore cutoff for the 5S rRNA HMM to 45.Files / commands to reproduce the issue
To test Anvio HMMs on a circularized genome found here:
anvi-get-sequences-for-hmm-hits -c Methanosarcina_acetivorans_GCF_000007345.1.db --hmm-sources Ribosomal_RNA_5S -o 5S-test.fasta
To run the HMM independently of Anvio:
nhmmer 5S-rRNA.hmm Methanosarcina_acetivorans_GCF_000007345.1.fna
To test the HMM independently with the GA threshold:
nhmmer --cut_ga 5S-rRNA.hmm Methanosarcina_acetivorans_GCF_000007345.1.fna
The text was updated successfully, but these errors were encountered: