ERROR ~ Error executing process > 'egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype' #61
Comments
Hi,
This error means your genome assembly is contaminated with prokaryotic sequences. Please run FCS on it first to clean it up. Pooja |
Thank you very much. But if I enter the taxonomy ID of an odd-toed ungulate species instead, the run works. Also, the genome was downloaded from the NCBI database, so there should be no contamination. |
Hi,
Can you provide your input YAML?
What do you mean by this? Eric |
genome: /data/dell/CNI.2024.10.5/1.deer.ref/GCA_040085125.1_ASM4008512v1_genomic.fna
In the run that succeeded, I had mistakenly changed the taxid to 1618199 |
Public assemblies aren't necessarily clean of contamination, but in this case we're not reporting anything: and given the nature of the assembly (HiFi, very few contigs) and species, I agree it's unlikely that there's much if any residual contamination.
I'm surprised that had an effect. The only thing I can think of that it would influence is the starting hmm parameters, which could ultimately result in slightly different final parameters but the training cycle should have resulted in minimal differences. We'll investigate further. Thanks for the report. |
Also, could you clarify some details about your config?
If your original run was on the deer GCA_040085125.1 FASTA, using txid 6954 (a tick), then it would have skipped hmm parameter training and used the tick hmm parameters directly. That could have a fairly large difference compared to using txid 1618199, especially at this stage. |
The parameters that happened to run successfully are genome: /data/dell/CNI.2024.10.5/1.deer.ref/GCA_040085125.1_ASM4008512v1_genomic.fna
1618199 represents an animal of the genus Equus |
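The taxid mix-up discussed above is easy to miss, since the command in the original report uses a config named input_D_farinae_small.local5.yaml with a deer genome. A minimal sketch of a pre-flight check follows. It assumes the input file has top-level `genome` and `taxid` keys written as flat `key: value` lines. The helper names `read_flat_yaml` and `check_taxid` are hypothetical and not part of EGAPx, and the expected taxid 9860 is only a placeholder for whatever ID you look up for your own species.

```python
def read_flat_yaml(text):
    """Parse top-level 'key: value' lines only; nested blocks are ignored."""
    config = {}
    for line in text.splitlines():
        # Skip blank lines, comments, and indented (nested) lines.
        if not line.strip() or line.lstrip().startswith("#") or line[0].isspace():
            continue
        key, sep, value = line.partition(":")
        if sep and value.strip():
            config[key.strip()] = value.strip()
    return config


def check_taxid(config, expected_taxid):
    """Return a warning string if the configured taxid differs, else None."""
    found = config.get("taxid")
    if found is None:
        return "no taxid key found"
    if int(found) != int(expected_taxid):
        return f"taxid is {found}, expected {expected_taxid}"
    return None


# Hypothetical config text; the path and both taxids are placeholders.
example = """\
genome: /path/to/genome.fna
taxid: 6954
"""
print(check_taxid(read_flat_yaml(example), 9860))
```

Running a check like this before launching the pipeline would flag a taxid left over from another species' config, which is the situation the maintainers asked about.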
Hi @bismarck1008, we were able to reproduce your error. We have traced the bug and will be releasing a new version with the fix shortly. Thank you for your patience, and also for reporting it so that we can make EGAPx better! |
It is a pleasure to participate in this meaningful work, and thank you very much for your efforts |
Dear all, |
Unfortunately we don't have a workaround for v0.3.1. We may be able to release a new version soon, although the holidays and some other issues outside of our control may cause some delays. |
Subject: Issue Encountered While Annotating the Red Deer Genome Using Transcriptome Data
Dear Team,
I hope this message finds you well. I encountered an issue while attempting to annotate the red deer genome (GCA_910594005.1) using transcriptome data obtained from the genome's full annotation report (https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/Cervus_elaphus/100/). I downloaded the relevant files locally and followed the usual steps for genome annotation, but I ran into an error during execution.
Here’s a summary of the error I encountered:
Classifier internal state for EGAPx Test Assembly:
Allowing locusType-631 models: true
Additionally, here are the parameters I used for the annotation:
genome: /mnt/z/libin/BCLgenome/04Geneprediction/10NCBIanoation/RedDeer.fa
Interestingly, I was able to successfully run the annotation with the sheep transcriptome on the sheep genome, which makes me wonder if this issue could be specific to cervid genomes. One possible explanation I’m considering is that the red deer genome might have undergone soft masking, which could affect the alignment and annotation process. Alternatively, the transcriptome data might not have been adequately quality-controlled before use. Thank you very much for your time and assistance. Best regards |
python3 /data/bio-software/egapx/ui/egapx.py ../data/input_D_farinae_small.local5.yaml -e docker -o GCA_040085125.1_ASM4008512v1_out2 -w ./temp
[6f/72e38f] process > egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype [100%] 4 of 4, failed: 4, retries: 3 ✘
ERROR ~ Error executing process > 'egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype'
Caused by:
Process egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype terminated with an error exit status (3)
Command executed:
mkdir -p output
mkdir -p ./asncache/
prime_cache -cache ./asncache/ -ifmt asnb-seq-entry -i swissprot.asnb -oseq-ids spids -split-sequences
prime_cache -cache ./asncache/ -ifmt asnb-seq-entry -i gnomon_wnode.out -oseq-ids gnids -split-sequences
lds2_indexer -source genome/ -db LDS2
echo "hits.diamond.asn" > raw_blastp_hits.mft
merge_blastp_hits -asn-cache ./asncache/ -nogenbank -lds2 LDS2 -input-manifest raw_blastp_hits.mft -o prot_hits.asn
echo "gnomon_wnode.out" > models.mft
echo "prot_hits.asn" > prot_hits.mft
echo "" > splices.mft
if [ -z "" ]
then
gnomon_biotype -gc gencoll.asn -asn-cache ./asncache/ -lds2 ./LDS2 -nogenbank -gnomon_models models.mft -o output/biotypes.tsv -o_prots_rpt output/prots_rpt.tsv -prot_hits prot_hits.mft -prot_splices splices.mft -reftrack-server 'NONE' -allow_lt631 true
else
gnomon_biotype -gc gencoll.asn -asn-cache ./asncache/ -lds2 ./LDS2 -nogenbank -gnomon_models models.mft -o output/biotypes.tsv -o_prots_rpt output/prots_rpt.tsv -prot_denylist -prot_hits prot_hits.mft -prot_splices splices.mft -reftrack-server 'NONE' -allow_lt631 true
fi
Command exit status:
3
Command output:
(empty)
Command error:
Prefetching 4358 bioseqs
Prefetching 4605 bioseqs
Prefetching 5119 bioseqs
Prefetching 5021 bioseqs
Prefetching 4387 bioseqs
Prefetching 4726 bioseqs
Prefetching 4029 bioseqs
Prefetching 4442 bioseqs
Prefetching 4865 bioseqs
Prefetching 3001 bioseqs
Prefetching 4831 bioseqs
Prefetching 4484 bioseqs
Prefetching 2711 bioseqs
Prefetching 1272 bioseqs
Prefetching 1170 bioseqs
Second-pass: computing bestness scores
Starting.
Fetching Gnomon model data.
Loading GC-Assembly.
Taxon is invertebrate or plant - will allow more coding models
Loading protein hits
Skipped 19932 protein hits without corresponding CDS features
Processed 274631 hits; accepted 141526; 24500 are RBPH
Loading protein data.
Retrieving attributes for 43024 prots
Fetching next batch of 10000
Fetching next batch of 10000
Fetching next batch of 10000
Fetching next batch of 10000
Creating classifier.
Classifier internal state for EGAPx Test Assembly:
0: 907233/264=3436.49 907233/1442=629.149
1: 1.06183e+06/3514=302.172 1.06183e+06/7267=146.117
M=[730 326; 398 2826]; PPV=0.64659; NPV=0.896289; ACC=0.830647
Allowing locusType-631 models: true
Initialized 10 patterns for attr_rule=538.
Initialized 36 patterns for attr_rule=489.
Initialized 6 patterns for attr_rule=989.
Initialized 11 patterns for attr_rule=986.
Initialized 6 patterns for attr_rule=987.
Initialized 5 patterns for attr_rule=988.
Outputting.
Initialized 70 patterns for attr_rule=869.
BPH to proks: 5.88253%
Error: (CException::eUnknown) Too many protein hits to proks (GP-23178)
Error: (106.16) Application's execution failed (CException::eUnknown) Too many protein hits to proks (GP-23178)
Work dir:
/data/dell/CNI.2024.10.5/2.anotation/test/temp/6f/72e38fd113038b2726553d1e7b22e0
Tip: you can replicate the issue by changing to the process work dir and entering the command
bash .command.run
-- Check 'GCA_040085125.1_ASM4008512v1_out2/nextflow.log' file for details
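The two lines that matter in the long log above are the `BPH to proks` percentage and the error that follows it. As a rough sketch, they can be pulled out of captured stderr text as shown below. The function name `prok_hit_summary` is hypothetical, and the cutoff that gnomon_biotype applies is not stated in this thread, so the sketch only reports the value and does not compare it to a threshold.

```python
import re


def prok_hit_summary(log_text):
    """Return (percent, failed) from gnomon_biotype stderr text.

    percent: the 'BPH to proks' value as a float, or None if absent.
    failed:  True if the 'Too many protein hits to proks' error appears.
    """
    match = re.search(r"BPH to proks:\s*([0-9.]+)%", log_text)
    percent = float(match.group(1)) if match else None
    failed = "Too many protein hits to proks" in log_text
    return percent, failed


# Two lines copied from the log in this report.
log = """\
BPH to proks: 5.88253%
Error: (CException::eUnknown) Too many protein hits to proks (GP-23178)
"""
print(prok_hit_summary(log))
```

In practice the text would likely come from the `.command.err` file that Nextflow writes into the process work dir shown above.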