ERROR ~ Error executing process > 'egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype' #61

bismarck1008 · 2024-11-26T02:02:32Z

python3 /data/bio-software/egapx/ui/egapx.py ../data/input_D_farinae_small.local5.yaml -e docker -o GCA_040085125.1_ASM4008512v1_out2 -w ./temp

[6f/72e38f] process > egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype [100%] 4 of 4, failed: 4, retries: 3 ✘

ERROR ~ Error executing process > 'egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype'

Caused by:
Process egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype terminated with an error exit status (3)

Command executed:

mkdir -p output
mkdir -p ./asncache/
prime_cache -cache ./asncache/ -ifmt asnb-seq-entry -i swissprot.asnb -oseq-ids spids -split-sequences
prime_cache -cache ./asncache/ -ifmt asnb-seq-entry -i gnomon_wnode.out -oseq-ids gnids -split-sequences
lds2_indexer -source genome/ -db LDS2
echo "hits.diamond.asn" > raw_blastp_hits.mft
merge_blastp_hits -asn-cache ./asncache/ -nogenbank -lds2 LDS2 -input-manifest raw_blastp_hits.mft -o prot_hits.asn
echo "gnomon_wnode.out" > models.mft
echo "prot_hits.asn" > prot_hits.mft
echo "" > splices.mft
if [ -z "" ]
then
gnomon_biotype -gc gencoll.asn -asn-cache ./asncache/ -lds2 ./LDS2 -nogenbank -gnomon_models models.mft -o output/biotypes.tsv -o_prots_rpt output/prots_rpt.tsv -prot_hits prot_hits.mft -prot_splices splices.mft -reftrack-server 'NONE' -allow_lt631 true
else
gnomon_biotype -gc gencoll.asn -asn-cache ./asncache/ -lds2 ./LDS2 -nogenbank -gnomon_models models.mft -o output/biotypes.tsv -o_prots_rpt output/prots_rpt.tsv -prot_denylist -prot_hits prot_hits.mft -prot_splices splices.mft -reftrack-server 'NONE' -allow_lt631 true
fi

Command exit status:
3

Command output:
(empty)

Command error:
Prefetching 4358 bioseqs
Prefetching 4605 bioseqs
Prefetching 5119 bioseqs
Prefetching 5021 bioseqs
Prefetching 4387 bioseqs
Prefetching 4726 bioseqs
Prefetching 4029 bioseqs
Prefetching 4442 bioseqs
Prefetching 4865 bioseqs
Prefetching 3001 bioseqs
Prefetching 4831 bioseqs
Prefetching 4484 bioseqs
Prefetching 2711 bioseqs
Prefetching 1272 bioseqs
Prefetching 1170 bioseqs
Second-pass: computing bestness scores

Starting.
Fetching Gnomon model data.
Loading GC-Assembly.
Taxon is invertebrate or plant - will allow more coding models
Loading protein hits
Skipped 19932 protein hits without corresponding CDS features
Processed 274631 hits; accepted 141526; 24500 are RBPH
Loading protein data.
Retrieving attributes for 43024 prots
Fetching next batch of 10000
Fetching next batch of 10000
Fetching next batch of 10000
Fetching next batch of 10000
Creating classifier.

Classifier internal state for EGAPx Test Assembly:
0: 907233/264=3436.49 907233/1442=629.149
1: 1.06183e+06/3514=302.172 1.06183e+06/7267=146.117
M=[730 326; 398 2826]; PPV=0.64659; NPV=0.896289; ACC=0.830647

Allowing locusType-631 models: true
Initialized 10 patterns for attr_rule=538.
Initialized 36 patterns for attr_rule=489.
Initialized 6 patterns for attr_rule=989.
Initialized 11 patterns for attr_rule=986.
Initialized 6 patterns for attr_rule=987.
Initialized 5 patterns for attr_rule=988.
Outputting.
Initialized 70 patterns for attr_rule=869.
BPH to proks: 5.88253%
Error: (CException::eUnknown) Too many protein hits to proks (GP-23178)
Error: (106.16) Application's execution failed (CException::eUnknown) Too many protein hits to proks (GP-23178)

Work dir:
/data/dell/CNI.2024.10.5/2.anotation/test/temp/6f/72e38fd113038b2726553d1e7b22e0

Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run

-- Check 'GCA_040085125.1_ASM4008512v1_out2/nextflow.log' file for details
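
The tip above points at the process work directory. As a rough sketch (not part of EGAPx), the Python helper below gathers the files Nextflow normally writes into a task work directory (`.command.sh`, `.command.err`, `.exitcode`) so the failing step can be inspected without rerunning it. The name `summarize_task_dir` is hypothetical, and the demo builds a throwaway directory instead of touching a real run.

```python
import tempfile
from pathlib import Path


def summarize_task_dir(work_dir, tail_lines=5):
    """Return exit code, script text, and the last stderr lines of a task.

    Hypothetical helper: it only reads files found in the given directory
    and reports None (or an empty list) for anything that is missing.
    """
    work = Path(work_dir)

    def read(name):
        path = work / name
        return path.read_text(errors="replace") if path.is_file() else None

    exit_text = (read(".exitcode") or "").strip()
    stderr = read(".command.err") or ""
    return {
        "exit_code": int(exit_text) if exit_text.isdigit() else None,
        "script": read(".command.sh"),
        "stderr_tail": stderr.splitlines()[-tail_lines:],
    }


if __name__ == "__main__":
    # Demo on a temporary directory that mimics the failed task above.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, ".exitcode").write_text("3")
        Path(tmp, ".command.sh").write_text("# placeholder script\n")
        Path(tmp, ".command.err").write_text(
            "BPH to proks: 5.88253%\n"
            "Error: (CException::eUnknown) Too many protein hits to proks (GP-23178)\n"
        )
        print(summarize_task_dir(tmp))
```

For a real run, the argument would be the path printed under "Work dir:" in the Nextflow error message.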


pstrope · 2024-11-26T02:14:08Z

Hi,
This error

Error: (CException::eUnknown) Too many protein hits to proks (GP-23178)
Error: (106.16) Application's execution failed (CException::eUnknown) Too many protein hits to proks (GP-23178)

means your genome assembly is contaminated with prokaryote sequences. Please run FCS on it first to clean it up.
See https://github.com/ncbi/fcs/wiki

Pooja
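
To see what a run reported before this check failed, the percentage that `gnomon_biotype` prints just above the error can be pulled out of the log. The sketch below is an illustration only: `parse_bph_to_proks` is a hypothetical helper, the regular expressions are based solely on the log lines quoted in this thread, and the cutoff that `gnomon_biotype` applies is not stated here, so none is assumed.

```python
import re

# Patterns modelled on the log lines quoted in this thread.
BPH_RE = re.compile(r"BPH to proks:\s*([0-9]+(?:\.[0-9]+)?)%")
ERR_RE = re.compile(r"Too many protein hits to proks \((GP-\d+)\)")


def parse_bph_to_proks(log_text):
    """Return (percentage, error_tag); each is None if not found in the log."""
    bph = BPH_RE.search(log_text)
    err = ERR_RE.search(log_text)
    return (
        float(bph.group(1)) if bph else None,
        err.group(1) if err else None,
    )


if __name__ == "__main__":
    sample = (
        "Initialized 70 patterns for attr_rule=869.\n"
        "BPH to proks: 5.88253%\n"
        "Error: (CException::eUnknown) Too many protein hits to proks (GP-23178)\n"
    )
    percent, tag = parse_bph_to_proks(sample)
    print(f"BPH to proks = {percent}%, error tag = {tag}")
```

Comparing this value between a failing and a succeeding run of the same assembly is one way to tell whether the input or the pipeline changed.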

bismarck1008 · 2024-11-26T05:02:32Z

Thank you very much. But if I type in the class number of the odd-footed species, it works. And the genome uses data from the NCBI database, so there should be no contamination.


etvedte · 2024-11-26T15:49:48Z

Hi,

genome uses data from the NCBI database, so there should be no contamination.

Can you provide your input yaml input_D_farinae_small.local5.yaml for this run? I want to know how you are running this and what the source genome is to produce this log.

But if I type in the class number of the odd-footed species, it works.

What do you mean by this?

Eric

bismarck1008 · 2024-11-26T23:46:43Z


genome: /data/dell/CNI.2024.10.5/1.deer.ref/GCA_040085125.1_ASM4008512v1_genomic.fna
taxid: 9863
reads:
- /data/dell/raw_data/2024.9.23_Deer/rna/PRJNA778640_trim/clean.SRR16848205_R1.fq.gz
- /data/dell/raw_data/2024.9.23_Deer/rna/PRJNA778640_trim/clean.SRR16848205_R2.fq.gz

During a successful run, I mistakenly changed the taxid to 1618199.

murphyte · 2024-11-27T12:38:56Z

Thank you very much. But if I type in the class number of the odd-footed species, it works. And the genome uses data from the NCBI database, so there should be no contamination.

Public assemblies aren't necessarily clean of contamination, but in this case we're not reporting anything:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/040/085/125/GCA_040085125.1_ASM4008512v1/GCA_040085125.1_ASM4008512v1_fcs_report.txt

and given the nature of the assembly (HiFi, very few contigs) and species, I agree it's unlikely that there's much if any residual contamination.

During a successful run, I mistakenly changed the taxid to 1618199.

I'm surprised that had an effect. The only thing I can think of that it would influence is the starting hmm parameters, which could ultimately result in slightly different final parameters but the training cycle should have resulted in minimal differences. We'll investigate further. Thanks for the report.

murphyte · 2024-11-27T13:11:47Z

Also, could you clarify some details about your config?
The config name "input_D_farinae_small.local5.yaml" is similar to our test config, but the output is named for GCA_040085125.1 (a deer), and the log also printed "Taxon is invertebrate or plant - will allow more coding models".

Which taxid did your original run use? GCA_040085125.1 is txid 9863.
What is the FASTA? Is it the sika deer assembly GCA_040085125.1?

If your original run was on the deer GCA_040085125.1 FASTA, using txid 6954 (a tick), then it would have skipped hmm parameter training and used the tick hmm parameters directly. That could have a fairly large difference compared to using txid 1618199, especially at this stage.
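
The mismatch described above (a config name borrowed from the test data, a deer FASTA, and an invertebrate message in the log) is the kind of thing a quick pre-flight check can catch. Here is a minimal sketch, assuming the flat `genome:` / `taxid:` layout of the input YAML shown in this thread. `read_top_level` and `check_taxid` are hypothetical names, and the parser handles only simple top-level `key: value` lines, not general YAML.

```python
def read_top_level(yaml_text):
    """Collect top-level 'key: value' pairs from a simple, flat YAML file."""
    fields = {}
    for line in yaml_text.splitlines():
        # Skip blanks, comments, list items, and indented lines.
        if not line.strip() or line[0] in " \t#-":
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def check_taxid(yaml_text, expected_taxid):
    """Return a warning string if the taxid looks wrong, otherwise None."""
    taxid = read_top_level(yaml_text).get("taxid")
    if taxid is None:
        return "taxid is missing"
    if not taxid.isdigit():
        return f"taxid {taxid!r} is not a number"
    if int(taxid) != expected_taxid:
        return f"taxid {taxid} differs from expected {expected_taxid}"
    return None


if __name__ == "__main__":
    config = (
        "genome: GCA_040085125.1_ASM4008512v1_genomic.fna\n"
        "taxid: 1618199\n"
        "reads:\n"
        "- clean.SRR16848205_R1.fq.gz\n"
        "- clean.SRR16848205_R2.fq.gz\n"
    )
    # 9863 is the txid given in this thread for GCA_040085125.1.
    print(check_taxid(config, expected_taxid=9863))
```

The expected taxid has to come from the user (for example from the assembly record); the sketch does not look it up.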

bismarck1008 · 2024-11-27T13:33:04Z

The parameters that happened to run successfully are:

genome: /data/dell/CNI.2024.10.5/1.deer.ref/GCA_040085125.1_ASM4008512v1_genomic.fna
taxid: 1618199
reads:

- /data/dell/raw_data/2024.9.23_Deer/rna/PRJNA778640_trim/clean.SRR16848205_R1.fq.gz
- /data/dell/raw_data/2024.9.23_Deer/rna/PRJNA778640_trim/clean.SRR16848205_R2.fq.gz

1618199 represents an animal of the genus Equus.
The FASTA is GCA_040085125.1_ASM4008512v1_genomic.fna for sika deer.
One more thing to add: the network here is unstable, so it is unclear whether the automatically downloaded data is complete.

pstrope · 2024-12-03T17:47:05Z

Hi @bismarck1008, we were able to reproduce your error. We have traced the bug and will be releasing a new version with the bug fix shortly. Thank you for your patience, and also for reporting this so that we can make EGAPx better!

bismarck1008 · 2024-12-03T23:57:54Z

It is a pleasure to participate in this meaningful work, and thank you very much for your efforts.

hjsbio · 2024-12-19T03:32:53Z

Dear all,
I have encountered a similar problem:
Error: (CException::eUnknown) Too many protein hits to proks (GP-23178)
Error: (106.16) Application execution failed (CException::eUnknown) Too many protein hits to proks (GP-23178).
Since the new version has not been released yet, could you please provide a workaround for this issue? Alternatively, could you let me know when the new version (container image) will be published?

murphyte · 2024-12-19T12:55:59Z

Unfortunately we don't have a workaround for v0.3.1. We may be able to release a new version soon, although the holidays and some other issues outside of our control may cause some delays.

fmassive · 2024-12-24T01:47:37Z

Subject: Issue Encountered While Annotating the Red Deer Genome Using Transcriptome Data

Dear Team,

I hope this message finds you well. I encountered an issue while attempting to annotate the red deer genome (GCA_910594005.1) using transcriptome data obtained from the genome's full annotation report (https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/Cervus_elaphus/100/). I downloaded the relevant files locally and followed the usual steps for genome annotation, but I ran into an error during execution.

Here’s a summary of the error I encountered:
Creating classifier.

Classifier internal state for EGAPx Test Assembly:
0: 407239/310=1313.67 407239/871=467.553
1: 1.01014e+06/3442=293.474 1.01014e+06/7073=142.816
M=[285 311; 229 2901]; PPV=0.553398; NPV=0.902894; ACC=0.854843

Allowing locusType-631 models: true
Initialized 10 patterns for attr_rule=538.
Initialized 36 patterns for attr_rule=489.
Initialized 6 patterns for attr_rule=989.
Initialized 11 patterns for attr_rule=986.
Initialized 6 patterns for attr_rule=987.
Initialized 5 patterns for attr_rule=988.
Outputting.
Initialized 70 patterns for attr_rule=869.
BPH to proks: 6.04947%
Error: (CException::eUnknown) Too many protein hits to proks (GP-23178)
Error: (106.16) Application's execution failed (CException::eUnknown) Too many protein hits to proks (GP-23178).

Additionally, here are the parameters I used for the annotation:

genome: /mnt/z/libin/BCLgenome/04Geneprediction/10NCBIanoation/RedDeer.fa
taxid: 9860
reads:
- SRR8002944_f1.fastq
- SRR8002944_r2.fastq
- SRR8002957_f1.fastq
- SRR8002957_r2.fastq
annotation_provider: BinLi
annotation_name_prefix: reddeer

Interestingly, I was able to successfully run the annotation with the sheep transcriptome on the sheep genome, which makes me wonder if this issue could be specific to cervid genomes. One possible explanation I’m considering is that the red deer genome might have undergone soft masking, which could affect the alignment and annotation process. Alternatively, the transcriptome data might not have been adequately quality-controlled before use.
I would greatly appreciate any insight or suggestions you may have to help resolve this issue.

Thank you very much for your time and assistance.

Best regards
