ERROR ~ Error executing process > 'egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype' #61

Open
bismarck1008 opened this issue Nov 26, 2024 · 12 comments

@bismarck1008

python3 /data/bio-software/egapx/ui/egapx.py ../data/input_D_farinae_small.local5.yaml -e docker -o GCA_040085125.1_ASM4008512v1_out2 -w ./temp

[6f/72e38f] process > egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype [100%] 4 of 4, failed: 4, retries: 3 ✘

ERROR ~ Error executing process > 'egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype'

Caused by:
Process egapx:annot_proc_plane:gnomon_biotype:run_gnomon_biotype terminated with an error exit status (3)

Command executed:

mkdir -p output
mkdir -p ./asncache/
prime_cache -cache ./asncache/ -ifmt asnb-seq-entry -i swissprot.asnb -oseq-ids spids -split-sequences
prime_cache -cache ./asncache/ -ifmt asnb-seq-entry -i gnomon_wnode.out -oseq-ids gnids -split-sequences
lds2_indexer -source genome/ -db LDS2
echo "hits.diamond.asn" > raw_blastp_hits.mft
merge_blastp_hits -asn-cache ./asncache/ -nogenbank -lds2 LDS2 -input-manifest raw_blastp_hits.mft -o prot_hits.asn
echo "gnomon_wnode.out" > models.mft
echo "prot_hits.asn" > prot_hits.mft
echo "" > splices.mft
if [ -z "" ]
then
gnomon_biotype -gc gencoll.asn -asn-cache ./asncache/ -lds2 ./LDS2 -nogenbank -gnomon_models models.mft -o output/biotypes.tsv -o_prots_rpt output/prots_rpt.tsv -prot_hits prot_hits.mft -prot_splices splices.mft -reftrack-server 'NONE' -allow_lt631 true
else
gnomon_biotype -gc gencoll.asn -asn-cache ./asncache/ -lds2 ./LDS2 -nogenbank -gnomon_models models.mft -o output/biotypes.tsv -o_prots_rpt output/prots_rpt.tsv -prot_denylist -prot_hits prot_hits.mft -prot_splices splices.mft -reftrack-server 'NONE' -allow_lt631 true
fi

Command exit status:
3

Command output:
(empty)

Command error:
Prefetching 4358 bioseqs
Prefetching 4605 bioseqs
Prefetching 5119 bioseqs
Prefetching 5021 bioseqs
Prefetching 4387 bioseqs
Prefetching 4726 bioseqs
Prefetching 4029 bioseqs
Prefetching 4442 bioseqs
Prefetching 4865 bioseqs
Prefetching 3001 bioseqs
Prefetching 4831 bioseqs
Prefetching 4484 bioseqs
Prefetching 2711 bioseqs
Prefetching 1272 bioseqs
Prefetching 1170 bioseqs
Second-pass: computing bestness scores

Starting.
Fetching Gnomon model data.
Loading GC-Assembly.
Taxon is invertebrate or plant - will allow more coding models
Loading protein hits
Skipped 19932 protein hits without corresponding CDS features
Processed 274631 hits; accepted 141526; 24500 are RBPH
Loading protein data.
Retrieving attributes for 43024 prots
Fetching next batch of 10000
Fetching next batch of 10000
Fetching next batch of 10000
Fetching next batch of 10000
Creating classifier.

Classifier internal state for EGAPx Test Assembly:
0: 907233/264=3436.49 907233/1442=629.149
1: 1.06183e+06/3514=302.172 1.06183e+06/7267=146.117
M=[730 326; 398 2826]; PPV=0.64659; NPV=0.896289; ACC=0.830647

Allowing locusType-631 models: true
Initialized 10 patterns for attr_rule=538.
Initialized 36 patterns for attr_rule=489.
Initialized 6 patterns for attr_rule=989.
Initialized 11 patterns for attr_rule=986.
Initialized 6 patterns for attr_rule=987.
Initialized 5 patterns for attr_rule=988.
Outputting.
Initialized 70 patterns for attr_rule=869.
BPH to proks: 5.88253%
Error: (CException::eUnknown) Too many protein hits to proks (GP-23178)
Error: (106.16) Application's execution failed (CException::eUnknown) Too many protein hits to proks (GP-23178)

Work dir:
/data/dell/CNI.2024.10.5/2.anotation/test/temp/6f/72e38fd113038b2726553d1e7b22e0

Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run

-- Check 'GCA_040085125.1_ASM4008512v1_out2/nextflow.log' file for details
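
For comparing runs, the "BPH to proks" percentage can be pulled out of the captured stderr. This is a minimal sketch, assuming the log text matches the lines shown above; the function name `bph_to_proks` is hypothetical, and the threshold at which gnomon_biotype aborts is not stated in this thread.

```python
import re

# Matches lines such as "BPH to proks: 5.88253%" in the gnomon_biotype stderr.
_BPH_RE = re.compile(r"BPH to proks:\s*([0-9]*\.?[0-9]+)\s*%")


def bph_to_proks(log_text):
    """Return every 'BPH to proks' percentage found in log_text as floats."""
    return [float(m.group(1)) for m in _BPH_RE.finditer(log_text)]


# Excerpt copied from the error output above.
sample = """Outputting.
Initialized 70 patterns for attr_rule=869.
BPH to proks: 5.88253%
Error: (CException::eUnknown) Too many protein hits to proks (GP-23178)
"""

print(bph_to_proks(sample))
```

Nextflow keeps each task's stderr in `.command.err` inside the work dir, so the same function could be applied to the contents of that file.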

@pstrope
Contributor

pstrope commented Nov 26, 2024

Hi,
This error

Error: (CException::eUnknown) Too many protein hits to proks (GP-23178)
Error: (106.16) Application's execution failed (CException::eUnknown) Too many protein hits to proks (GP-23178)

means your genome assembly is contaminated with prokaryote sequences. Please run FCS on it first to clean it up.
See https://github.com/ncbi/fcs/wiki

Pooja

@bismarck1008
Author

Thank you very much. But if I type in the class number of the odd-footed species, it works. And the genome uses data from the NCBI database, so there should be no contamination.

@etvedte
Contributor

etvedte commented Nov 26, 2024

Hi,

> genome uses data from the NCBI database, so there should be no contamination.

Can you provide your input yaml input_D_farinae_small.local5.yaml for this run? I want to know how you are running this and what the source genome is to produce this log.

> But if I type in the class number of the odd-footed species, it works.

What do you mean by this?

Eric

@bismarck1008
Author

genome: /data/dell/CNI.2024.10.5/1.deer.ref/GCA_040085125.1_ASM4008512v1_genomic.fna
taxid: 9863
reads:
  - /data/dell/raw_data/2024.9.23_Deer/rna/PRJNA778640_trim/clean.SRR16848205_R1.fq.gz
  - /data/dell/raw_data/2024.9.23_Deer/rna/PRJNA778640_trim/clean.SRR16848205_R2.fq.gz

During a successful run, I mistakenly changed the taxid to 1618199.

@murphyte

> Thank you very much. But if I type in the class number of the odd-footed species, it works. And the genome uses data from the NCBI database, so there should be no contamination.

Public assemblies aren't necessarily clean of contamination, but in this case we're not reporting anything:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/040/085/125/GCA_040085125.1_ASM4008512v1/GCA_040085125.1_ASM4008512v1_fcs_report.txt

and given the nature of the assembly (HiFi, very few contigs) and species, I agree it's unlikely that there's much if any residual contamination.

> During a successful run, I mistakenly changed the taxid to 1618199

I'm surprised that had an effect. The only thing I can think of that it would influence is the starting HMM parameters, which could ultimately yield slightly different final parameters, but the training cycle should have kept the differences minimal. We'll investigate further. Thanks for the report.

@murphyte

Also, could you clarify some details about your config?
I see the config name "input_D_farinae_small.local5.yaml", which is similar to our test config, but the output is named for GCA_040085125.1 (a deer). The log also reports "Taxon is invertebrate or plant - will allow more coding models".

  1. Which taxid did your original run use? GCA_040085125.1 is txid 9863.
  2. What's the FASTA? Is it for sika deer GCA_040085125.1?

If your original run was on the deer GCA_040085125.1 FASTA, using txid 6954 (a mite, as in our D. farinae test config), then it would have skipped HMM parameter training and used the mite HMM parameters directly. That could make a fairly large difference compared to using txid 1618199, especially at this stage.
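
Since the outcome here depended on which taxid the config carried, it may help to check that value before launching. This is a minimal sketch, assuming the config is the flat YAML shown in this thread; `read_taxid` is a hypothetical helper, not part of EGAPx, and it uses a regex rather than a YAML parser to stay dependency-free. The file names in the sample are shortened for illustration.

```python
import re


def read_taxid(yaml_text):
    """Return the integer after a top-level 'taxid:' key, or None if absent."""
    m = re.search(r"^taxid:\s*(\d+)\s*$", yaml_text, flags=re.MULTILINE)
    return int(m.group(1)) if m else None


# Illustrative config text modelled on the one posted in this thread.
config = """genome: GCA_040085125.1_ASM4008512v1_genomic.fna
taxid: 9863
reads:
  - clean.SRR16848205_R1.fq.gz
  - clean.SRR16848205_R2.fq.gz
"""

expected = 9863  # txid given for GCA_040085125.1 in this thread
taxid = read_taxid(config)
if taxid == expected:
    print("taxid matches the expected value")
else:
    print(f"taxid {taxid} differs from expected {expected}")
```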

@bismarck1008
Author

The parameters that happened to run successfully are:

genome: /data/dell/CNI.2024.10.5/1.deer.ref/GCA_040085125.1_ASM4008512v1_genomic.fna
taxid: 1618199
reads:
  - /data/dell/raw_data/2024.9.23_Deer/rna/PRJNA778640_trim/clean.SRR16848205_R1.fq.gz
  - /data/dell/raw_data/2024.9.23_Deer/rna/PRJNA778640_trim/clean.SRR16848205_R2.fq.gz

Taxid 1618199 is an animal of the genus Equus.
The FASTA is GCA_040085125.1_ASM4008512v1_genomic.fna, for sika deer.
I should also add that the network here is unstable, so it is unclear whether the automatically downloaded data is complete.

@pstrope
Contributor

pstrope commented Dec 3, 2024

Hi @bismarck1008, we were able to reproduce your error. We have traced the bug and will be releasing a new version with the fix shortly. Thank you for your patience, and also for reporting it so that we can make EGAPx better!

@bismarck1008
Author

It is a pleasure to participate in this meaningful work, and thank you very much for your efforts.

@hjsbio

hjsbio commented Dec 19, 2024

Dear all,
I have encountered a similar problem:
Error: (CException::eUnknown) Too many protein hits to proks (GP-23178)
Error: (106.16) Application execution failed (CException::eUnknown) Too many protein hits to proks (GP-23178).

Since the new version has not been released yet, could you please provide a workaround for this issue? Alternatively, could you let me know when the new version (container image) will be published?

@murphyte

Unfortunately we don't have a workaround for v0.3.1. We may be able to release a new version soon, although the holidays and some other issues outside of our control may cause some delays.

@fmassive

Subject: Issue Encountered While Annotating the Red Deer Genome Using Transcriptome Data

Dear Team,

I hope this message finds you well. I encountered an issue while attempting to annotate the red deer genome (GCA_910594005.1) using transcriptome data obtained from the genome's full annotation report (https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/Cervus_elaphus/100/). I downloaded the relevant files locally and followed the usual steps for genome annotation, but I ran into an error during execution.

Here’s a summary of the error I encountered:
Creating classifier.

Classifier internal state for EGAPx Test Assembly:
0: 407239/310=1313.67 407239/871=467.553
1: 1.01014e+06/3442=293.474 1.01014e+06/7073=142.816
M=[285 311; 229 2901]; PPV=0.553398; NPV=0.902894; ACC=0.854843

Allowing locusType-631 models: true
Initialized 10 patterns for attr_rule=538.
Initialized 36 patterns for attr_rule=489.
Initialized 6 patterns for attr_rule=989.
Initialized 11 patterns for attr_rule=986.
Initialized 6 patterns for attr_rule=987.
Initialized 5 patterns for attr_rule=988.
Outputting.
Initialized 70 patterns for attr_rule=869.
BPH to proks: 6.04947%
Error: (CException::eUnknown) Too many protein hits to proks (GP-23178)
Error: (106.16) Application's execution failed (CException::eUnknown) Too many protein hits to proks (GP-23178).

Additionally, here are the parameters I used for the annotation:

genome: /mnt/z/libin/BCLgenome/04Geneprediction/10NCBIanoation/RedDeer.fa
taxid: 9860
reads:
  - SRR8002944_f1.fastq
  - SRR8002944_r2.fastq
  - SRR8002957_f1.fastq
  - SRR8002957_r2.fastq
annotation_provider: BinLi
annotation_name_prefix: reddeer

Interestingly, I was able to successfully run the annotation with the sheep transcriptome on the sheep genome, which makes me wonder if this issue could be specific to cervid genomes. One possible explanation I’m considering is that the red deer genome might have undergone soft masking, which could affect the alignment and annotation process. Alternatively, the transcriptome data might not have been adequately quality-controlled before use.
I would greatly appreciate any insight or suggestions you may have to help resolve this issue.

Thank you very much for your time and assistance.

Best regards
