Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: permanentFail during job blast #44

Open
egonozer opened this issue Jun 23, 2023 · 4 comments
Open

[BUG]: permanentFail during job blast #44

egonozer opened this issue Jun 23, 2023 · 4 comments

Comments

@egonozer
Copy link

Describe the bug
I am running fcs-adaptor.sh v0.4.0 using Singularity on several genome assemblies generated using SPAdes v3.15.5. Nearly all of them run without error, but I consistently get a permanentFail error for some right after the [job blast] $ vecscreen \ ... command is run. I was getting the same error with fcs-adaptor v0.2.3, but upgrading to v0.4.0 didn't help. When I modified the fcs-adaptor.sh script to include the --debug statement, the last directory in the debug folder named contains the vecscreen.log file which has the following error message:

91501/000/0000/P  CF98656D495AB571 0003/0003 2023-06-23T09:25:28.132666 quser31         UNK_CLIENT      UNK_SESSION              vecscreen Error: VECSCREEN "vecscreen_app.cpp", line 149: CVecScreenApp::Run() --- Blast Error: Near line 0, the local id is too long.  Its length is 51 but the maximum allowed local id length is 50.  Please find and correct all local ids that are too long.

Looking at the file in the debug folder that was used as input for vecscreen, split_fasta.fna, here are the first few sequence headers:

>lcl|NODE_1_length_1090171_cov_39.792984__1_500000
>lcl|NODE_1_length_1090171_cov_39.792984__499501_999500
>lcl|NODE_1_length_1090171_cov_39.792984__999001_1090171
>lcl|NODE_2_length_887790_cov_41.364514__1_500000
>lcl|NODE_2_length_887790_cov_41.364514__499501_887790
>lcl|NODE_3_length_557874_cov_41.529717__1_500000
>lcl|NODE_3_length_557874_cov_41.529717__499501_557874
>NODE_4_length_463023_cov_41.874365
>NODE_5_length_417001_cov_40.711275
>NODE_6_length_369796_cov_39.096108

So it looks like the output of the fasta_split application is generating some sequence IDs that are too long for vecscreen. Is this something that can be modified in fcs-adapator? I really don't want to have to take the step of renaming all my input files with shorter sequence identifiers before running them through fcs_adaptor.

Thanks!

To Reproduce
./run_fcsadaptor.sh --fasta-input ../spades/1148/contigs.fasta --output-dir test --prok --container-engine singularity --image fcsadaptor_0.4.0/fcs-adaptor.sif

Software versions (please complete the following information):

  • RHEL 7.9 (Maipo)
  • Singularity version 3.8.1
  • Docker or Singularity FCS image version:
$ singularity inspect fcsadaptor_0.4.0/fcs-adaptor.sif
org.label-schema.build-date: Friday_18_November_2022_19:10:40_UTC
org.label-schema.schema-version: 1.0
org.label-schema.usage.singularity.deffile.bootstrap: docker
org.label-schema.usage.singularity.deffile.from: us-east4-docker.pkg.dev/ncbi-seqplus-rodr-build-res/ncbi-cgr/fcs/av_screen_x:develop-latest
org.label-schema.usage.singularity.version: 3.4.0-1

Log Files
Attached
debug.y6aphyfv.tar.gz

@etvedte
Copy link
Contributor

etvedte commented Jul 6, 2023

Hello,

Apologies for the delayed response. We currently have a work ticket documenting this issue and will update this GitHub issue when we have a fix in place. For the timeliest processing of all your assemblies, I would recommend shortening the seqids.

@ptrebert
Copy link

@etvedte
I believe I am also running into this - can you please update on what the status here is?

@ptrebert
Copy link

Following @egonozer example, I also added the debug switch and can confirm that it's the same problem.

I already raised some logging-related remarks in another issue ( #63 ) and I think this also applies here: the log message actually explaining what's wrong should by all means show up in an easily accessible and obvious log file / in captured stderr. It seems that in this case all the relevant log output is send to temporary locations that are being deleted after the job failed.

@etvedte
Copy link
Contributor

etvedte commented Mar 15, 2024

Hello @ptrebert,

I looked in to this some more. I was able to reproduce this specific issue. Apparently it only occurs when the vecscreen BLAST search identifies a hit in a chunked region when the original seq_id plus the coordinates from fasta_split exceed 50 characters. In other words, if your sequence is clean, then so long as the original seq_ids are <=50 you should be good. You could try to confirm this observation by shortening the seq_ids and then running FCS-adaptor. If you don't see any contamination calls in that situation, I would be surprised.

As far as remedying this issue, it is a work ticket with lower priority since there is a workaround (i.e. modifying the FASTA headers). Are you using SPAdes, or something else? As far as I'm aware most assemblers should be producing more modest-sized headers. Submissions to NCBI also require seq_ids <= 50 characters. While we do advertise a benefit of these tools is that you can screen without being tied to an NCBI submission, ultimately some of the requirements for submission screening were lifted for these tools.

As far as the logging issue, we can look into how we can better report this. There is the FASTA validator step that prints errors to screen when the original non-split seq_id headers are >50, but as mentioned above this is a different issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants