[BUG] SIGBUS error on HPC using singularity #264
Thank you for your report, Adrian. We will have a look at this at the earliest opportunity. I presume this was not a standard NCBI-provided example input, correct?
Thanks for your feedback. Yes indeed, it's not a provided example input. However, it is a genome from a Helicobacter pylori strain that has been successfully annotated multiple times on different machines using PGAP before.
Thanks. Would you mind posting the range of CPU and memory parameters that you varied?
Sure. And since the Slurm options --mem-per-cpu and --mem are mutually exclusive, we also tried:
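(The specific values tried were not preserved in this export. As a general illustration of the mutual exclusivity mentioned here, sbatch accepts a memory request per node or per CPU, but not both in the same job; the values below are hypothetical:)

```shell
# Per-node memory request:
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# OR a per-CPU memory request; --mem-per-cpu and --mem are
# mutually exclusive, so a job may specify only one of them:
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=4G
```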
Thanks. Could you please confirm that in all cases the first occurrence of permanentFail was on "job actual"?
Yes, it always encounters a SIGBUS error when the command line:
is executed, throwing this error:
@AdrianZrm,
Hello George @george-coulouris, I am able to share my input assembly, but I double-checked, and we can't even pass the test genome ASM2732v1 (Mycoplasmoides genitalium G37) on the cluster. I am afraid the problem is not related to our input assembly... Either:
gives the same output:
We're checking with some other labs that managed to make PGAP work on their cluster with Singularity to see what our issue could be. I'll follow up here if we find anything on our side. Regards
Thanks for the update. We haven't tested on Debian 12 yet, so we'll try that on our end as well.
Hello,
I'm trying to run PGAP on an HPC cluster using Singularity + Slurm and I'm running into some trouble.
While PGAP installs and runs fine with our "test genome" on the main machine that dispatches Slurm jobs to the HPC nodes, it crashes when we submit our PGAP script with Slurm to any node via this machine...
The error I'm experiencing seems to be a memory-related issue. Here is the part of the cwltool.log file where the memory problem is described:
Bus error (Nonexisting physical address [0x7feb076e6090])
Here is the slurm script we use to submit our job to the nodes:
Unfortunately, changing the number of CPUs and the amount of memory makes no difference.
The HPC cluster is running Debian GNU/Linux 12 (bookworm), Singularity via Apptainer version 1.1.9-1.el9, and slurm-wlm 22.05.8.
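The submission script itself did not survive this page export. As a rough sketch only (job name, resource values, and paths below are illustrative assumptions, not the reporter's actual settings), a Slurm script for this kind of run typically looks like:

```shell
#!/bin/bash
#SBATCH --job-name=pgap
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8     # one of the values varied between attempts
#SBATCH --mem=32G             # or --mem-per-cpu=...; the two are mutually exclusive
#SBATCH --output=pgap_%j.log

# Assumes pgap.py was installed beforehand and set up to use the
# Singularity (Apptainer) image rather than Docker or Podman.
# -r enables usage reporting; -o names the output directory.
./pgap.py -r -o results input.yaml
```

A script like this would be submitted from the dispatch machine with sbatch, landing the pgap.py invocation (and the SIGBUS) on a compute node.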
Please find attached the cwltool.log along with the tmp-outdir folder:
tmp-outdir.zip
cwltool.log
I can't use podman or docker on the cluster. Do you have any ideas or hints as to what I can do to make this work?
Best regards,
Adrian