MPI tests fail with Segmentation Fault in Extrae 4.1.6 #105

Open
julianmorillo opened this issue Jun 13, 2024 · 7 comments

julianmorillo commented Jun 13, 2024

On Ubuntu 22.04.1 with OpenMPI/4.1.6-GCC-13.2.0, make check fails on all 21 MPI tests. Each test ends with a segmentation fault, as the test-suite log shows:

{EESSI 20240402} jmorillo@arriesgado-2 /tmp/eb/easybuild/build/Extrae/4.1.6/gompi-2023b/extrae-4.1.6 $ cat tests/functional/tracer/MPI/test-suite.log
==============================================================
   Extrae 4.1.6: tests/functional/tracer/MPI/test-suite.log
==============================================================

# TOTAL: 21
# PASS:  0
# SKIP:  0
# XFAIL: 0
# FAIL:  21
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: mpi_initfini_c_linked_1proc.sh
====================================

Welcome to Extrae 4.1.6
Extrae: Parsing the configuration file (extrae.xml) begins
Extrae: Tracing package is located on /home/harald/aplic/extrae/3.3.0rc
Extrae: Generating intermediate files for Paraver traces.
Extrae: MPI routines will NOT collect HW counters information.
Extrae: Dynamic memory instrumentation is disabled.
Extrae: Basic I/O memory instrumentation is disabled.
Extrae: System calls instrumentation is disabled.
Extrae: Parsing the configuration file (extrae.xml) has ended
Extrae: Intermediate traces will be stored in /tmp/eb/easybuild/build/Extrae/4.1.6/gompi-2023b/extrae-4.1.6/tests/functional/tracer/MPI
Extrae: Tracing mode is set to: Detail.
Extrae: Successfully initiated with 1 tasks and 1 threads

./trace-static.sh: line 9: 91270 Segmentation fault      (core dumped) $*
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[24861,1],0]
  Exit code:    139
--------------------------------------------------------------------------
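
For context, exit code 139 is consistent with the segmentation fault reported by trace-static.sh: shells such as bash report a process killed by signal N as 128 + N, and SIGSEGV is signal 11 on Linux. The C sketch below only illustrates that arithmetic. The helper names shell_style_exit_code and run_crashing_child are made up for this example and are not part of Extrae or Open MPI.

```c
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: convert a waitpid() status into the exit code
 * that a shell such as bash would report for the same process. */
int shell_style_exit_code(int status)
{
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);  /* killed by a signal */
    if (WIFEXITED(status))
        return WEXITSTATUS(status);     /* normal termination */
    return -1;                          /* other cases are not handled here */
}

/* Hypothetical helper: fork a child that dies from SIGSEGV and return
 * its raw wait status, or -1 if fork() or waitpid() fails. */
int run_crashing_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit no_core = {0, 0};
        setrlimit(RLIMIT_CORE, &no_core);  /* do not leave a core file behind */
        signal(SIGSEGV, SIG_DFL);          /* make sure the default action applies */
        raise(SIGSEGV);
        _exit(1);                          /* not reached */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

On Linux, shell_style_exit_code(run_crashing_child()) evaluates to 139 (128 + 11), the same value mpirun prints above.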

gllort commented Jun 13, 2024

I would suggest upgrading to version 4.1.7 first; it includes a critical bug fix related to MPI tracing. Can you please try 4.1.7 and let us know whether it fixes the issue?

julianmorillo commented Jun 14, 2024

Unfortunately, 4.1.7 does not fix it; the MPI tests still fail with the same segmentation fault. Let me know if there is anything else I can try (perhaps along the lines of commenting out the call Backend_Flush_pThread (pthread_self()); as I did for the PTHREAD test).

julianmorillo commented Jun 14, 2024

I ran the binary of the first MPI test under gdb with

jmorillo@arriesgado-6:~/arriesgado-jammy/extrae-4.1.7/tests/functional/tracer/MPI$ gdb --args .libs/mpi_initfini_c_linked

and obtained the following:

(gdb) run
Starting program: /home/jmorillo/arriesgado-jammy/extrae-4.1.7/tests/functional/tracer/MPI/.libs/mpi_initfini_c_linked
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/riscv64-linux-gnu/libthread_db.so.1".
Welcome to Extrae 4.1.7
Extrae: Application has been linked or preloaded with Extrae, BUT neither EXTRAE_ON nor EXTRAE_CONFIG_FILE are set!
[Detaching after fork from child process 36899]
[New Thread 0x3ff4bff060 (LWP 36903)]
[New Thread 0x3ff41b6060 (LWP 36904)]

Thread 3 "mpi_initfini_c_" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x3ff41b6060 (LWP 36904)]
0x0000003ff7fe48a4 in do_lookup_x (undef_name=undef_name@entry=0x3ff7e36830 <__func__.6> "writev", new_hash=new_hash@entry=633298886, old_hash=old_hash@entry=0x3ff41b50b8, ref=0x0, result=result@entry=0x3ff41b50c8, scope=0x3ff7fff560, i=1, version=version@entry=0x0, flags=flags@entry=0, skip=skip@entry=0x3ff7fd72b0, type_class=type_class@entry=0, undef_map=undef_map@entry=0x3ff7fd72b0) at ./elf/dl-lookup.c:363
363     ./elf/dl-lookup.c: No such file or directory.

I hope this helps to understand what the problem is. The backtrace shows:

(gdb) bt
#0  0x000000155555d8a4 in do_lookup_x (undef_name=undef_name@entry=0x1555619830 <__func__.6> "writev", new_hash=new_hash@entry=633298886,
    old_hash=old_hash@entry=0x1558ba00b8, ref=0x0, result=result@entry=0x1558ba00c8, scope=0x1555578560, i=1, version=version@entry=0x0, flags=flags@entry=0,
    skip=skip@entry=0x155557d2b0, type_class=type_class@entry=0, undef_map=undef_map@entry=0x155557d2b0) at ./elf/dl-lookup.c:363
#1  0x000000155555e008 in _dl_lookup_symbol_x (undef_name=0x1555619830 <__func__.6> "writev", undef_map=0x155557d2b0, ref=0x1558ba0188, symbol_scope=<optimized out>,
    version=0x0, type_class=<optimized out>, flags=<optimized out>, skip_map=0x155557d2b0) at ./elf/dl-lookup.c:860
#2  0x00000015558acf40 in do_sym (handle=<optimized out>, name=0x1555619830 <__func__.6> "writev", who=0x15555c4356 <writev+84>, vers=vers@entry=0x0, flags=flags@entry=2)
    at ./elf/dl-sym.c:146
#3  0x00000015558ad118 in _dl_sym (handle=<optimized out>, name=<optimized out>, who=<optimized out>) at ./elf/dl-sym.c:195
#4  0x0000001555828bbc in dlsym_doit (a=a@entry=0x1558ba04b8) at ./dlfcn/dlsym.c:40
#5  0x00000015558ac86e in __GI__dl_catch_exception (exception=exception@entry=0x1558ba03f0, operate=0x1555828baa <dlsym_doit>, args=0x1558ba04b8)
    at ./elf/dl-error-skeleton.c:208
#6  0x00000015558ac8fc in __GI__dl_catch_error (objname=0x1558ba0458, errstring=0x1558ba0460, mallocedp=0x1558ba0457, operate=<optimized out>, args=<optimized out>)
    at ./elf/dl-error-skeleton.c:227
#7  0x0000001555828776 in _dlerror_run (operate=operate@entry=0x1555828baa <dlsym_doit>, args=args@entry=0x1558ba04b8) at ./dlfcn/dlerror.c:138
#8  0x0000001555828c1a in dlsym_implementation (dl_caller=<optimized out>, name=0x1555619830 <__func__.6> "writev", handle=0xffffffffffffffff) at ./dlfcn/dlsym.c:54
#9  ___dlsym (handle=handle@entry=0xffffffffffffffff, name=name@entry=0x1555619830 <__func__.6> "writev") at ./dlfcn/dlsym.c:68
#10 0x00000015555c4356 in writev (fd=<optimized out>, iov=0x1558ba0548, iovcnt=<optimized out>) at io_wrapper.c:1188
#11 0x0000001558adb74e in pmix_ptl_base_send_handler () from /opt/pmix/4.2.0/lib/libpmix.so.2
#12 0x0000001555db522c in ?? () from /home/jmorillo/arriesgado-jammy/extrae-4.1.7/src/tracer/.libs/libmpitrace-4.1.7.so

gllort commented Jun 19, 2024

Can you please try reconfiguring Extrae with the flag "--disable-instrument-io"? The backtrace suggests there is a problem intercepting the syscall "writev". This option disables instrumentation of the whole family of I/O system calls, so the test will help us isolate the problem.
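
For readers unfamiliar with this kind of interception: frames #8 to #10 of the backtrace above show a wrapper named writev (io_wrapper.c:1188) calling dlsym with handle=0xffffffffffffffff, which is the value of RTLD_NEXT in glibc. The C sketch below shows that general pattern under stated assumptions. It is not Extrae's actual code, and intercepted_calls and demo_roundtrip are hypothetical names used only for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* glibc exposes RTLD_NEXT only with this macro */
#endif
#include <dlfcn.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RTLD_NEXT
/* Assumed fallback for glibc when feature macros were fixed by an earlier
 * include; it matches handle=0xffffffffffffffff in the backtrace. */
#define RTLD_NEXT ((void *) -1l)
#endif

/* Hypothetical counter standing in for the event a tracer would record. */
int intercepted_calls = 0;

/* Same prototype as libc's writev, so this definition takes precedence
 * over it for callers resolved against this object. */
ssize_t writev(int fd, const struct iovec *iov, int iovcnt)
{
    static ssize_t (*real_writev)(int, const struct iovec *, int) = NULL;

    if (real_writev == NULL) {
        /* Lazy lookup of the next "writev" in load order (normally libc).
         * This sketch does not guard the lookup against concurrent callers. */
        real_writev = (ssize_t (*)(int, const struct iovec *, int))
                          dlsym(RTLD_NEXT, "writev");
        if (real_writev == NULL)
            return -1;
    }
    intercepted_calls++;
    return real_writev(fd, iov, iovcnt);
}

/* Hypothetical demo: push 4 bytes through a pipe with writev and return
 * the number of bytes read back, or -1 on failure. */
ssize_t demo_roundtrip(void)
{
    int fds[2];
    char buf[8];
    struct iovec iov[2] = { { "ab", 2 }, { "cd", 2 } };

    if (pipe(fds) != 0)
        return -1;
    ssize_t written = writev(fds[1], iov, 2);
    ssize_t got = (written == 4) ? read(fds[0], buf, sizeof buf) : -1;
    close(fds[0]);
    close(fds[1]);
    return got;
}
```

Calling demo_roundtrip() from a main function goes through the wrapper once and forwards the write to libc. In the failing run, the crash happens inside the equivalent dlsym lookup, on a thread started by PMIx (frame #11), which is why disabling I/O instrumentation is a useful way to rule this path in or out.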

julianmorillo commented

Reconfiguring with "--disable-instrument-io" did not solve the issue; the test still fails with a segmentation fault:

FAIL: mpi_initfini_c_linked_1proc.sh
====================================

Welcome to Extrae 4.1.7
Extrae: Parsing the configuration file (extrae.xml) begins
Extrae: Tracing package is located on /home/harald/aplic/extrae/3.3.0rc
Extrae: Generating intermediate files for Paraver traces.
Extrae: <counters> tag at <MPI> level will be ignored. This library does not support CPU HW counters.
Extrae: Dynamic memory instrumentation is disabled.
Extrae: Basic I/O memory instrumentation is disabled.
Extrae: System calls instrumentation is disabled.
Extrae: Parsing the configuration file (extrae.xml) has ended
Extrae: Intermediate traces will be stored in /home/jmorillo/arriesgado-jammy/extrae-4.1.7/tests/functional/tracer/MPI
Extrae: Tracing mode is set to: Detail.
Extrae: Successfully initiated with 1 tasks and 1 threads

./trace-static.sh: line 9: 1822586 Segmentation fault      (core dumped) $*
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[54668,1],0]
  Exit code:    139
--------------------------------------------------------------------------

julianmorillo commented

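The backtrace looks different, though. The crash is still in do_lookup_x, but the symbol being resolved is now "pmix_bfrops_base_unpack" instead of "writev":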

jmorillo@arriesgado-5:~/arriesgado-jammy/extrae-4.1.7/tests/functional/tracer/MPI$ gdb --args .libs/mpi_initfini_c_linked
Reading symbols from .libs/mpi_initfini_c_linked...
(gdb) run
Starting program: /home/jmorillo/arriesgado-jammy/extrae-4.1.7/tests/functional/tracer/MPI/.libs/mpi_initfini_c_linked
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/riscv64-linux-gnu/libthread_db.so.1".
Welcome to Extrae 4.1.7
Extrae: Warning! EXTRAE_HOME has not been defined!.
Extrae: Generating intermediate files for Paraver traces.
Extrae: Intermediate files will be stored in /home/jmorillo/arriesgado-jammy/extrae-4.1.7/tests/functional/tracer/MPI
Extrae: Tracing buffer can hold 500000 events
Extrae: Tracing mode is set to: Detail.
Extrae: Successfully initiated with 1 tasks and 1 threads

[Detaching after fork from child process 1823937]
[New Thread 0x3fed744080 (LWP 1823941)]
[New Thread 0x3feccfb080 (LWP 1823942)]

Thread 3 "mpi_initfini_c_" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x3feccfb080 (LWP 1823942)]
0x0000003ff7fe4adc in do_lookup_x (undef_name=undef_name@entry=0x3fed57b7be "pmix_bfrops_base_unpack", new_hash=new_hash@entry=2551185065, old_hash=old_hash@entry=0x3feccfa148, ref=0x3fed570e88, result=result@entry=0x3feccfa158, scope=<optimized out>, i=24, version=version@entry=0x0, flags=flags@entry=5, skip=skip@entry=0x0, type_class=type_class@entry=1, undef_map=undef_map@entry=0x2aaab1a9d0) at ./elf/dl-lookup.c:431
431     ./elf/dl-lookup.c: No such file or directory.
(gdb) bt
#0  0x0000003ff7fe4adc in do_lookup_x (undef_name=undef_name@entry=0x3fed57b7be "pmix_bfrops_base_unpack", new_hash=new_hash@entry=2551185065,
    old_hash=old_hash@entry=0x3feccfa148, ref=0x3fed570e88, result=result@entry=0x3feccfa158, scope=<optimized out>, i=24, version=version@entry=0x0, flags=flags@entry=5,
    skip=skip@entry=0x0, type_class=type_class@entry=1, undef_map=undef_map@entry=0x2aaab1a9d0) at ./elf/dl-lookup.c:431
#1  0x0000003ff7fe5008 in _dl_lookup_symbol_x (undef_name=0x3fed57b7be "pmix_bfrops_base_unpack", undef_map=undef_map@entry=0x2aaab1a9d0, ref=ref@entry=0x3feccfa210,
    symbol_scope=<optimized out>, version=0x0, type_class=type_class@entry=1, flags=<optimized out>, skip_map=skip_map@entry=0x0) at ./elf/dl-lookup.c:860
#2  0x0000003ff7fe903c in _dl_fixup (l=0x2aaab1a9d0, reloc_arg=<optimized out>) at ./elf/dl-runtime.c:95
#3  0x0000003ff7fea88e in _dl_runtime_resolve () at ../sysdeps/riscv/dl-trampoline.S:61
Backtrace stopped: frame did not save the PC

julianmorillo commented

In case you have the chance to try it, the problem is easy to reproduce:

jmorillo@arriesgado-5:~/arriesgado-jammy/extrae-4.1.7$ module list
No Modulefiles Currently Loaded.
jmorillo@arriesgado-5:~/arriesgado-jammy/extrae-4.1.7$ ./configure --with-mpi=/apps/riscv/ubuntu/openmpi/4.1.5_gcc11.3.0/ --without-unwind --with-xml=/home/jmorillo/arriesgado-jammy/libxml2-v2.11.8-install --without-papi --enable-posix-clock --disable-instrument-io
jmorillo@arriesgado-5:~/arriesgado-jammy/extrae-4.1.7$ make
jmorillo@arriesgado-5:~/arriesgado-jammy/extrae-4.1.7$ make check
