ctest: chef3 test reported as passing when it has actually failed #468

Open
cwsmith opened this issue Dec 6, 2024 · 1 comment

cwsmith commented Dec 6, 2024

The failing chef3 test in develop @ 81440fa produces non-deterministic exit codes. Adding the mpich option -print-all-exitcodes to the mpirun command prints the exit code of each process. In most runs one process exits with a non-zero exit code and the other exits with zero, but in occasional runs both ranks return zero, which results in ctest reporting the test as passing.

One rank always writes PUMI error: Invalid rank in Comm_Pack (via reel_fail) and prints the stack (via reel_trace).

cmd)smithc11@checkers: /space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop $ ctest -V -R chef3 
UpdateCTestConfiguration  from :/space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/DartConfiguration.tcl
UpdateCTestConfiguration  from :/space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/DartConfiguration.tcl
Test project /space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 142
    Start 142: chef3

142: Test command: /opt/scorec/spack/rhel9/v0201_4/install/linux-rhel9-x86_64/gcc-12.3.0/mpich-4.1.1-xpoyz4tqgfxtrm6m7qq67q4ccp5pnlre/bin/mpirun "-print-all-exitcodes" "-np" "2" "/space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/test/chef"
142: Working Directory: /space/cwsmith/pumi30Release/core/pumi-meshes//phasta/2-1-Chef-Tet-Part/run_sim
142: Test timeout computed to be: 10000000
142: PUMI Git hash 3.0.1
142: PUMI version 3.0.1 Git hash 81440fa6deec1405baa7a2cd5cbcd1d82350c1bb
142: "../../translated-model.smd" and "../../geom.spj" loaded in 0.000657 seconds
142: mesh bz2:../../mesh/ loaded in 0.026479 seconds
142: number of tet 3297 hex 0 prism 5742 pyramid 0
142: mesh entity counts: v 3934 e 16507 f 21613 r 9039
142: 
142: MeshAdapt: version 2.0 !
142: 
142: MeshAdapt: marked 5742 layer elements in 0.006888 seconds
142: 
142: MeshAdapt: input mesh: checked layer quality in 0.002428 seconds: 0 unsafe elements
142: 
142: MeshAdapt: boundary layer converted to tets in 0.090908 seconds
142: 
142: MeshAdapt: worst element quality is 3.076889e-07
142: 
142: MeshAdapt: mesh adapted in 0.163930 seconds
142: number of tet 20523 hex 0 prism 0 pyramid 0
142: mesh entity counts: v 3934 e 25354 f 41944 r 20523
142:   - verifying tags: solution_ver
142: mesh verified in 0.035642 seconds
142: planned Zoltan split factor 2 to target imbalance 1.010000 in 0.031767 seconds
142: mesh expanded from 1 to 2 parts in 0.007889 seconds
142: PUMI error: Invalid rank in Comm_Pack
142: signal 6 caught by pcu
142: /space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/test/chef[0x13db00f]
142: /lib64/libc.so.6(+0x3e6f0)[0x7efe5263e6f0]
142: /lib64/libc.so.6(+0x8b94c)[0x7efe5268b94c]
142: /lib64/libc.so.6(raise+0x16)[0x7efe5263e646]
142: /lib64/libc.so.6(abort+0xd3)[0x7efe526287f3]
142: /space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/test/chef[0x13db10b]
142: /space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/test/chef[0x13db314]
142: /space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/test/chef[0x1366968]
142: /space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/test/chef[0x1367e0e]
142: /space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/test/chef[0x1368b8f]
142: /space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/test/chef[0x1368e49]
142: /space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/test/chef[0x132cc34]
142: /space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/test/chef[0x4b9692]
142: /space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/test/chef[0x4b9c84]
142: /space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/test/chef[0x4b79b7]
142: /lib64/libc.so.6(+0x29590)[0x7efe52629590]
142: /lib64/libc.so.6(__libc_start_main+0x80)[0x7efe52629640]
142: /space/cwsmith/pumi30Release/buildPumiOptonSimonOmegaoff_develop/test/chef[0x4b8625]
142: [[email protected]] Exit codes: [checkers.scorec.rpi.edu] 0,0
1/1 Test #142: chef3 ............................   Passed    0.40 sec

The following tests passed:
        chef3

100% tests passed, 0 tests failed out of 1

Total Test time (real) =   0.40 sec
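
To spot these masked failures across repeated runs, the captured output can be checked for the error message together with the printed exit codes. The following is only a minimal sketch: the helper names parse_exit_codes and is_false_pass are hypothetical, and it assumes a single host whose exit-code line looks like the one in the log above.

```python
import re

# Assumed format, copied from the log above:
#   "... Exit codes: [checkers.scorec.rpi.edu] 0,0"
# Only a single "[host] code,code,..." group is handled in this sketch.
EXIT_CODES_RE = re.compile(r"Exit codes:\s*\[[^\]]*\]\s*([-\d, ]+)")

# Message written by the failing rank in the log above.
ERROR_MARKER = "PUMI error:"


def parse_exit_codes(output):
    """Return the per-rank exit codes as a list of ints, or None if absent."""
    match = EXIT_CODES_RE.search(output)
    if match is None:
        return None
    return [int(code) for code in match.group(1).split(",") if code.strip()]


def is_false_pass(output):
    """True when an error was printed but every rank still exited with 0."""
    codes = parse_exit_codes(output)
    if codes is None:
        return False
    return ERROR_MARKER in output and all(code == 0 for code in codes)
```

Applied to the output of each run in a loop that reruns the mpirun command, is_false_pass would flag the runs that ctest counts as passing even though one rank reported the error.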
cwsmith added the bug label Dec 6, 2024

cwsmith commented Dec 6, 2024

Removing this call to PCU_Protect(...), which bypasses the 'pcu/reel/reel.c' signal handler, does not change the behavior:

pcu::Protect();
