cuda: fix build errors when built without openmp #53

Open
wants to merge 1 commit into
base: master

Contributor

@balay balay commented Feb 26, 2020

No description provided.

@balay
Contributor Author

balay commented Feb 26, 2020

This change gets the build going, but now I get runtime errors. I'm not sure I understand why.

The reference to pdgstrs2_omp is suspicious. Does it still require openmp?

Thread 1 "ex19" received signal SIGSEGV, Segmentation fault.
0x00007fffd408ded4 in _int_malloc () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install blas-3.8.0-13.fc31.x86_64 lapack-3.8.0-13.fc31.x86_64 libX11-1.6.9-2.fc31.x86_64 libXau-1.0.9-2.fc31.x86_64 libgcc-9.2.1-1.fc31.x86_64 libgfortran-9.2.1-1.fc31.x86_64 libquadmath-9.2.1-1.fc31.x86_64 libxcb-1.13.1-3.fc31.x86_64 xorg-x11-drv-nvidia-cuda-libs-440.59-1.fc31.x86_64 zlib-1.2.11-20.fc31.x86_64
(gdb) where
#0  0x00007fffd408ded4 in _int_malloc () from /lib64/libc.so.6
#1  0x00007fffd408e50f in _int_memalign () from /lib64/libc.so.6
#2  0x00007fffd408f5cc in _mid_memalign () from /lib64/libc.so.6
#3  0x00007fffd4090c56 in posix_memalign () from /lib64/libc.so.6
#4  0x00007ffff607f08f in superlu_malloc_dist (size=size@entry=36) at /home/balay/petsc/arch-ci-linux-cuda-double/externalpackages/git.superlu_dist/SRC/memory.c:127
#5  0x00007ffff60bd980 in pdgstrs2_omp (k0=k0@entry=18, k=k@entry=31, Glu_persist=Glu_persist@entry=0x11311000, grid=grid@entry=0x11307318, Llu=Llu@entry=0x11312000, 
    Ublock_info=Ublock_info@entry=0x11873000, stat=0x7fffffffd300) at /home/balay/petsc/arch-ci-linux-cuda-double/externalpackages/git.superlu_dist/SRC/pdgstrf2.c:831
#6  0x00007ffff60b898f in pdgstrf (options=options@entry=0x11307350, m=m@entry=1600, n=n@entry=1600, anorm=anorm@entry=2.7527700833655691, LUstruct=LUstruct@entry=0x11307438, 
    grid=grid@entry=0x11307318, stat=stat@entry=0x7fffffffd300, info=0x7fffffffd2d0)
    at /home/balay/petsc/arch-ci-linux-cuda-double/externalpackages/git.superlu_dist/SRC/pdgstrf.c:1298
#7  0x00007ffff609a1cb in pdgssvx (options=options@entry=0x11307350, A=A@entry=0x113073e0, ScalePermstruct=ScalePermstruct@entry=0x11307410, B=B@entry=0x0, ldb=1600, 
    nrhs=nrhs@entry=0, grid=0x11307318, LUstruct=0x11307438, SOLVEstruct=0x11307458, berr=0x0, stat=0x7fffffffd300, info=0x7fffffffd2d0)
    at /home/balay/petsc/arch-ci-linux-cuda-double/externalpackages/git.superlu_dist/SRC/pdgssvx.c:1175
#8  0x00007ffff6c02be7 in MatLUFactorNumeric_SuperLU_DIST (F=0x11303a20, A=0x111d7200, info=<optimized out>)
    at /home/balay/petsc/src/mat/impls/aij/mpi/superlu_dist/superlu_dist.c:329
#9  0x00007ffff680fb79 in MatLUFactorNumeric (fact=0x11303a20, mat=0x111d7200, info=info@entry=0x111a9518) at /home/balay/petsc/src/mat/interface/matrix.c:3147
#10 0x00007ffff7152db8 in PCSetUp_LU (pc=0x111a1e80) at /home/balay/petsc/src/ksp/pc/impls/factor/lu/lu.c:126
#11 0x00007ffff711198e in PCSetUp (pc=0x111a1e80) at /home/balay/petsc/src/ksp/pc/interface/precon.c:894
#12 0x00007ffff737a147 in KSPSetUp (ksp=ksp@entry=0x1114ceb0) at /home/balay/petsc/src/ksp/ksp/interface/itfunc.c:376
#13 0x00007ffff737aa7f in KSPSolve_Private (ksp=ksp@entry=0x1114ceb0, b=b@entry=0x111d0bc0, x=x@entry=0x111c9880) at /home/balay/petsc/src/ksp/ksp/interface/itfunc.c:633
#14 0x00007ffff737dfc7 in KSPSolve (ksp=0x1114ceb0, b=b@entry=0x111d0bc0, x=x@entry=0x111c9880) at /home/balay/petsc/src/ksp/ksp/interface/itfunc.c:853
#15 0x00007ffff755e0ed in SNESSolve_NEWTONLS (snes=0x10fbb640) at /home/balay/petsc/src/snes/impls/ls/ls.c:225
#16 0x00007ffff74fc3b1 in SNESSolve (snes=0x10fbb640, b=b@entry=0x0, x=<optimized out>) at /home/balay/petsc/src/snes/interface/snes.c:4520
#17 0x0000000000404e70 in main (argc=<optimized out>, argv=<optimized out>) at ex19.c:161

@xiaoyeli
Owner

I don't know why. This routine, pdgstrs2_omp, runs on the CPU and hasn't changed for a long time. It uses "omp task" parallelism. Perhaps you can turn that off?

@balay
Contributor Author

balay commented Mar 3, 2020

Sorry, I don't understand OpenMP well. My build is without OpenMP, so I would think "omp task" parallelism is not enabled.

When I enable OpenMP in the build, the test works.

@xiaoyeli
Owner

xiaoyeli commented Mar 9, 2020

There was a mistake: that particular OpenMP pragma was not enclosed in
#ifdef _OPENMP
#endif
That is why it is still executed even when OpenMP is not used. I fixed that in the new v6.3.0 release.
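
For illustration, here is a minimal sketch of the guarding pattern described above. It is not the actual SuperLU_DIST code: the names `current_thread_id` and `scale_blocks` are hypothetical, and the task layout is an assumption chosen only to show where the `#ifdef _OPENMP` guards go. The `omp.h` include and the OpenMP runtime call need the guard for a serial build to compile and link at all; the pragmas are guarded in the same way so that the serial path contains no OpenMP constructs.

```c
#include <stddef.h>

#ifdef _OPENMP
#include <omp.h>   /* header is only expected to exist in OpenMP builds */
#endif

/* Hypothetical helper: calling thread's id, or 0 in a serial build. */
static int current_thread_id(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Hypothetical kernel: scale n entries, one task per entry under OpenMP,
 * a plain serial loop otherwise. */
static void scale_blocks(double *x, size_t n, double alpha)
{
#ifdef _OPENMP
#pragma omp parallel
#pragma omp single
#endif
    {
        for (size_t i = 0; i < n; ++i) {
#ifdef _OPENMP
#pragma omp task firstprivate(i)
#endif
            {
                x[i] *= alpha;
            }
        }
    }
}
```

Built without an OpenMP flag, both functions reduce to ordinary serial C; built with one (for example `-fopenmp` on GCC), the same source creates one task per entry.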

@balay
Contributor Author

balay commented Mar 10, 2020

@xiaoyeli this error still exists with v6.3.0 for me. [I just retried with v6.3.0 and get the same errors.]

  • I need the patch in this PR; otherwise superlu_dist does not build for me [with OpenMP disabled].
  • Once I use this patch, superlu_dist builds, but I get the runtime error shown in the gdb stack trace posted earlier.
