Nalu-Wind is crashing on AMD GPUs for many reg tests #1323

Open
PaulMullowney opened this issue Nov 4, 2024 · 0 comments

Summary:
Many Nalu-Wind regression tests fail with memory access faults, often accompanied by HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION, on several AMD GPU models (MI300X and MI250).

MI300X failures:
To shorten build times, I have been building on a non-Cray system with MI300X GPUs. The build succeeds with:
- rocm/6.2.1
- openmpi/5.0.3-ucc1.3.x-ucx1.16.x-rocm6.2.0

Exawind-manager
git@github.com:PaulMullowney/exawind-manager.git
branch amd-debug
SHA: c4b7dd11

nalu-wind
git@github.com:PaulMullowney/nalu-wind.git
branch amd-debug
SHA: 9242f8b

Debug builds
MI300X : nalu-wind@master+rocm+tioga amdgpu_target=gfx942 build_type=Debug ^trilinos build_type=Debug
MI250 : nalu-wind@master+rocm+tioga amdgpu_target=gfx90a build_type=Debug ^trilinos build_type=Debug

with tolerances --abs-tol 1e-08 --rel-tol 1e-06

| Test | MI300X | MI250 |
| --- | --- | --- |
| fsiTurbineSurrogate | Failed (4) | Failed (1) |
| airfoilRANSEdgeNGPHypre.rst | Passed | Passed |
| ablNeutralNGPHypre | Passed | Failed (2) |
| ablNeutralNGPHypreSegregated | Passed | Failed (2) |
| multiElemCylinder | Failed (3) | Failed (3) |
| VOFZalDisk | Failed (4) | Failed (4) |
| airfoilSST_Gamma_Trans | Passed | Passed |
| oversetRotCylNGPHypre | Passed | Failed (4) |
| convTaylorVortex | Failed (5) | Failed (3) |
| unitTestGPU | Passed | Passed |
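
For readers unfamiliar with the `--abs-tol` / `--rel-tol` options, the following is a minimal sketch of how such a pair of tolerances is commonly applied when comparing a computed norm against a reference value. The function name `norms_match` and the rule that a value passes if it satisfies either tolerance are assumptions made for illustration; the actual Nalu-Wind pass/fail script may combine the two checks differently.

```python
def norms_match(test_value, gold_value, abs_tol=1e-8, rel_tol=1e-6):
    """Hypothetical check: True when a computed norm agrees with its reference.

    Assumption: a value passes if EITHER the absolute or the relative
    difference is within tolerance. This is one common convention, not
    necessarily the one used by the Nalu-Wind test harness.
    """
    abs_diff = abs(test_value - gold_value)
    # Scale the relative difference by the larger magnitude of the two values
    scale = max(abs(test_value), abs(gold_value))
    rel_diff = abs_diff / scale if scale > 0.0 else 0.0
    return abs_diff <= abs_tol or rel_diff <= rel_tol


if __name__ == "__main__":
    print(norms_match(1.0, 1.0 + 5e-7))  # within the relative tolerance
    print(norms_match(1.0, 1.001))       # outside both tolerances
```

Note that such a comparison only applies to tests that run far enough to write a norm file; the memory access faults reported here abort the run before any comparison takes place.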

Release builds
MI300X : nalu-wind@master+rocm+tioga amdgpu_target=gfx942 build_type=Release ^trilinos build_type=Release
MI250 : nalu-wind@master+rocm+tioga amdgpu_target=gfx90a build_type=Release ^trilinos build_type=Release

| Test | MI300X | MI250 |
| --- | --- | --- |
| fsiTurbineSurrogate | Failed (4) | Failed (4) |
| airfoilRANSEdgeNGPHypre.rst | Passed | Passed |
| ablNeutralNGPHypre | Failed (3) | Failed (3) |
| ablNeutralNGPHypreSegregated | Failed (3) | Failed (3) |
| multiElemCylinder | Failed (3) | Failed (3) |
| VOFZalDisk | Failed (4) | Failed (4) |
| airfoilSST_Gamma_Trans | Passed | Passed |
| oversetRotCylNGPHypre | Failed (4) | Failed (4) |
| convTaylorVortex | Failed (3) | Failed (3) |
| unitTestGPU | Passed | Passed |

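Failure patterns (the number in parentheses after "Failed" in the test results refers to this list; each entry shows the error followed by the last Nalu-Wind output before the failure):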

  1. Memory access fault
    Time Step Count: 1 Current Time: 0.00455075
    dtN: 0.00455075 dtNm1: 0.00455075 gammas: 1 -1 0

  2. Memory access fault + HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION
    Realm::populate_variables_form_input() candidate input time: 0 for Realm: fluidRealm

  3. Memory access fault
    Parallel consistency noted in master/slave pairings:

  4. Memory access fault (+ HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION)
    Realm::create_output_mesh() End

  5. Runs to completion but does not generate a .norm file (reason unclear)
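
The patterns above can be told apart mechanically from a test log. The sketch below is a hypothetical triage helper, not part of Nalu-Wind or exawind-manager: it reports whether the two error strings quoted above appear and returns the last line printed before the first of them. The function name `triage_log` and the sample log are illustrative assumptions, and since stdout and stderr may be buffered differently, the reported "last output" is only approximate.

```python
FAULT = "Memory access fault"
APERTURE = "HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION"


def triage_log(text):
    """Hypothetical helper: summarize how a regression test log ended."""
    # Drop blank lines so the "last output" is always a meaningful line
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    error_idx = [i for i, ln in enumerate(lines) if FAULT in ln or APERTURE in ln]
    # Last line printed before the first error line (or the final line if no error)
    cutoff = error_idx[0] if error_idx else len(lines)
    last_output = lines[cutoff - 1] if cutoff > 0 else None
    return {
        "memory_fault": any(FAULT in ln for ln in lines),
        "aperture_violation": any(APERTURE in ln for ln in lines),
        "last_output": last_output,
    }


if __name__ == "__main__":
    # Illustrative log assembled from the strings quoted in pattern 4
    sample = "\n".join([
        "Realm::create_output_mesh() End",
        "Memory access fault",
        "HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION",
    ])
    print(triage_log(sample))
```

A log with neither error string and no .norm file next to it would correspond to pattern 5.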
