
WIP: EDGE3D update #478

Draft · wants to merge 3 commits into develop from feature/guthrey1/edge_3d_update

Conversation

@pguthrey (Contributor) commented Sep 9, 2024

Summary

  • This PR is a refactoring and bug fix
  • It does the following:
    • TODO: Explores performance improvements to EDGE3D
    • TODO: Addresses the OpenMP target issues (Fix EDGE3D kernel #406)

@pguthrey force-pushed the feature/guthrey1/edge_3d_update branch 3 times, most recently from 32723a7 to 702238f on September 11, 2024
@rhornung67 (Member) commented:

@pguthrey is this ready for review?

@pguthrey marked this pull request as a draft on October 13, 2024
@pguthrey (Contributor, Author) replied:

> @pguthrey is this ready for review?

Not just yet. I need to experiment more with different implementations and will come back to this later.
Also, there are additional apps that are broken with respect to OpenMP target, so I think I will pull that out into another branch and work with others to fix those examples (unless OpenMP target is no longer supported).

@pguthrey force-pushed the feature/guthrey1/edge_3d_update branch from 050da7e to c9b19a6 on November 23, 2024
@pguthrey (Contributor, Author) commented Nov 27, 2024

Here are the results of these changes: a good improvement for CUDA and an impossibly good improvement for HIP. I checked that the results are the same as with the previous algorithm, but I might look more into what is going on with HIP.

| CUDA | Base_Seq | Lambda_Seq | RAJA_Seq | Base_OpenMP | Lambda_OpenMP | RAJA_OpenMP | Base_CUDA | Lambda_CUDA | RAJA_CUDA |
|---|---|---|---|---|---|---|---|---|---|
| tuning | default | default | default | default | default | default | block_256 | block_256 | block_256 |
| current impl | 4.42E+01 | 4.45E+01 | 4.44E+01 | 8.98E-01 | 9.08E-01 | 9.04E-01 | 1.53E-01 | 1.53E-01 | 1.54E-01 |
| new impl | 3.05E+01 | 3.06E+01 | 3.06E+01 | 6.08E-01 | 6.11E-01 | 6.11E-01 | 7.06E-02 | 7.02E-02 | 7.04E-02 |
| speedup current/new | 1.5 | 1.5 | 1.5 | 1.5 | 1.5 | 1.5 | 2.2 | 2.2 | 2.2 |
| speedup openmp/kernel | | | | 1.0 | 1.0 | 1.0 | 8.6 | 8.7 | 8.6 |

| HIP | Base_Seq | Lambda_Seq | RAJA_Seq | Base_OpenMP | Lambda_OpenMP | RAJA_OpenMP | Base_HIP | Lambda_HIP | RAJA_HIP |
|---|---|---|---|---|---|---|---|---|---|
| tuning | default | default | default | default | default | default | block_256 | block_256 | block_256 |
| current impl | 1.74E+01 | 1.74E+01 | 1.74E+01 | 4.36E-01 | 3.79E-01 | 4.44E-01 | 1.34E-01 | 9.64E-02 | 9.73E-02 |
| new impl | 1.56E+01 | 1.54E+01 | 1.54E+01 | 3.51E-01 | 3.70E-01 | 4.01E-01 | 1.03E-03 | 9.90E-05 | 9.10E-05 |
| speedup current/new | 1.1 | 1.1 | 1.1 | 1.2 | 1.0 | 1.1 | 130.8 | 974.0 | 1069.0 |
| speedup new openmp/kernel | | | | 1.0 | 0.9 | 0.9 | 341.7 | 3541.4 | 3852.7 |
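The speedup rows appear to be ratios of the corresponding timing rows (likely computed before rounding); e.g. for Lambda_HIP:

$$\text{speedup} = \frac{t_{\text{current}}}{t_{\text{new}}} = \frac{9.64\times10^{-2}}{9.90\times10^{-5}} \approx 974$$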

@MrBurmark (Member) commented:

Perhaps there is still register spilling with CUDA, or something like that, which is making the dramatic difference. We'll have to look at the instructions to see what happened.

@pguthrey (Contributor, Author) replied:

> Perhaps there is still register spilling with CUDA, or something like that, which is making the dramatic difference. We'll have to look at the instructions to see what happened.

That makes some sense. If I add up the values needed by the vectors and the matrix, I get

12*13/2 + 3*12 + 3*12 = 78 + 36 + 36 = 150 > 128
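As a rough illustration of that count, here is a hypothetical sketch (not the actual EDGE3D kernel; the kernel name and the reading of 12*13/2 as the upper triangle of a symmetric 12×12 matrix are assumptions). With 150 doubles live per thread, i.e. 300 32-bit registers, the working set exceeds even the 255-register hardware limit, so some of it must spill to local memory; compiling with `nvcc -Xptxas -v` reports the register count and spill loads/stores.

```cuda
// Hypothetical sketch of the per-thread working set discussed above
// (not the actual EDGE3D kernel). 150 doubles = 300 32-bit registers
// per thread, which forces the compiler to spill to local memory.
__global__ void workingSetSketch(const double* in, double* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  double M[12 * 13 / 2]; // symmetric 12x12 matrix, upper triangle: 78 doubles
  double x[3 * 12];      // three length-12 vectors: 36 doubles
  double y[3 * 12];      // three more length-12 vectors: 36 doubles

  // Fill the working set with input-dependent values so the compiler
  // cannot precompute the arrays at compile time.
  for (int k = 0; k < 78; ++k) { M[k] = in[i] + k; }
  for (int k = 0; k < 36; ++k) { x[k] = in[i] * (k + 1); y[k] = in[i] - k; }

  // Reduce over everything so none of the arrays are dead-code eliminated.
  double acc = 0.0;
  for (int k = 0; k < 36; ++k) { acc += x[k] * y[k]; }
  for (int k = 0; k < 78; ++k) { acc += M[k]; }
  out[i] = acc;
}
```

Comparing the `-Xptxas -v` spill counts (and the SASS from `cuobjdump -sass`) between the current and new implementations would show whether reduced register pressure explains the speedup.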
