Use KernelAbstractions.jl #109

Merged
merged 14 commits on Oct 12, 2024

Conversation

lukem12345
Member

Demonstrate the use of KernelAbstractions.jl to reduce code surface while supporting more architectures

@lukem12345 added the enhancement label on Oct 4, 2024
@lukem12345 self-assigned this on Oct 5, 2024
@lukem12345 marked this pull request as ready for review on Oct 7, 2024
@lukem12345
Member Author

The overhead from KernelAbstractions.jl for the CPU backend seems minimal when run with @btime at the command line. The differing results reported by the “benchmark group” interface do not appear when run in this fashion. I’m going to look further into whether this discrepancy has any effect on TTFP, but it is likely an artifact of synchronization during benchmarking.
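
For reference, here is roughly how I’m timing things at the REPL. The kernel below is a throwaway stand-in, not one of the dec_* kernels; the point is the explicit synchronize call, so @btime measures execution rather than just the kernel launch:

using BenchmarkTools, KernelAbstractions

@kernel function scale!(y, @Const(x), α)
    i = @index(Global)
    y[i] = α * x[i]
end

x = rand(10_000); y = similar(x)
backend = get_backend(x)      # CPU() here; a GPU backend when given device arrays
kernel! = scale!(backend)

@btime begin
    kernel!(y, x, 2.0; ndrange=length(y))
    KernelAbstractions.synchronize(backend)
end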

I’m going to look into whether the kernel ought to be created at the dec_wedge_product level and passed to dec_c_wedge_product and children explicitly, although initial spot-checking seems to indicate that the current organization is fine (i.e. the Julia compiler + KA.jl are sufficiently advanced to not re-compile the kernel each time).
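
Roughly, the alternative organization would look like this. The names dec_wedge_product and dec_c_wedge_product are the real entry points, but the kernel and function bodies here are simplified stand-ins for illustration:

using KernelAbstractions

@kernel function wedge_kernel!(out, @Const(f), @Const(g))
    i = @index(Global)
    out[i] = f[i] * g[i]
end

# Build the kernel once at the top level...
function dec_wedge_product(f, g)
    kernel! = wedge_kernel!(get_backend(f))
    dec_c_wedge_product(kernel!, f, g)
end

# ...and pass the callable down explicitly instead of re-requesting it here.
function dec_c_wedge_product(kernel!, f, g)
    out = similar(f)
    kernel!(out, f, g; ndrange=length(out))
    KernelAbstractions.synchronize(get_backend(out))
    out
end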

I’m going to add support for Metal.jl as an explicit extension. Further, if UF RC has AMD GPUs available I’ll explicitly add those bindings as well. This consists of simply calling the appropriate array constructor on the terms to be passed to the kernel a la:

(CuArray.(wedge_cache[1:end-1])..., wedge_cache[end])
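
In context, that conversion is a one-liner per backend extension; something like the following helper (to_backend and the wedge_cache contents are made up for illustration):

using CUDA

# Convert every cached array except the trailing host-side entry to the
# backend's array type before handing the tuple to the kernel.
to_backend(Arr, cache) = (Arr.(cache[1:end-1])..., cache[end])

wedge_cache = (rand(100), rand(Int32, 2, 100), Int32(100))   # illustrative
cu_cache = to_backend(CuArray, wedge_cache)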

Of course, explicitly supporting any extensions is no longer strictly necessary, as these functions can simply be passed (the equivalent of) CuArray and CuSparseMatrixCSC. This refactor can hold for the time being, since eliminating the backend extensions would be better managed in the scope of a follow-up PR. That follow-up would make this library not only multi-backend, but truly backend-agnostic. Supporting explicit extensions may still be good to have for purposes of e.g. snoop-precompilation.

This PR has not looked at writing more-performant kernels themselves. The nested-indexing pattern used in the current kernels (e.g. f[p[Int32(1), i]]) looks like low-hanging fruit to eliminate, but the implementation details of any particular kernel are of course abstracted away by the KernelAbstractions interface, which is what this PR is concerned with. A follow-up PR should optimize the kernels.
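
For concreteness, the pattern in question looks like this in KA form (an illustrative kernel, not verbatim from this PR): f is read through the lookup table p, and that extra indirection is what a follow-up could try to remove.

using KernelAbstractions

@kernel function wedge_01_kernel!(out, @Const(f), @Const(g), @Const(p))
    i = @index(Global)
    # Nested indexing: read the 0-form f through the simplex lookup table p.
    out[i] = Float32(0.5) * (f[p[Int32(1), i]] + f[p[Int32(2), i]]) * g[i]
end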

@lukem12345
Member Author

Commit 453f5d5, which explicitly added support for the Apple Metal backend, works locally. But it looks like adding Metal.jl as a weak dependency causes errors on Windows and Linux at package precompile time, even though precompilation should still work there. (And the Metal.functional check needs to be swapped out for a different one.)

@jpfairbanks
Member

I think that Decapodes will want to have high-level functions for each backend, along the lines of “convert all my initial conditions to CuArray, then call simulate” and “convert the results back to CPU arrays for analysis”, so keeping the CUDA/Metal/AMD support as extensions on that package would be reasonable, even if we can remove it from CombinatorialSpaces.
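
Something along these lines, with hypothetical names (simulate here stands for whatever solver function Decapodes hands back, and to_gpu/to_cpu are the thin wrappers I have in mind):

using CUDA

to_gpu(u0) = map(CuArray, u0)     # move initial conditions to the device
to_cpu(sol) = map(Array, sol)     # bring results back for analysis

function simulate_on_cuda(simulate, u0, args...)
    sol = simulate(to_gpu(u0), args...)
    to_cpu(sol)
end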

@lukem12345
Member Author

To circumvent the issues with the Metal.jl extension throwing errors at precompile time, I’ll just refactor the operators so that they can take an array constructor (e.g. MtlArray), as mentioned in a previous comment. Instead of dispatching off of Val{:Metal}, they can take this array constructor as an argument directly, essentially keeping similar “dispatch” semantics. Apple Metal does not support sparse matrices directly. (Apple Accelerate does, but AppleAccelerate.jl does not support them yet.) For the time being, I’ll test the remaining operators with dense MtlMatrix matrices.
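
Sketched out, the refactor looks roughly like this, reusing the illustrative wedge_kernel! from my earlier comment (the real operators take more arguments):

using KernelAbstractions

# The array constructor (Array, CuArray, MtlArray, ...) is an ordinary argument,
# so no Val{:Metal}-style dispatch is needed and every backend hits the same method.
function dec_wedge_product(f, g, Arr=Array)
    fd, gd = Arr(f), Arr(g)
    backend = get_backend(fd)
    kernel! = wedge_kernel!(backend)
    out = KernelAbstractions.allocate(backend, eltype(fd), length(fd))
    kernel!(out, fd, gd; ndrange=length(out))
    KernelAbstractions.synchronize(backend)
    out
end

# e.g. dec_wedge_product(f, g, MtlArray) on Apple silicon,
#      dec_wedge_product(f, g, CuArray)  on an NVIDIA GPU.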

@lukem12345 lukem12345 requested a review from GeorgeR227 October 11, 2024 20:42