CUDA Graph Compute Function Refactor (precursor for performance improvements) #11042
Hi All,
I am working on improving llama.cpp's CUDA graph performance on behalf of NVIDIA.
In preliminary testing on a high-end system, we are seeing up to a 3% performance gain from overlapping CPU and GPU work and from improving CPU -> GPU copy scheduling. The changes are likely to be even more impactful on less capable hardware.
To pave the way for these changes (and to provide readable diffs), I first isolated the cosmetic changes in this PR.
This PR does not contain any logic changes. It merely slims down `ggml_backend_cuda_graph_compute()` by moving certain loops and other subtasks of the original function into 5 new functions. These changes considerably improve the readability and future maintainability of this part of the CUDA backend.
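To give a rough idea of the shape of the change, here is a minimal, purely illustrative sketch of this kind of decomposition. The helper names, simplified types, and logic below are hypothetical placeholders for illustration only, not the actual functions or signatures introduced in this PR:

```cpp
// Illustrative sketch only: stand-in types and hypothetical helper names,
// not the actual llama.cpp CUDA backend code from this PR.
#include <cstdio>

struct cgraph     { int  n_nodes      = 0;     }; // stand-in for the ggml compute graph
struct cuda_graph { bool instantiated = false; }; // stand-in for cached CUDA graph state

// Hypothetical helpers: each wraps one subtask that previously lived inline
// in the monolithic compute function.
static bool graph_update_required(const cuda_graph & g, const cgraph & c) {
    // e.g. check whether the graph topology changed since the last capture
    return !g.instantiated || c.n_nodes == 0;
}

static void evaluate_graph(const cgraph & c) {
    // e.g. loop over the nodes and launch their kernels
    std::printf("evaluating %d nodes\n", c.n_nodes);
}

static void update_graph_executable(cuda_graph & g) {
    // e.g. (re)instantiate or patch the captured CUDA graph
    g.instantiated = true;
}

// The slimmed-down top-level function: after the refactor it mostly
// sequences calls to the helpers instead of containing the loops itself.
static int graph_compute(cuda_graph & g, const cgraph & c) {
    const bool update = graph_update_required(g, c);
    evaluate_graph(c);
    if (update) {
        update_graph_executable(g);
    }
    return 0; // success
}

int main() {
    cuda_graph g;
    cgraph     c{ /*n_nodes=*/4 };
    return graph_compute(g, c);
}
```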
Should I add prefixes to the new function names, and if so, what do you suggest?
@agray3 @mtavenrath