CUDA Graph Compute Function Refactor (precursor for performance improvements) #11042
Hi All,
I am working on improving llama.cpp's CUDA graph performance on behalf of NVIDIA.
In preliminary testing on a high-end system, we are seeing up to a 3% performance gain from overlapping CPU and GPU work and from improving CPU -> GPU copy scheduling. The changes are likely to be even more impactful on less capable hardware.
To pave the way for these changes (and to provide readable diffs), I first isolated the cosmetic changes in this PR.
This PR does not contain any logic changes. It merely slims down `ggml_backend_cuda_graph_compute()` by moving certain loops and other subtasks of the original function into 5 new functions. These changes considerably improve the readability and future maintainability of this part of the CUDA backend.
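To give a rough idea of the shape of the change, here is a minimal, purely illustrative sketch of this kind of decomposition. The helper names, simplified types, and logic below are hypothetical placeholders for illustration only, not the actual functions or signatures introduced in this PR:

```cpp
// Illustrative sketch only: stand-in types and hypothetical helper names,
// not the actual llama.cpp CUDA backend code from this PR.
#include <cstdio>

struct cgraph     { int  n_nodes      = 0;     }; // stand-in for the ggml compute graph
struct cuda_graph { bool instantiated = false; }; // stand-in for cached CUDA graph state

// Hypothetical helpers: each wraps one subtask that previously lived inline
// in the monolithic compute function.
static bool graph_update_required(const cuda_graph & g, const cgraph & c) {
    // e.g. check whether the graph topology changed since the last capture
    return !g.instantiated || c.n_nodes == 0;
}

static void evaluate_graph(const cgraph & c) {
    // e.g. loop over the nodes and launch their kernels
    std::printf("evaluating %d nodes\n", c.n_nodes);
}

static void update_graph_executable(cuda_graph & g) {
    // e.g. (re)instantiate or patch the captured CUDA graph
    g.instantiated = true;
}

// The slimmed-down top-level function: after the refactor it mostly
// sequences calls to the helpers instead of containing the loops itself.
static int graph_compute(cuda_graph & g, const cgraph & c) {
    const bool update = graph_update_required(g, c);
    evaluate_graph(c);
    if (update) {
        update_graph_executable(g);
    }
    return 0; // success
}

int main() {
    cuda_graph g;
    cgraph     c{ /*n_nodes=*/4 };
    return graph_compute(g, c);
}
```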
Should I add prefixes to the new function names, and if so, what do you suggest?
@agray3 @mtavenrath