Add some fast Metal MLX SDPA kernels #2584
Conversation
* Sketch the sdpa kernel
* Add full sdpa kernel
* Add test
* Add vectorized kernel for decoding
* Update tests
* Add some docs
* Fix sdpa_vector names
* Add softcapping for vectorized sdpa
* Add softcapping for full sdpa
* Add support for head dim 32, 96, 256
* Add support for head dim 32, 96, 256
* Update docs
* Add update notice
* Clippy and format
candle-nn/src/ops.rs (Outdated)

        candle::bail!("query `n_heads` must be a multiple of `n_kv_heads`");
    }

    let k_head = k_l.dims()[k_l.dims().len() - 1];
k_l.dim(D::Minus1)? would be simpler and make for a better error message than a panic (the same applies to a bunch of places in this function).
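For illustration, a minimal sketch of the suggested change, assuming a Layout::dim accessor like the one discussed below that returns a Result (the helper name k_head_dim is hypothetical):

```rust
use candle::{D, Layout, Result};

// Illustration of the suggestion: resolve the last dimension through a
// fallible accessor instead of indexing into the dims slice, so a malformed
// layout produces a descriptive error rather than an out-of-bounds panic.
fn k_head_dim(k_l: &Layout) -> Result<usize> {
    // Original form: k_l.dims()[k_l.dims().len() - 1]
    k_l.dim(D::Minus1)
}
```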
I'm not sure Layout::dim exists; that is why I used Layout::dims. Perhaps we could add it?
You can get the shape, which should have the dim methods.
I think the Shape also doesn't have the dim methods?
https://github.com/search?q=repo%3Ahuggingface%2Fcandle+%22fn+dim%22&type=code
Ah right, I've added it quickly as part of this PR; at least it should give better error messages rather than just an out-of-bounds panic.
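A rough sketch of what such a dim accessor could look like (the free function shape_dim and the exact error message are assumptions; the actual method added in this PR lives on Shape/Layout and also accepts D::Minus1-style indices):

```rust
use candle::{Result, Shape};

// Sketch only: resolve a dimension index against a Shape, returning an error
// instead of panicking when the index is out of range.
fn shape_dim(shape: &Shape, index: usize) -> Result<usize> {
    match shape.dims().get(index) {
        Some(&d) => Ok(d),
        None => candle::bail!("dim index {index} out of range for shape {shape:?}"),
    }
}
```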
Latest commit fixes a bug reported in mlx, see ml-explore/mlx#1558.
Thank you!
* Add some fast Metal MLX SDPA kernels (#32)
* Sketch the sdpa kernel
* Add full sdpa kernel
* Add test
* Add vectorized kernel for decoding
* Update tests
* Add some docs
* Fix sdpa_vector names
* Add softcapping for vectorized sdpa
* Add softcapping for full sdpa
* Add support for head dim 32, 96, 256
* Add support for head dim 32, 96, 256
* Update docs
* Add update notice
* Clippy and format
* Conditional compilation for bf16
* Use it in quantized llama
* Some review comments
* Use set_params!
* Remove unused
* Remove feature
* Fix metal sdpa for v stride
* Remove comma
* Add the dim method to layout and shape.

Co-authored-by: Laurent <[email protected]>
This PR adds some MLX SDPA kernels on Metal.
I can observe about a 26% performance improvement with Llama 3.1 8b @ q4k and @ q8_0 when testing through mistral.rs on my Candle fork. I updated the quantized_llama.rs file here to use the new function.
This PR adds a function, candle_nn::ops::sdpa. The MLX attention kernels don't support masking yet, so the performance gains are only for decoding on Metal. Once/if they do, I'll update them; otherwise we can explore using the Flash Attention kernels for Metal from llama.cpp.
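For illustration, a minimal usage sketch of the new op. The sdpa(q, k, v, scale, softcapping) signature, the shapes, and the "softcapping = 1.0 disables it" convention are assumptions based on this PR's description, not a definitive reference:

```rust
use candle::{Device, Result, Tensor};

fn main() -> Result<()> {
    // Decoding-style shapes: (batch, n_heads, seq_len, head_dim) with a
    // single query token attending over a longer key/value cache.
    let device = Device::new_metal(0)?;
    let q = Tensor::randn(0f32, 1f32, (1, 32, 1, 128), &device)?;
    let k = Tensor::randn(0f32, 1f32, (1, 32, 256, 128), &device)?;
    let v = Tensor::randn(0f32, 1f32, (1, 32, 256, 128), &device)?;

    // Assumed signature: sdpa(q, k, v, scale, softcapping). The scale is
    // typically 1/sqrt(head_dim); a softcapping of 1.0 is assumed to mean
    // "no softcapping". No mask is supported yet, so this targets
    // single-token decoding on Metal.
    let scale = 1f32 / (128f32).sqrt();
    let out = candle_nn::ops::sdpa(&q, &k, &v, scale, 1.0)?;
    println!("{:?}", out.dims());
    Ok(())
}
```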