*Major T/s improvement* Use the Metal qmatmul MM kernels #2615

EricLBuehler · 2024-11-14T19:29:07Z

This PR automatically uses the Metal GGML quantized mat-mat (MM) kernels instead of always using the mat-vec (MV) kernels, and upstreams a few related changes that this requires.

Before this change, Candle's Metal decoding performance was on par with MLX and llama.cpp, but prompt performance fell short. After this change, prompt performance on the benchmark is about 2.5x faster than MLX and within 10% of llama.cpp, a speedup of almost 6x.

This PR switches to using the MV kernels only when the second-to-last dimension (D::Minus2) of the xs input tensor is equal to 1; in all other cases the MM kernels are used. This mirrors the logic in GGML.
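To illustrate the dispatch rule, here is a minimal, dependency-free sketch. It is not the code from this PR: `QMatMulKernel` and `select_kernel` are hypothetical names, and the shape is passed as a plain slice rather than a Candle tensor. It only shows the condition described above, where a second-to-last dimension of 1 selects the MV path.

```rust
/// Hypothetical enum for illustration: which family of quantized
/// Metal kernels would be dispatched.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QMatMulKernel {
    /// Matrix-vector kernels.
    MatVec,
    /// Matrix-matrix kernels.
    MatMat,
}

/// Hypothetical helper: pick the kernel family from the dims of `xs`.
/// Returns `None` when `xs` has fewer than two dimensions, since there
/// is no second-to-last dimension to inspect in that case.
pub fn select_kernel(xs_dims: &[usize]) -> Option<QMatMulKernel> {
    if xs_dims.len() < 2 {
        return None;
    }
    // Equivalent of looking at D::Minus2 on the input tensor.
    let minus2 = xs_dims[xs_dims.len() - 2];
    Some(if minus2 == 1 {
        QMatMulKernel::MatVec
    } else {
        QMatMulKernel::MatMat
    })
}
```

Assuming an input laid out as (batch, seq_len, hidden), `select_kernel(&[1, 1, 4096])` picks the MV kernels (a single row, as in token-by-token decoding) while `select_kernel(&[1, 512, 4096])` picks the MM kernels (many rows, as in prompt processing).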

Besides utilizing the MM kernels, this PR also upstreams some required changes:

* Adds GGUF bf16 support (originally)
* Updates quantized Metal kernels to support bf16 (originally)
* Syncs GGML <> Candle Metal kernels (originally)
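For readers unfamiliar with bf16, the following is a rough scalar sketch of what dequantizing bf16 and taking a dot product over bf16 data involves. It is not the implementation from these changes: the function names are hypothetical, and bf16 values are represented as raw `u16` bit patterns to keep the example free of external crates. It relies only on the fact that a bf16 value is the upper 16 bits of an IEEE 754 f32.

```rust
/// Hypothetical helper: widen a bf16 bit pattern to f32.
/// bf16 keeps the sign, the 8 exponent bits and the top 7 mantissa bits
/// of an f32, so shifting left by 16 reconstructs the f32 bit pattern.
pub fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Hypothetical helper: dequantize a slice of bf16 bit patterns into f32.
pub fn dequantize_bf16(src: &[u16], dst: &mut [f32]) {
    assert_eq!(src.len(), dst.len(), "source and destination lengths differ");
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = bf16_bits_to_f32(s);
    }
}

/// Hypothetical scalar (non-SIMD) dot product over bf16 data,
/// accumulating in f32.
pub fn vec_dot_bf16_scalar(a: &[u16], b: &[u16]) -> f32 {
    assert_eq!(a.len(), b.len(), "input lengths differ");
    a.iter()
        .zip(b)
        .map(|(&x, &y)| bf16_bits_to_f32(x) * bf16_bits_to_f32(y))
        .sum()
}
```

For example, the bit pattern `0x3F80` widens to `1.0` and `0x4000` to `2.0`, because the f32 encodings of those values are `0x3F80_0000` and `0x4000_0000`.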

EricLBuehler · 2024-11-14T21:30:40Z

@LaurentMazare if you could review, that would be great!

More benchmarks with some smaller models can be found here: EricLBuehler/mistral.rs#903 (comment)

EricLBuehler and others added 5 commits November 14, 2024 14:13

Add GGUF BF16 support (#17)

053e63a

* Add GGUF bf16 type support
* Add non avx impl for vec_dot_bf16
* Fix from_u32
* Fix loading
* Fix dequant of bf16

Update kernels for metal bf16 (#19)

9fa0b21

* Update kernels for metal bf16
* Fix typo
* Check if have bfloat

Sync ggml metal kernels (#33)

23dacf7

Metal qmatmul mat-mat product (#39)

885bd31

* Test passes
* All tests pass
* Now all the tests really pass
* Try out always using mm
* Mirror llama.cpp metric
* Mirror llama.cpp metric
* Update test

Update test

82fe8ea

Vaibhavs10 requested a review from LaurentMazare November 18, 2024 19:53

This was referenced Nov 22, 2024

Quantized much slower than llama.cpp with same model and settings... #1939

Open

Sync with GGML: add GGML bf16 support #2640

Open
