
*Major T/s improvement* Use the Metal qmatmul MM kernels #2615

Open
wants to merge 5 commits into main
Conversation

EricLBuehler
Member

This PR adds automatic use of the Metal GGML quantized mat-mat (MM) kernels instead of always falling back to the mat-vec (MV) kernels, and upstreams a few related/necessary changes.
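As a rough illustration of the user-facing path (shapes, dtype, and the `main` scaffolding here are illustrative, not from the PR), a quantized matmul on a multi-row input now dispatches to the MM kernels:

```rust
// Sketch only: a quantized matmul via QMatMul on Metal. With this PR, a
// prompt-shaped input (many rows) hits the mat-mat kernels instead of
// repeatedly invoking the mat-vec kernel.
use candle_core::quantized::{GgmlDType, QMatMul, QTensor};
use candle_core::{Device, Module, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::new_metal(0)?;
    let weight = Tensor::randn(0f32, 1f32, (4096, 4096), &dev)?;
    let qweight = QTensor::quantize(&weight, GgmlDType::Q4_0)?;
    let qmatmul = QMatMul::from_qtensor(qweight)?;
    // Prompt processing: 128 tokens at once, so the second-to-last dim
    // is > 1 and the MM kernels are used.
    let xs = Tensor::randn(0f32, 1f32, (1, 128, 4096), &dev)?;
    let ys = qmatmul.forward(&xs)?;
    println!("{:?}", ys.dims()); // [1, 128, 4096]
    Ok(())
}
```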

Before this change, Candle's Metal decoding performance was on par with MLX and llama.cpp, but its prompt-processing performance lagged. After this change, prompt performance on the benchmark is about 2.5x faster than MLX and within 10% of llama.cpp, an improvement of almost 6x over the previous implementation.

This PR switches to using the MV kernels only when the D::Minus2 dimension of the xs input tensor equals 1 (a single-row input), and the MM kernels otherwise. This mirrors the dispatch logic in GGML (see the sketch below).
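In Candle terms, the condition might look like this minimal sketch (the helper name is hypothetical; the actual selection code in the Metal backend differs):

```rust
use candle_core::{D, Result, Tensor};

/// Mirror of GGML's dispatch: use the quantized mat-vec kernel only when
/// the second-to-last dimension of `xs` is 1 (a single row); otherwise
/// use the mat-mat kernel. Hypothetical helper, for illustration only.
fn use_mat_vec_kernel(xs: &Tensor) -> Result<bool> {
    Ok(xs.dim(D::Minus2)? == 1)
}
```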

Besides utilizing the MM kernels, this PR also upstreams some required changes:

EricLBuehler and others added 5 commits November 14, 2024 14:13
* Add GGUF bf16 type support

* Add non avx impl for vec_dot_bf16

* Fix from_u32

* Fix loading

* Fix dequant of bf16
* Update kernels for metal bf16

* Fix typo

* Check if have bfloat
* Test passes

* All tests pass

* Now all the tests really pass

* Try out always using mm

* Mirror llama.cpp metric

* Mirror llama.cpp metric

* Update test
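The bf16 commits above add GGUF bf16 type support; the scalar fallback for the dot product is conceptually as simple as this sketch (a free-standing version with assumed slice inputs; the upstream vec_dot_bf16 lives in Candle's quantized CPU backend and also has an AVX path):

```rust
use half::bf16;

// Scalar (non-AVX) fallback for a bf16 dot product, accumulating in f32.
// Sketch only; the upstream code operates on the GGUF storage types.
fn vec_dot_bf16(xs: &[bf16], ys: &[bf16]) -> f32 {
    xs.iter()
        .zip(ys.iter())
        .map(|(x, y)| x.to_f32() * y.to_f32())
        .sum()
}
```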
@EricLBuehler
Member Author

@LaurentMazare if you could review, that would be great!

More benchmarks with some smaller models can be found here: EricLBuehler/mistral.rs#903 (comment)
