vulkan: optimize mul_mat for small values of N #10991
Conversation
Results from
The "before" results with coopmat1 or no coopmat were worse (I can share them if somebody is interested, but it's probably more useful to benchmark another GPU instead). Still thinking about where to put the cutoff for switching from mat_mul_vec to mat_mul. Seems like 8 would still be better using mat_mul_vec, and it doesn't cost anything except a little bit of compile time. Let's collect data on some other systems before finalizing anything. |
CC @netrunnereve, can you please help with some perf tests? |
Results with mul_mat_vec_max_cols == 8:
|
Here are the numbers on my RX 470; it's much faster at small N compared to master. My card prefers a max cols of 8 or maybe something even larger. Master:
PR:
max cols of 8:
|
Here are my results with a 7900XTX running radv. This PR: main: n_kv_max = 4096, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 1, n_gpu_layers = 99, n_threads = 12, n_threads_batch = 12
Master: main: n_kv_max = 4096, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 1, n_gpu_layers = 99, n_threads = 12, n_threads_batch = 12
Conclusion:
Let me know if you want any additional tests at different batch sizes. Thanks for making this PR! |
I didn't see a perf regression for N==1. I've updated the limit to 8, and removed "draft". |
Thanks @Mushoz . I've updated the limit to 8. Feel free to try 16, but I suspect the mat-mat mul path would work better for 16, at least if we tuned the matrix sizes (the current set of three sizes may be limiting...). |
Token generation is looking good at batch size 8 as well now!
Going to try and see if a limit of 16 makes more sense, as N=8 is now outperforming N=16.
What did you mean with this btw? I can clearly see a 0.5 token/sec drop on my N=1 result on this branch vs the master branch. I think that's outside the margin of error? |
I meant in my own local testing. Is this outside the margin of error for you? |
Limit at 16:
So it seems like 8 is indeed the sweet spot. |
I'm surprised it's worse at 16. Maybe using too many registers? You could try changing rm_kq and rm_stdq to 1, it may not make sense to do multiple rows with such a large value of N. |
Just to double check: I merely increased mul_mat_vec_max_cols from 8 to 16. That was the change you wanted me to test, right?
Any pointers on exactly what changes I need to make? I am not very familiar with the llama.cpp codebase, unfortunately. |
I ran the test-backend-ops perf benchmark on my devices for N = 1, 2, 3, 4, 5, 8, 16 and 32. Note that I set the limit to 16 to be able to see what difference it makes there. Looks good overall and I think 8 is a decent compromise between the number of shaders to compile and performance. The x-axis indices map to these tests:
|
Just set these values to 1 at around line 1861 in ggml-vulkan.cpp. |
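In other words, the suggested change amounts to roughly this (a sketch only: the exact surrounding code at that spot may differ between revisions, and the device-specific defaults are omitted):

```cpp
// ggml-vulkan.cpp, in the mul_mat_vec shader setup (around the line mentioned above):
// force a single row per workgroup for both the standard quants and the k-quants,
// instead of the device-dependent defaults.
rm_stdq = 1;
rm_kq   = 1;
```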
Slightly more detailed comparison on my 7900XTX: Master:
This PR (limit set to 16):
Conclusions:
Interesting master ROCm comparison (without FA):
Conclusions:
I will now make the suggested changes and re-run batch sizes 1 through 16 to see if setting those values to 1 is going to make any difference. |
Damn, I am stupid. I didn't find those variables because I was looking in the diff instead of the actual file. I was able to run the benchmarks now:
As you can see, the sharp performance drop-off at batch sizes 14, 15 and 16 is completely gone. Batch size 13 performs very similarly to the previous test, but for all batch sizes lower than 13 the performance is worse with this suggested change. Ideally we would set rm_kq and rm_stdq to 1 only for the batch sizes that benefit from it, but:
|
Vulkan, ROCm and CUDA are all just APIs. Vulkan has a different focus, but it's also very low-level and (apart from being less convenient to use for compute-only programs) isn't inherently worse. Most relevant is the device code, not necessarily the API it's written in. But of course there are some limitations to Vulkan that the compute APIs don't have.
This kind of tuning is very common for GPUs, it's why libraries like cuBLAS are huge. They contain tons of specific kernels and the heuristics to pick them in an optimal way for different problem sizes and device capabilities. At some point we'll probably need to implement an auto-tuner to be able to keep up with the number of hardware configurations and tuning parameters in the Vulkan backend. It's already quite a lot. |
This is kinda going off-topic, so please let me know if I should move this conversation elsewhere, but does that mean ROCm should be able to get similar performance at batch sizes 1 through 8 (especially N=1 is severely lacking, to be honest) with optimization work within llama.cpp itself? Or did I misunderstand you? |
Yeah, the ROCm backend is basically using the CUDA code. It's mostly tuned for Nvidia, so AMD performance is not optimal. But so far there is no developer willing to put in the time to work on it. You can see the code selecting different matmul (which is always the most relevant operation for performance) variants in |
I ran the batched bench with llama 8b q4_0 for my devices as well to gather some more data for tuning.
RTX 3090
Master:
PR:
Radeon RX 6800 XT
Master:
PR:
Radeon Pro VII
Master:
PR:
Intel A770
Master:
PR:
(Performance got so low that I stopped the test.) It seems something around 13 is optimal for the RTX 3090, around 22 for the Radeon Pro VII and 7 for the A770. On the RX 6800 XT I reached the maximum batch size of 28 that the benchmark offered and still didn't reach the point where the matmul shader got more efficient. Edit: But this heavily depends on quant complexity. With q4_0 the matrix-vector shader maintains good performance up to much larger n than with q4_k_s, at least on AMD. |
I don't know what exactly the batched-bench is measuring, but I noticed that the TG results are affected by the
Thanks, I think it's very likely that these cases were running out of registers when doing so many rows*cols. I don't know much about how speculative decoding is used; how interesting are the n=9 to 16 cases? I think we should go with this PR as-is right now, and we can always tune it further in the future. |
Even with the columns set to 16 and the rows set to 4 this actually doesn't use that many registers. With Q4_0, a 64 subgroup size, 4 rows and 16 columns, I'm getting 54 of 256 vector registers used on GCN, and 44 for Q8_0. For Q4_K and Q6_K it's in the 30-register range. |
That might just be the prompt size affecting tg. Basically a larger kv cache means more calculations for each token, which slows down tg. But that should not be affected by pp speed. There's definitely still a lot of room for tuning in the matrix multiplication shader, yes. If you have suggestions which directions I could investigate let me know. |
How can this be less than 64?
Getting the large tile size working (or understanding why it would be slow) is probably the first step. The medium tile size may not be large enough to avoid being bandwidth limited. But it also occurred to me that this might be comparing an fp16 matmul in vulkan vs an int8 matmul in rocm. In which case it's less surprising to be slower. |
Sharing my experience from the Metal backend in case it could be useful. Tuning the batch threshold between mat-vec and mat-mat can lead to some gains for small batches, but keep in mind that there are 4 factors in play:
Back when I first realized this for the Metal backend (#3524 (comment)) I was also thinking along the lines of auto-tuning the BS threshold per-device and per-model, but it seems very complicated to actually implement this in some reasonable manner. Eventually, I believe I found a good solution in #10581. We now essentially have 3 types of matrix multiplication kernels in the Metal backend:
This results in universally good performance across a wide range of Apple devices and model sizes. There are still some small gains from manually tuning the BS thresholds per device and per model, but the default performance is overall good. I don't know if this is the best way to do it and it's still far from the theoretical linear scaling that we would ideally like to achieve at BS <= 8. Also not sure how applicable this approach is for the Vulkan backend - probably depends on what vector/matrix data types are available. Pinging @JohannesGaessler in case he wants to give a short summary of what was done in the CUDA backend for small-batch sizes, since I believe the performance is quite good there. |
In the CUDA backend there are in essence three ways to do matrix multiplications:
On most NVIDIA GPUs MMVQ and MMQ are used by default for all batch sizes. On V100s or some AMD GPUs where int8 tensor cores aren't available MMQ is only used up to a batch size of 64. For MMVQ I've found per-GPU tuning to not really be necessary since you're I/O-bound and to my knowledge it's possible to fully utilize I/O without fully utilizing all SMs. For MMQ I initially used one tile size per data type and GPU architecture but I've found that this is a bad approach. Currently the code precompiles template specializations with varying sizes in |
This PR is similar to 2, but the math is done at fp32. For Ada this still seems to be memory bandwidth limited. |
That's the maximum number of registers used per thread, so the entire subgroup would use 54*64=3456 registers total.
Methods 2 and 3 need |
How is it less than 64 per thread, since there are 4*16 accumulator values per thread? Unless the compiler is spilling them to memory, which would be surprising. |
You're right. At this point I have no idea where I got those numbers from (I probably loaded the wrong shader?) and I certainly can't reproduce them now 🤦♀️... I ran the tools again and here are the hopefully correct numbers for Q4_0 with 64 subgroup size and 4 rows. 16 columns: 128 registers 16 columns with manual unrolling disabled in 32 columns: 184 registers The register utilization in this case is high enough to reduce the number of subgroups that can be lined up in front of each core, but at least it's not overflowing and spilling to memory. RGA spits out a warning when there's spilling so the compiler shouldn't be hiding it. |
I saw no performance increase, or even a performance drop, when benchmarking the large tile size vs the medium one on AMD. I managed to get Radeon GPU Profiler to work, so maybe that will give me a hint as to why that is. @ggerganov @JohannesGaessler Thank you for the summaries of how matrix multiplication is handled in Metal and CUDA. Vulkan just has two kinds of shaders currently:
It would be interesting to compare the implementations (especially with Metal) in a like-for-like scenario. With CUDA that's easy, but with Metal we'd have to find a GPU with similar hardware specs to Apple's. Vulkan always has a little more difficulty since the hardware it runs on is not as uniform as it is for Metal and CUDA. AMD, Intel and Nvidia all have different architectures that offer different features and prefer different work sizes (not to even mention phones). I think a good next step would be looking into q8_1 for the activations and int8 for the multiplications, for general matrix multiply. As @netrunnereve mentioned, DP4A is available to Vulkan as part of the |
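For reference, here is a plain-C++ sketch of what a DP4A-style operation computes (four packed signed 8-bit products accumulated into a 32-bit integer); it is only meant to illustrate the integer dot product idea, not the actual shader code:

```cpp
#include <cstdint>
#include <cstdio>

// Reference implementation of a DP4A-style operation: treat each 32-bit word as
// four packed signed 8-bit values, multiply them pairwise and add the four
// products to a 32-bit accumulator. This is the building block for doing
// int8-weight x q8_1-activation dot products in integer math.
static int32_t dp4a_ref(uint32_t a, uint32_t b, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        const int8_t av = (int8_t)((a >> (8 * i)) & 0xff);
        const int8_t bv = (int8_t)((b >> (8 * i)) & 0xff);
        acc += (int32_t)av * (int32_t)bv;
    }
    return acc;
}

int main() {
    // (1,2,3,4) . (5,6,7,8) = 5 + 12 + 21 + 32 = 70
    printf("%d\n", dp4a_ref(0x04030201u, 0x08070605u, 0));
    return 0;
}
```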
I forgot to mention: I got sidetracked with a refactor of the GGUF code, but I am still working on llama.cpp training. I think one of the more relevant use cases will be training LoRAs on top of quantized models. Due to the high memory requirements of training, good performance for small batch sizes will be doubly important (but the current int8-based CUDA code will not work for transposed matrices, I think).
I should mention though that I have never been able to get more than ~40% utilization of int8 tensor cores (on RTX 3090/4090). The throughput of int8 is 2x that of FP16 so I have effectively only been able to achieve ~80% of the maximum theoretical FP16 throughput. This could simply be due to my own inadequacies and it's very possible that if I had used FP16 tensor cores the utilization would have been similarly low. For NVIDIA GPUs without tensor cores the use of
I can talk you through how to do it. |
This is what I have in mind to fix #10966. Currently a draft because it needs more perf testing, particularly to make sure that it doesn't regress perf when N==1.
Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better.
Share some code for reducing the result values to memory in mul_mat_vec_base.
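To make the loop structure concrete, here is a minimal CPU-side sketch in C++ (not the actual GLSL: the real shader works on quantized blocks, spreads the k loop across a subgroup, and reads NUM_COLS from a spec constant, but the shape of the loops is the same):

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <vector>

// Sketch of the N>1 matrix-vector loop: NUM_COLS plays the role of the spec
// constant, one pass over a row of A produces NUM_COLS dot products, and the
// loads from B sit in the innermost loop so consecutive columns hit the cache.
template <int NUM_COLS>
void mul_mat_vec_ncols(const std::vector<float> &a_row,                   // one (dequantized) row of A
                       const std::array<std::vector<float>, NUM_COLS> &b, // NUM_COLS columns of B
                       std::array<float, NUM_COLS> &dst) {
    std::array<float, NUM_COLS> acc{};           // per-column accumulators (registers in the shader)
    for (std::size_t i = 0; i < a_row.size(); ++i) {
        const float av = a_row[i];               // load/dequantize the A element once
        for (int c = 0; c < NUM_COLS; ++c) {     // innermost loop: B loads
            acc[c] += av * b[c][i];
        }
    }
    for (int c = 0; c < NUM_COLS; ++c) {
        dst[c] = acc[c];
    }
}

int main() {
    const std::vector<float> a = {1.0f, 2.0f, 3.0f};
    const std::array<std::vector<float>, 2> b = {{{1.0f, 0.0f, 1.0f}, {0.0f, 1.0f, 1.0f}}};
    std::array<float, 2> out{};
    mul_mat_vec_ncols<2>(a, b, out);
    printf("%.1f %.1f\n", out[0], out[1]);       // 4.0 5.0
    return 0;
}
```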
I'll put directed perf tests in a separate comment.