vulkan: scale caching for k quants + misc fixes #11081
base: master
Conversation
scales[2] = FLOAT_TYPE(data_a[ib0 + i].scales[s_offset + 4]);
scales[3] = FLOAT_TYPE(data_a[ib0 + i].scales[s_offset + 6]);
sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
barrier();
I took a closer look because the shaders you changed (q2_k, q3_k and q6_k) crash my Intel A770. It stops crashing if I remove this barrier.
If you add barriers, you need to make sure there are no early returns in the shader (otherwise it's undefined behaviour). In this case we do have one, but that is easy to fix. Removing the early return does not fix the crash, though, so it's something else.
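For illustration, a minimal standalone sketch of that bug class (a made-up kernel; none of these names come from the actual shaders):

```glsl
#version 450
layout(local_size_x = 32) in;

layout(std430, binding = 0) buffer Data { float vals[]; };

shared float cache[32];

void main() {
    const uint gid = gl_GlobalInvocationID.x;
    const uint lid = gl_LocalInvocationID.x;

    // Early return before a barrier: if some invocations of the workgroup take this
    // path while others go on to hit barrier(), the barrier executes in nonuniform
    // control flow and the behaviour is undefined -- this is the hazard being described.
    if (gid >= uint(vals.length())) {
        return;
    }

    cache[lid] = vals[gid];
    barrier();                              // undefined if any invocation already returned
    vals[gid] = cache[(lid + 1u) % 32u];
}
```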
The number of loop iterations in the outermost loop is nonuniform, so this doesn't work as-is. It's probably fixable.
Ah, I didn't realize this. I can certainly fix it but the crash remains a problem.
So I tried to fix this, but unfortunately the extra branches make it run much slower than master, at 18 t/s. Now all the threads enter the loop and hit the barrier even if they don't have a block to calculate.
---------- ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q6_k.comp ----------
index e1afd55e..a10c194d 100644
@@ -39,16 +39,22 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
}
}
- [[unroll]] for (uint i = ix; i < num_blocks_per_row; i += it_size) {
+ [[unroll]] for (uint i0 = 0; i0 < num_blocks_per_row; i0 += it_size) {
+ uint i = i0 + ix;
const uint y_idx = i * QUANT_K + y_offset;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const uint ib0 = a_offset / QUANT_K + (first_row+n)*num_blocks_per_row;
- const FLOAT_TYPE d = FLOAT_TYPE(data_a[ib0 + i].d);
- sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
+ if (i < num_blocks_per_row)
+ sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
barrier();
+ if (i >= num_blocks_per_row)
+ continue;
+
+ const FLOAT_TYPE d = FLOAT_TYPE(data_a[ib0 + i].d);
+
uint32_t ql0_u32 = uint32_t(data_a_packed16[ib0 + i].ql[ql_offset / 2]) | (uint32_t(data_a_packed16[ib0 + i].ql[ql_offset / 2 + 1]) << 16);
uint32_t ql32_u32 = uint32_t(data_a_packed16[ib0 + i].ql[ql_offset / 2 + 16]) | (uint32_t(data_a_packed16[ib0 + i].ql[ql_offset / 2 + 17]) << 16);
So I tried to fix this but unfortunately the extra branches make it run much slower than master at 18 t/s.
I'd guess that the branches cause the compiler to insert a wait on the scale load and it can't overlap other work with it.
uvec4 s0_hi4 = uvec4(unpack8(s0_hi4_u32));
uvec4 s4_hi4 = uvec4(unpack8(s4_hi4_u32));
sccache[ix][0][itid] = FLOAT_TYPE(bitfieldExtract(uint(data_a[ib0 + i].scales[itid8]), int(v_im*4), 4)); // lower 8 bytes
sccache[ix][1][itid] = FLOAT_TYPE(bitfieldExtract(uint(data_a[ib0 + i].scales[itid8+8]), int(v_im*4), 4)); // upper 8 bytes
NVIDIA GPUs are partly bottlenecked by the unpacking and conversion to float, so reusing work across threads may help (though I'm not sure if shared memory loads will be faster). The shared memory use and barrier also put some constraints on how the compiler can reorder things when NUM_COLS is greater than one, so we'll need to perf test that too.
I have some optimizations in progress for q4_k to reduce the unpacking and conversion cost (some will be applicable to other types, too). I'll share them in the next day or so and we can try them both out.
though I'm not sure if shared memory loads will be faster
For the speed of shared memory loads I generally use IQ4_NL as a guideline, which is nearly the same speed as Q4_0 (around 2% slower) despite the use of the LUT. The parsing of the 4-bit packed weights is the same, but Q4_0 has an extra float conversion and subtraction versus the LUT indexing in IQ4_NL. My guess is that a shared memory read takes only a few clock cycles.
I mean theoretically a subgroup shuffle should be the best way to handle this, as the data would be directly copied between registers rather than having to go through shared memory. Interestingly though I saw in #10999 that shuffles were slower than shared memory when applied to the IQ4_NL LUT, and I didn't bother exploring that further.
As for reusing the conversions and parallelizing them across threads, that should almost guarantee better performance unless the shared memory or barriers slow things down a lot. So basically it's a race between the following two patterns (a rough sketch follows the list):
- load from main memory -> convert to float -> upload to shared memory -> barrier -> read shared memory x4
- load from main memory x4 (a smart GPU might be able to reduce this) -> convert to float x4 -> read from registers x4
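To make the comparison concrete, here is a rough standalone sketch of both patterns (entirely hypothetical names and layout, not this PR's shaders; each group of 4 threads shares 4 scales):

```glsl
#version 450
layout(local_size_x = 32) in;

layout(std430, binding = 0) readonly  buffer Scales { uint raw_scales[]; };
layout(std430, binding = 1) writeonly buffer Out    { float result[]; };

shared float sc_shared[32];

void main() {
    const uint tid  = gl_LocalInvocationID.x;
    const uint base = tid & ~3u;                  // first of the 4 scales this thread needs

    // Pattern 1: cooperative -- each thread loads and converts ONE scale, publishes it,
    // then reads the 4 it needs from shared memory after a barrier.
    sc_shared[tid] = float(raw_scales[tid]);      // 1 load + 1 convert per thread
    barrier();
    float acc1 = 0.0;
    for (uint k = 0; k < 4; ++k) {
        acc1 += sc_shared[base + k];              // 4 shared-memory reads
    }

    // Pattern 2: private -- each thread loads and converts all 4 scales itself and
    // keeps them in registers (no barrier, but 4x the loads and conversions).
    float acc2 = 0.0;
    for (uint k = 0; k < 4; ++k) {
        acc2 += float(raw_scales[base + k]);
    }

    result[tid] = acc1 + acc2;                    // so neither path is optimized away
}
```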
I have some optimizations in progress for q4_k to reduce the unpacking and conversion cost
I'm definitely interested in seeing those when they're ready 😉
I have some optimizations in progress for q4_k to reduce the unpacking and conversion cost
Unfortunately this didn't pan out. I can get some decent gains in the directed tests, but in real models (where the L2 miss rate is much higher) it's bandwidth-limited and doesn't help.
If your changes are something like ced706d, then that's giving me a 5% improvement in Q4_K inference speed, as I'm compute bound.
| model | size | params | backend | ngl | threads | main_gpu | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Small (Master) | 4.36 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 27.49 ± 0.07 |
| llama 8B Q4_K - Small (PR) | 4.36 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 28.82 ± 0.02 |
I had similar changes to the bit twiddling on the scales, I agree we should do those (and the same for q5_k). I was also using the "fast int to float" trick (in fp16x2) where you OR the integer into the mantissa. I've pushed the changes at jeffbolznv@33b2512 if you want to try it.
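For reference, here is my own minimal sketch of that trick for fp16 (not the code from the linked commit): 0x6400 is the bit pattern of 1024.0 in fp16, so OR-ing an integer in [0, 1023] into the 10-bit mantissa produces 1024.0 + n, and subtracting 1024.0 leaves n already converted; doing it on a packed 32-bit word converts two values at once.

```glsl
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

// Hypothetical helper: converts two unsigned integers in [0, 1023] to f16vec2 without
// an int-to-float instruction (6-bit unsigned scales fit; signed scales need extra handling).
f16vec2 fast_u10x2_to_f16vec2(uint lo, uint hi) {
    // OR each value into the mantissa of 1024.0 (0x6400), giving 1024.0 + value ...
    const uint packed = (lo & 0x3FFu) | ((hi & 0x3FFu) << 16) | 0x64006400u;
    // ... then subtract 1024.0 to recover the values as fp16.
    return unpackFloat2x16(packed) - f16vec2(1024.0hf);
}
```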
With the same bit twiddling change on Q5_K I'm seeing a smaller 3% improvement, but it's still an improvement!
| model | size | params | backend | ngl | threads | main_gpu | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q5_K - Small (Master) | 5.21 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 22.39 ± 0.02 |
| llama 8B Q5_K - Small (PR) | 5.21 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 23.02 ± 0.02 |
I was also using the "fast int to float" trick (in fp16x2) where you OR the integer into the mantissa.
That's an interesting trick. I think it's only worth doing on newer GPUs which pack the f16vec2 into a single 32 bit register and do two FP16 calculations at once, and if I quickly estimate the instruction counts on paper it only saves a few cycles compared to the regular unpack and vector conversion. On my RX 470 with no packed FP16 I get a much slower 23 t/s on Q4_K, and the shader straight up fails to compile on my W8100 as that card has no FP16 support. Are you seeing a big improvement on your 4070?
It's about 10% faster in the directed tests, but it doesn't improve real models because they're bandwidth limited. So I'm not pursuing it any further.
RTX 4070 results. Keep in mind there's a lot of variability in the results, but at first glance it seems like an improvement for Q3_K but worse for the others:
[benchmark table omitted]
For multiple n values I'm seeing clear improvements with Q3_K and Q6_K, but Q2_K is much less consistent and is in some cases slower than master.
PR: [benchmark output omitted]
Master: [benchmark output omitted]
I tried calculating the A * scale multiplication ahead of time for Q2_K, but it didn't do much. That also should reduce the number of shared memory reads as the products are stored in registers. A * scale multiplication cached in registers:
[benchmark output omitted]
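Roughly what that looks like (a sketch reusing names from the hunks above; the counts are illustrative, not the actual change):

```glsl
// Multiply the superblock scale d into the cached sub-block scales once, so the inner
// loop reads ready-made products from registers instead of redoing d * scale every time.
FLOAT_TYPE d_scale[8];
[[unroll]] for (uint k = 0; k < 8; ++k) {
    d_scale[k] = d * sccache[ix][k];   // one shared-memory read + one multiply per product
}
// the inner loop then becomes something like:
//   sum = fma(FLOAT_TYPE(q_val), d_scale[sub], sum);
```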
I'll post benchmarks at a later point, but this reduces performance on RTX 3090 for q2_k and q6_k. I see small improvements on Radeon Pro VII. Intel still crashes, but only in
IMO the crash is still very likely related to the barriers in nonuniform control flow. It really needs to be fixed if we're going to use shared memory here. If the additional branches are causing too many problems then maybe we could change how the work is spread across a workgroup so that the number of iterations is uniform, but that could also affect perf (likely making it worse, I'd guess).
To get rid of the branches we could just have the main i loop run with no checks as long as we have enough blocks remaining to use all threads, and then switch to a separate code path for the final multiplications. There's no need to redo the algorithm.
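Roughly, using the names from the q6_k hunk above, the structure could look like this (just a sketch of the control flow, not tested):

```glsl
// Main loop: runs only while every thread in the workgroup still has a block, so
// barrier() is reached in uniform control flow and needs no per-iteration bounds checks.
const uint full_iters = num_blocks_per_row / it_size;
uint i = ix;
[[unroll]] for (uint iter = 0; iter < full_iters; ++iter, i += it_size) {
    sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
    barrier();
    // ... unchecked multiply-accumulate for block i ...
    barrier();   // keep the next iteration's scale writes from racing ahead of these reads
}
// Tail: fewer blocks than threads remain. The branch condition is uniform across the
// workgroup, so every thread still reaches the barrier; only loads/accumulation are guarded.
if (num_blocks_per_row % it_size != 0u) {
    if (i < num_blocks_per_row) {
        sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
    }
    barrier();
    if (i < num_blocks_per_row) {
        // ... multiply-accumulate for the final block ...
    }
}
```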
Okay I've fixed up Q6_K to handle the early return case, and it's now running at 23.3 t/s with a few extra tweaks. @0cc4m can you try this on Intel to see if it prevents the crash?
I tested the latest Q6_K changes on RTX 4070. For llama-bench with llama-2-7b.Q6_K, the perf is basically unchanged, which is not surprising since it's just memory bandwidth-limited. The directed perf results are more interesting:
So there's a nice boost for larger n, but it just falls off a cliff for n=8. I looked into this, and what's happening is the barriers are causing all the loads of the B matrix to be bunched together, and it's using too many registers. I tried moving all the B loads to the start of the function and saving them in local arrays, and that seems to resolve the issue:
[directed perf results omitted]
We can make inference run a bit faster by extracting the scales in parallel and saving them to shared memory, where they'll be used by all the threads working on the superblock. This came out of the experiments in #10999.
This was not done for Q4_K and Q5_K as their scales are packed in a complicated way which makes this method even slower.
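The core of the pattern, as it appears in the hunks above (simplified; array dimensions and thread mapping here are illustrative):

```glsl
// Simplified view of the scale-caching pattern; see the hunks above for the per-quant details.
shared FLOAT_TYPE sccache[ROWS_PER_WORKGROUP][SCALES_PER_SUPERBLOCK];

// Each thread working on a superblock extracts and converts one scale ...
sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
barrier();
// ... then, after the barrier, every thread reads whatever converted scales it needs
// from shared memory instead of unpacking and converting them itself.
```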
PR: [benchmark output omitted]
Master: [benchmark output omitted]