vulkan: scale caching for k quants + misc fixes #11081
base: master
Conversation
scales[2] = FLOAT_TYPE(data_a[ib0 + i].scales[s_offset + 4]);
scales[3] = FLOAT_TYPE(data_a[ib0 + i].scales[s_offset + 6]);
sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
barrier();
I took a closer look because the shaders you changed (q2_k, q3_k and q6_k) crash my Intel A770. It stops crashing if I remove this barrier.
If you add barriers, you need to make sure there are no early returns in the shader (otherwise it's undefined behaviour). In this case we do have one, but that is easy to fix. Removing the early return does not fix the crash, though, so it's something else.
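For illustration, a minimal standalone sketch of that bug class (a made-up kernel; none of these names come from the actual shaders):

```glsl
#version 450
layout(local_size_x = 32) in;

layout(std430, binding = 0) buffer Data { float vals[]; };

shared float cache[32];

void main() {
    const uint gid = gl_GlobalInvocationID.x;
    const uint lid = gl_LocalInvocationID.x;

    // Early return before a barrier: if some invocations of the workgroup take this
    // path while others go on to hit barrier(), the barrier executes in nonuniform
    // control flow and the behaviour is undefined -- this is the hazard being described.
    if (gid >= uint(vals.length())) {
        return;
    }

    cache[lid] = vals[gid];
    barrier();                              // undefined if any invocation already returned
    vals[gid] = cache[(lid + 1u) % 32u];
}
```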
The number of loop iterations in the outermost loop is nonuniform, so this doesn't work as-is. It's probably fixable.
Ah, I didn't realize this. I can certainly fix it but the crash remains a problem.
So I tried to fix this, but unfortunately the extra branches make it run much slower than master, at 18 t/s. Now all the threads enter the loop and hit the barrier even if they don't have a block to calculate.
---------- ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q6_k.comp ----------
index e1afd55e..a10c194d 100644
@@ -39,16 +39,22 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
}
}
- [[unroll]] for (uint i = ix; i < num_blocks_per_row; i += it_size) {
+ [[unroll]] for (uint i0 = 0; i0 < num_blocks_per_row; i0 += it_size) {
+ uint i = i0 + ix;
const uint y_idx = i * QUANT_K + y_offset;
[[unroll]] for (uint n = 0; n < num_rows; ++n) {
const uint ib0 = a_offset / QUANT_K + (first_row+n)*num_blocks_per_row;
- const FLOAT_TYPE d = FLOAT_TYPE(data_a[ib0 + i].d);
- sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
+ if (i < num_blocks_per_row)
+ sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
barrier();
+ if (i >= num_blocks_per_row)
+ continue;
+
+ const FLOAT_TYPE d = FLOAT_TYPE(data_a[ib0 + i].d);
+
uint32_t ql0_u32 = uint32_t(data_a_packed16[ib0 + i].ql[ql_offset / 2]) | (uint32_t(data_a_packed16[ib0 + i].ql[ql_offset / 2 + 1]) << 16);
uint32_t ql32_u32 = uint32_t(data_a_packed16[ib0 + i].ql[ql_offset / 2 + 16]) | (uint32_t(data_a_packed16[ib0 + i].ql[ql_offset / 2 + 17]) << 16);
So I tried to fix this but unfortunately the extra branches make it run much slower than master at 18 t/s.
I'd guess that the branches cause the compiler to insert a wait on the scale load and it can't overlap other work with it.
uvec4 s0_hi4 = uvec4(unpack8(s0_hi4_u32));
uvec4 s4_hi4 = uvec4(unpack8(s4_hi4_u32));
sccache[ix][0][itid] = FLOAT_TYPE(bitfieldExtract(uint(data_a[ib0 + i].scales[itid8]), int(v_im*4), 4)); // lower 8 bytes
sccache[ix][1][itid] = FLOAT_TYPE(bitfieldExtract(uint(data_a[ib0 + i].scales[itid8+8]), int(v_im*4), 4)); // upper 8 bytes
NVIDIA GPUs are partly bottlenecked by the unpacking and conversion to float, so reusing work across threads may help (though I'm not sure if shared memory loads will be faster). The shared memory use and barrier also put some constraints on how the compiler can reorder things when NUM_COLS is greater than one, so we'll need to perf test that too.
I have some optimizations in progress for q4_k to reduce the unpacking and conversion cost (some will be applicable to other types, too). I'll share them in the next day or so and we can try them both out.
though I'm not sure if shared memory loads will be faster
For the speed of shared memory loads I generally use IQ4_NL as a guideline, which is nearly the same speed as Q4_0 (around 2% slower) despite the use of the LUT. The parsing of the 4-bit packed weights is the same, but Q4_0 has an extra float conversion and subtraction versus the LUT indexing in IQ4_NL. My guess is that a shared memory read takes only a few clock cycles.
I mean theoretically a subgroup shuffle should be the best way to handle this, as the data would be directly copied between registers rather than having to go through shared memory. Interestingly though I saw in #10999 that shuffles were slower than shared memory when applied to the IQ4_NL LUT, and I didn't bother exploring that further.
As for reusing the conversions and parallelizing them across threads, that should almost guarantee better performance unless the shared memory or barriers slow things down a lot. So basically it's a race between the following two patterns (a rough sketch follows the list):
- load from main memory -> convert to float -> upload to shared memory -> barrier -> read shared memory x4
- load from main memory x4 (a smart GPU might be able to reduce this) -> convert to float x4 -> read from registers x4
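To make the comparison concrete, here is a rough standalone sketch of both patterns (entirely hypothetical names and layout, not this PR's shaders; each group of 4 threads shares 4 scales):

```glsl
#version 450
layout(local_size_x = 32) in;

layout(std430, binding = 0) readonly  buffer Scales { uint raw_scales[]; };
layout(std430, binding = 1) writeonly buffer Out    { float result[]; };

shared float sc_shared[32];

void main() {
    const uint tid  = gl_LocalInvocationID.x;
    const uint base = tid & ~3u;                  // first of the 4 scales this thread needs

    // Pattern 1: cooperative -- each thread loads and converts ONE scale, publishes it,
    // then reads the 4 it needs from shared memory after a barrier.
    sc_shared[tid] = float(raw_scales[tid]);      // 1 load + 1 convert per thread
    barrier();
    float acc1 = 0.0;
    for (uint k = 0; k < 4; ++k) {
        acc1 += sc_shared[base + k];              // 4 shared-memory reads
    }

    // Pattern 2: private -- each thread loads and converts all 4 scales itself and
    // keeps them in registers (no barrier, but 4x the loads and conversions).
    float acc2 = 0.0;
    for (uint k = 0; k < 4; ++k) {
        acc2 += float(raw_scales[base + k]);
    }

    result[tid] = acc1 + acc2;                    // so neither path is optimized away
}
```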
I have some optimizations in progress for q4_k to reduce the unpacking and conversion cost
I'm definitely interested in seeing those when they're ready 😉
I have some optimizations in progress for q4_k to reduce the unpacking and conversion cost
Unfortunately this didn't pan out. I can get some decent gains in the directed tests, but in real models (where the L2 miss rate is much higher) it's bandwidth-limited and doesn't help.
If your changes are something like ced706d, then that's giving me a 5% improvement in Q4_K inference speed, as I'm compute bound.
| model | size | params | backend | ngl | threads | main_gpu | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Small (Master) | 4.36 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 27.49 ± 0.07 |
| llama 8B Q4_K - Small (PR) | 4.36 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 28.82 ± 0.02 |
I had similar changes to the bit twiddling on the scales, I agree we should do those (and the same for q5_k). I was also using the "fast int to float" trick (in fp16x2) where you OR the integer into the mantissa. I've pushed the changes at jeffbolznv@33b2512 if you want to try it.
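For reference, here is my own minimal sketch of that trick for fp16 (not the code from the linked commit): 0x6400 is the bit pattern of 1024.0 in fp16, so OR-ing an integer in [0, 1023] into the 10-bit mantissa produces 1024.0 + n, and subtracting 1024.0 leaves n already converted; doing it on a packed 32-bit word converts two values at once.

```glsl
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

// Hypothetical helper: converts two unsigned integers in [0, 1023] to f16vec2 without
// an int-to-float instruction (6-bit unsigned scales fit; signed scales need extra handling).
f16vec2 fast_u10x2_to_f16vec2(uint lo, uint hi) {
    // OR each value into the mantissa of 1024.0 (0x6400), giving 1024.0 + value ...
    const uint packed = (lo & 0x3FFu) | ((hi & 0x3FFu) << 16) | 0x64006400u;
    // ... then subtract 1024.0 to recover the values as fp16.
    return unpackFloat2x16(packed) - f16vec2(1024.0hf);
}
```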
With the same bit twiddling change on Q5_K I'm seeing a smaller 3% improvement, but it's still an improvement!
| model | size | params | backend | ngl | threads | main_gpu | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q5_K - Small (Master) | 5.21 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 22.39 ± 0.02 |
| llama 8B Q5_K - Small (PR) | 5.21 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 23.02 ± 0.02 |
I was also using the "fast int to float" trick (in fp16x2) where you OR the integer into the mantissa.
That's an interesting trick. I think it's only worth doing on newer GPUs which pack the f16vec2 into a single 32 bit register and do two FP16 calculations at once, and if I quickly estimate the instruction counts on paper it only saves a few cycles compared to the regular unpack and vector conversion. On my RX 470 with no packed FP16 I get a much slower 23 t/s on Q4_K, and the shader straight up fails to compile on my W8100 as that card has no FP16 support. Are you seeing a big improvement on your 4070?
It's about 10% faster in the directed tests, but it doesn't improve real models because they're bandwidth limited. So I'm not pursuing it any further.
RTX 4070 results. Keep in mind there's a lot of variability in the results, but at first glance it seems like an improvement for Q3_K but worse for the others:
[benchmark table omitted]
For multiple n values I'm seeing clear improvements with Q3_K and Q6_K, but Q2_K is much less consistent and is in some cases slower than master.
PR: [benchmark output omitted]
Master: [benchmark output omitted]
I tried calculating the A * scale multiplication ahead of time for Q2_K, but it didn't do much. That also should reduce the number of shared memory reads as the products are stored in registers. A * scale multiplication cached in registers:
[benchmark output omitted]
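Roughly what that looks like (a sketch reusing names from the hunks above; the counts are illustrative, not the actual change):

```glsl
// Multiply the superblock scale d into the cached sub-block scales once, so the inner
// loop reads ready-made products from registers instead of redoing d * scale every time.
FLOAT_TYPE d_scale[8];
[[unroll]] for (uint k = 0; k < 8; ++k) {
    d_scale[k] = d * sccache[ix][k];   // one shared-memory read + one multiply per product
}
// the inner loop then becomes something like:
//   sum = fma(FLOAT_TYPE(q_val), d_scale[sub], sum);
```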
I'll post benchmarks at a later point, but this reduces performance on RTX 3090 for q2_k and q6_k. I see small improvements on Radeon Pro VII. Intel still crashes, but only in
IMO the crash is still very likely related to the barriers in nonuniform control flow. It really needs to be fixed if we're going to use shared memory here. If the additional branches are causing too many problems then maybe we could change how the work is spread across a workgroup so that the number of iterations is uniform, but that could also affect perf (likely making it worse, I'd guess).
To get rid of the branches we could just have the main i loop run with no checks as long as we have enough blocks remaining to use all threads, and then switch to a separate code path for the final multiplications. There's no need to redo the algorithm.
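Roughly, using the names from the q6_k hunk above, the structure could look like this (just a sketch of the control flow, not tested):

```glsl
// Main loop: runs only while every thread in the workgroup still has a block, so
// barrier() is reached in uniform control flow and needs no per-iteration bounds checks.
const uint full_iters = num_blocks_per_row / it_size;
uint i = ix;
[[unroll]] for (uint iter = 0; iter < full_iters; ++iter, i += it_size) {
    sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
    barrier();
    // ... unchecked multiply-accumulate for block i ...
    barrier();   // keep the next iteration's scale writes from racing ahead of these reads
}
// Tail: fewer blocks than threads remain. The branch condition is uniform across the
// workgroup, so every thread still reaches the barrier; only loads/accumulation are guarded.
if (num_blocks_per_row % it_size != 0u) {
    if (i < num_blocks_per_row) {
        sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
    }
    barrier();
    if (i < num_blocks_per_row) {
        // ... multiply-accumulate for the final block ...
    }
}
```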
Okay I've fixed up Q6_K to handle the early return case, and it's now running at 23.3 t/s with a few extra tweaks. @0cc4m can you try this on Intel to see if it prevents the crash?
I tested the latest Q6_K changes on RTX 4070. For llama-bench with llama-2-7b.Q6_K, the perf is basically unchanged, which is not surprising since it's just memory bandwidth-limited. The directed perf results are more interesting:
So there's a nice boost for larger n, but it just falls off a cliff for n=8. I looked into this, and what's happening is the barriers are causing all the loads of the B matrix to be bunched together, and it's using too many registers. I tried moving all the B loads to the start of the function and saving them in local arrays, and that seems to resolve the issue:
[directed perf results omitted]
We can make inference run a bit faster by extracting the scales in parallel and saving them to shared memory, where they'll be used by all the threads working on the superblock. This came out of the experiments in #10999.
This was not done for Q4_K and Q5_K as their scales are packed in a complicated way which makes this method even slower.
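The core of the pattern, as it appears in the hunks above (simplified; array dimensions and thread mapping here are illustrative):

```glsl
// Simplified view of the scale-caching pattern; see the hunks above for the per-quant details.
shared FLOAT_TYPE sccache[ROWS_PER_WORKGROUP][SCALES_PER_SUPERBLOCK];

// Each thread working on a superblock extracts and converts one scale ...
sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
barrier();
// ... then, after the barrier, every thread reads whatever converted scales it needs
// from shared memory instead of unpacking and converting them itself.
```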
PR: [benchmark output omitted]
Master: [benchmark output omitted]