
vulkan: scale caching for k quants + misc fixes #11081

Status: Open. Wants to merge 17 commits into base: master.
Conversation

netrunnereve (Collaborator):

We can make inference run a bit faster by extracting the scales in parallel and saving them to shared memory, where they'll be used by all the threads working on the superblock. This came out of the experiments in #10999.

This was not done for Q4_K and Q5_K, as their scales are packed in a complicated way that makes this method slower rather than faster.
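The core pattern is sketched below as a CPU-side model in C. The names (`sccache`, `load_scale`, `scaled_partial`) are invented for illustration; this is not the shader code, just the cooperative load/barrier/broadcast idea it implements.

```c
#include <assert.h>
#include <stdint.h>

#define SUBBLOCKS 16

/* Illustrative CPU model of the scale-caching idea. Instead of every
 * thread re-reading and converting all 16 sub-block scales of a
 * superblock, each of 16 cooperating threads converts one int8 scale
 * (Q6_K stores plain int8 scales, which is what makes this cheap) into
 * a shared cache; after a barrier, every thread reads ready-made
 * floats from it. */

static float sccache[SUBBLOCKS];               /* models shared memory */

/* phase 1: thread `tid` loads and converts exactly one scale */
static void load_scale(int tid, const int8_t *scales) {
    sccache[tid] = (float)scales[tid];
}

/* phase 2, after barrier(): any thread reads any scale, no conversion */
static float scaled_partial(int sb, float partial) {
    return partial * sccache[sb];
}
```

On the GPU a `barrier()` sits between the two phases so the writes are visible to the whole workgroup before anyone reads.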

PR:

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   5112 runs -   232.89 us/run - 117.44 MFLOP/run - 504.27 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   359.69 us/run - 117.44 MFLOP/run - 326.50 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   5112 runs -   234.78 us/run - 117.44 MFLOP/run - 500.22 GFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   313.31 us/run - 117.44 MFLOP/run - 374.84 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   333.78 us/run - 117.44 MFLOP/run - 351.85 GFLOPS
| model | size | params | backend | ngl | threads | main_gpu | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 24.78 ± 0.03 |
| llama 8B Q3_K - Medium | 3.74 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 21.98 ± 0.02 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 100 | 8 | 1 | none | tg128 | 22.27 ± 0.01 |

Master:

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   241.10 us/run - 117.44 MFLOP/run - 487.09 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   449.01 us/run - 117.44 MFLOP/run - 261.56 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   235.58 us/run - 117.44 MFLOP/run - 498.51 GFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   315.21 us/run - 117.44 MFLOP/run - 372.58 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   365.79 us/run - 117.44 MFLOP/run - 321.06 GFLOPS
| model | size | params | backend | ngl | threads | main_gpu | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 22.15 ± 0.01 |
| llama 8B Q3_K - Medium | 3.74 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 18.97 ± 0.00 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 100 | 8 | 1 | none | tg128 | 20.38 ± 0.00 |

@github-actions bot added the Vulkan and ggml labels (Jan 5, 2025)
@netrunnereve requested a review from 0cc4m (Jan 5, 2025)
@jeffbolznv self-requested a review (Jan 5, 2025)
Review thread on ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q6_k.comp (outdated):
scales[2] = FLOAT_TYPE(data_a[ib0 + i].scales[s_offset + 4]);
scales[3] = FLOAT_TYPE(data_a[ib0 + i].scales[s_offset + 6]);
sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
barrier();
Collaborator:

I took a closer look because the shaders you changed (q2_k, q3_k and q6_k) crash my Intel A770. It stops crashing if I remove this barrier.

If you add barriers, you need to make sure there are no early returns in the shader (otherwise it's undefined behaviour). In this case we do have one, but that is easy to fix. Removing the early return does not fix the crash, though, so it's something else.

Collaborator:

The number of loop iterations in the outermost loop is nonuniform, so this doesn't work as-is. It's probably fixable.

netrunnereve (Collaborator, Author) commented Jan 5, 2025:

Ah, I didn't realize this. I can certainly fix it but the crash remains a problem.

netrunnereve (Collaborator, Author):

So I tried to fix this but unfortunately the extra branches make it run much slower than master at 18 t/s. Now all the threads enter the loop and hit the barrier even if they don't have a block to calculate.

---------- ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q6_k.comp ----------
index e1afd55e..a10c194d 100644
@@ -39,16 +39,22 @@ void compute_outputs(const uint32_t first_row, const uint32_t num_rows) {
         }
     }
 
-    [[unroll]] for (uint i = ix; i < num_blocks_per_row; i += it_size) {
+    [[unroll]] for (uint i0 = 0; i0 < num_blocks_per_row; i0 += it_size) {
+        uint i = i0 + ix;
         const uint y_idx = i * QUANT_K + y_offset;
 
         [[unroll]] for (uint n = 0; n < num_rows; ++n) {
             const uint ib0 = a_offset / QUANT_K + (first_row+n)*num_blocks_per_row;
-            const FLOAT_TYPE d = FLOAT_TYPE(data_a[ib0 + i].d);
 
-            sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
+            if (i < num_blocks_per_row)
+                sccache[ix][itid] = FLOAT_TYPE(data_a[ib0 + i].scales[itid]);
             barrier();
 
+            if (i >= num_blocks_per_row)
+                continue;
+
+            const FLOAT_TYPE d = FLOAT_TYPE(data_a[ib0 + i].d);
+
             uint32_t ql0_u32 =  uint32_t(data_a_packed16[ib0 + i].ql[ql_offset / 2]) | (uint32_t(data_a_packed16[ib0 + i].ql[ql_offset / 2 + 1]) << 16);
             uint32_t ql32_u32 = uint32_t(data_a_packed16[ib0 + i].ql[ql_offset / 2 + 16]) | (uint32_t(data_a_packed16[ib0 + i].ql[ql_offset / 2 + 17]) << 16);

Collaborator:

> So I tried to fix this but unfortunately the extra branches make it run much slower than master at 18 t/s.

I'd guess that the branches cause the compiler to insert a wait on the scale load and it can't overlap other work with it.


uvec4 s0_hi4 = uvec4(unpack8(s0_hi4_u32));
uvec4 s4_hi4 = uvec4(unpack8(s4_hi4_u32));
sccache[ix][0][itid] = FLOAT_TYPE(bitfieldExtract(uint(data_a[ib0 + i].scales[itid8]), int(v_im*4), 4)); // lower 8 bytes
sccache[ix][1][itid] = FLOAT_TYPE(bitfieldExtract(uint(data_a[ib0 + i].scales[itid8+8]), int(v_im*4), 4)); // upper 8 bytes
Collaborator:

NVIDIA GPUs are partly bottlenecked by the unpacking and conversion to float, so reusing work across threads may help (though I'm not sure if shared memory loads will be faster). The shared memory use and barrier also puts some constraints on how the compiler can reorder things when NUM_COLS is greater than one, so we'll need to perf test that too.

I have some optimizations in progress for q4_k to reduce the unpacking and conversion cost (some will be applicable to other types, too). I'll share them in the next day or so and we can try them both out.

netrunnereve (Collaborator, Author):

> though I'm not sure if shared memory loads will be faster

For the speed of shared memory loads I generally use IQ4_NL as a guideline: it's nearly the same speed as Q4_0 (around 2% slower) despite the use of the LUT. The parsing of the 4-bit packed weights is the same, but Q4_0 has an extra float conversion and subtraction versus the LUT indexing in IQ4_NL. My guess is that a shared memory read takes only a few clock cycles.

I mean theoretically a subgroup shuffle should be the best way to handle this, as the data would be directly copied between registers rather than having to go through shared memory. Interestingly though I saw in #10999 that shuffles were slower than shared memory when applied to the IQ4_NL LUT, and I didn't bother exploring that further.

As for reusing the conversions and parallelizing them across threads, that should almost guarantee better performance unless the shared memory or barriers slow things down a lot. So basically it's a race between:

  1. load from main memory -> convert to float -> upload to shared memory -> barrier -> read shared memory x4
  2. load from main memory x4 (a smart GPU might be able to reduce this) -> convert to float x4 -> read from registers x4

> I have some optimizations in progress for q4_k to reduce the unpacking and conversion cost

I'm definitely interested in seeing those when they're ready 😉

Collaborator:

> I have some optimizations in progress for q4_k to reduce the unpacking and conversion cost

Unfortunately this didn't pan out. I can get some decent gains in the directed tests, but in real models (where the L2 miss rate is much higher) it's bandwidth-limited and doesn't help.

netrunnereve (Collaborator, Author):

If your changes are something like ced706d then that's giving me a 5% improvement in Q4_K inference speed as I'm compute bound.

| model | size | params | backend | ngl | threads | main_gpu | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K - Small (Master) | 4.36 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 27.49 ± 0.07 |
| llama 8B Q4_K - Small (PR) | 4.36 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 28.82 ± 0.02 |

Collaborator:

I had similar changes to the bit twiddling on the scales, I agree we should do those (and the same for q5_k). I was also using the "fast int to float" trick (in fp16x2) where you OR the integer into the mantissa. I've pushed the changes at jeffbolznv@33b2512 if you want to try it.
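The fp32 analogue of the trick can be sketched in C as below. `fast_u23_to_float` is a hypothetical helper name, not code from the linked commit; the shader variant applies the same idea in fp16x2 (magic value 1024.0h, so two 8-bit scales convert per 32-bit register).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* "Fast int to float" by OR-ing the integer into the mantissa, shown
 * here in fp32 on the CPU. For any 0 <= v < 2^23, 0x4B000000 is the bit
 * pattern of 8388608.0f (2^23) with a zero mantissa, so OR-ing v into
 * the mantissa yields exactly 8388608.0f + v; one subtraction recovers
 * (float)v without an int-to-float conversion instruction. */
static float fast_u23_to_float(uint32_t v) {
    uint32_t bits = 0x4B000000u | v;   /* exponent of 2^23, mantissa = v */
    float f;
    memcpy(&f, &bits, sizeof f);       /* bit-cast, no conversion */
    return f - 8388608.0f;
}
```

In fp16 the headroom is much smaller (magic exponent of 1024.0h leaves a 10-bit mantissa), which is why it fits 8-bit scale values but nothing much larger.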

netrunnereve (Collaborator, Author):

With the same bit twiddling change on Q5_K I'm seeing a smaller 3% improvement, but it's still an improvement!

| model | size | params | backend | ngl | threads | main_gpu | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q5_K - Small (Master) | 5.21 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 22.39 ± 0.02 |
| llama 8B Q5_K - Small (PR) | 5.21 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 23.02 ± 0.02 |

> I was also using the "fast int to float" trick (in fp16x2) where you OR the integer into the mantissa.

That's an interesting trick. I think it's only worth doing on newer GPUs that pack an f16vec2 into a single 32-bit register and do two FP16 calculations at once, and a quick paper estimate of the instruction counts suggests it only saves a few cycles compared to the regular unpack and vector conversion. On my RX 470, which has no packed FP16, I get a much slower 23 t/s on Q4_K, and the shader straight up fails to compile on my W8100 as that card has no FP16 support at all. Are you seeing a big improvement on your 4070?

Collaborator:

It's about 10% faster in the directed tests, but it doesn't improve real models because they're bandwidth limited. So I'm not pursuing it any further.

jeffbolznv (Collaborator):

RTX 4070 results. Keep in mind there's a lot of variability in the results, but at first glance it seems like an improvement for Q3_K but worse for the others:

after:

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  17040 runs -    60.51 us/run - 117.44 MFLOP/run -   1.94 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  10650 runs -    95.48 us/run - 234.88 MFLOP/run -   2.46 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   7952 runs -   126.38 us/run - 352.32 MFLOP/run -   2.79 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   6603 runs -   153.03 us/run - 469.76 MFLOP/run -   3.07 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   6498 runs -   157.50 us/run - 587.20 MFLOP/run -   3.73 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2568 runs -   402.20 us/run - 939.52 MFLOP/run -   2.34 TFLOPS

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  12780 runs -    79.41 us/run - 117.44 MFLOP/run -   1.48 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   9798 runs -   105.30 us/run - 234.88 MFLOP/run -   2.23 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   6816 runs -   148.98 us/run - 352.32 MFLOP/run -   2.36 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   5964 runs -   172.48 us/run - 469.76 MFLOP/run -   2.72 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4617 runs -   220.06 us/run - 587.20 MFLOP/run -   2.67 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2675 runs -   384.43 us/run - 939.52 MFLOP/run -   2.44 TFLOPS

  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   7668 runs -   131.17 us/run - 117.44 MFLOP/run - 895.32 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   7668 runs -   133.59 us/run - 234.88 MFLOP/run -   1.76 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   6248 runs -   163.28 us/run - 352.32 MFLOP/run -   2.16 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   6603 runs -   153.27 us/run - 469.76 MFLOP/run -   3.06 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   6156 runs -   164.57 us/run - 587.20 MFLOP/run -   3.57 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3959 runs -   253.39 us/run - 939.52 MFLOP/run -   3.71 TFLOPS

before:

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  18744 runs -    54.18 us/run - 117.44 MFLOP/run -   2.17 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  13632 runs -    73.36 us/run - 234.88 MFLOP/run -   3.20 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  10508 runs -    95.71 us/run - 352.32 MFLOP/run -   3.68 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   8307 runs -   122.29 us/run - 469.76 MFLOP/run -   3.84 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   7011 runs -   145.50 us/run - 587.20 MFLOP/run -   4.04 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3531 runs -   284.81 us/run - 939.52 MFLOP/run -   3.30 TFLOPS

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  10224 runs -   104.74 us/run - 117.44 MFLOP/run -   1.12 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   8520 runs -   123.00 us/run - 234.88 MFLOP/run -   1.91 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   6816 runs -   148.62 us/run - 352.32 MFLOP/run -   2.37 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   6177 runs -   162.05 us/run - 469.76 MFLOP/run -   2.90 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4959 runs -   203.90 us/run - 587.20 MFLOP/run -   2.88 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3317 runs -   309.30 us/run - 939.52 MFLOP/run -   3.04 TFLOPS

  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  10224 runs -   105.69 us/run - 117.44 MFLOP/run -   1.11 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   9372 runs -   110.35 us/run - 234.88 MFLOP/run -   2.13 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   8236 runs -   122.10 us/run - 352.32 MFLOP/run -   2.89 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   7029 runs -   142.71 us/run - 469.76 MFLOP/run -   3.29 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   6156 runs -   166.92 us/run - 587.20 MFLOP/run -   3.52 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4601 runs -   217.50 us/run - 939.52 MFLOP/run -   4.32 TFLOPS

netrunnereve (Collaborator, Author):

For multiple ns I'm seeing clear improvements with Q3_K and Q6_K, but Q2_K is much less consistent and is in some cases slower than master.

PR:

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2982 runs -   337.74 us/run - 234.88 MFLOP/run - 695.45 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   441.88 us/run - 352.32 MFLOP/run - 797.33 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1917 runs -   566.21 us/run - 469.76 MFLOP/run - 829.66 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1368 runs -   740.15 us/run - 587.20 MFLOP/run - 793.36 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    963 runs -  1064.79 us/run - 939.52 MFLOP/run - 882.36 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   454.94 us/run - 234.88 MFLOP/run - 516.29 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1988 runs -   539.48 us/run - 352.32 MFLOP/run - 653.08 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1491 runs -   754.86 us/run - 469.76 MFLOP/run - 622.32 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1197 runs -   862.33 us/run - 587.20 MFLOP/run - 680.95 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    856 runs -  1182.12 us/run - 939.52 MFLOP/run - 794.78 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2982 runs -   388.88 us/run - 234.88 MFLOP/run - 603.99 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2272 runs -   464.96 us/run - 352.32 MFLOP/run - 757.74 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1917 runs -   550.29 us/run - 469.76 MFLOP/run - 853.67 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1539 runs -   675.49 us/run - 587.20 MFLOP/run - 869.30 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1070 runs -   966.01 us/run - 939.52 MFLOP/run - 972.59 GFLOPS

Master:

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   336.28 us/run - 234.88 MFLOP/run - 698.47 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2272 runs -   458.81 us/run - 352.32 MFLOP/run - 767.90 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1917 runs -   573.96 us/run - 469.76 MFLOP/run - 818.45 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1539 runs -   727.08 us/run - 587.20 MFLOP/run - 807.62 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    963 runs -  1067.80 us/run - 939.52 MFLOP/run - 879.87 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2130 runs -   543.67 us/run - 234.88 MFLOP/run - 432.03 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1704 runs -   642.54 us/run - 352.32 MFLOP/run - 548.33 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1278 runs -   885.94 us/run - 469.76 MFLOP/run - 530.24 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1026 runs -  1004.95 us/run - 587.20 MFLOP/run - 584.31 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    856 runs -  1270.78 us/run - 939.52 MFLOP/run - 739.33 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   425.50 us/run - 234.88 MFLOP/run - 552.01 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1988 runs -   537.97 us/run - 352.32 MFLOP/run - 654.91 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1704 runs -   625.29 us/run - 469.76 MFLOP/run - 751.28 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1368 runs -   771.12 us/run - 587.20 MFLOP/run - 761.49 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    963 runs -  1076.21 us/run - 939.52 MFLOP/run - 872.99 GFLOPS

I tried calculating the A * scale multiplication ahead of time for Q2_K, but it didn't do much. That should also reduce the number of shared memory reads, as the products are stored in registers.

A * scale multiplication cached in registers:

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   332.67 us/run - 234.88 MFLOP/run - 706.06 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2272 runs -   443.91 us/run - 352.32 MFLOP/run - 793.69 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1917 runs -   565.84 us/run - 469.76 MFLOP/run - 830.20 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   1368 runs -   741.69 us/run - 587.20 MFLOP/run - 791.71 GFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    963 runs -  1071.39 us/run - 939.52 MFLOP/run - 876.92 GFLOPS

0cc4m (Collaborator) commented Jan 7, 2025:

I'll post benchmarks at a later point, but this reduces performance on RTX 3090 for q2_k and q6_k. I see small improvements on Radeon Pro VII. Intel still crashes, but only in test-backend-ops -o MUL_MAT. I don't know what's going on there, since test-backend-ops -o MUL_MAT perf passes just fine. Looking at the perf results, it's a small improvement on A770, too.

jeffbolznv (Collaborator):

IMO the crash is still very likely related to the barriers in nonuniform control flow. It really needs to be fixed if we're going to use shared memory here. If the additional branches are causing too many problems then maybe we could change how the work is spread across a workgroup so that the number of iterations is uniform, but that could also affect perf (likely making it worse, I'd guess).

netrunnereve (Collaborator, Author):

> If the additional branches are causing too many problems then maybe we could change how the work is spread across a workgroup so that the number of iterations is uniform, but that could also affect perf

To get rid of the branches we could just have the main i loop run with no checks for as long as there are enough blocks remaining to occupy all threads, and then switch to a separate code path for the final multiplications. There's no need to redo the algorithm.
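That split can be sketched as a scalar C model. `thread_sum` is a hypothetical helper, not the shader; `ix` is a thread's index among the `it_size` threads striding over the blocks.

```c
#include <assert.h>
#include <stdint.h>

/* CPU model of the suggested loop split: run the unchecked main loop
 * only over the first full "tile" of iterations, which every thread
 * executes the same number of times (so a barrier inside it would sit
 * in uniform control flow), then finish the remainder in a separately
 * checked tail. */
static float thread_sum(const float *blocks, uint32_t num_blocks,
                        uint32_t it_size, uint32_t ix) {
    /* largest multiple of it_size <= num_blocks: a uniform bound */
    uint32_t full = (num_blocks / it_size) * it_size;
    float acc = 0.0f;
    for (uint32_t i = ix; i < full; i += it_size) {
        /* barrier()-safe region: every thread runs this the same count */
        acc += blocks[i];
    }
    if (full + ix < num_blocks)    /* tail: at most one checked block */
        acc += blocks[full + ix];
    return acc;
}
```

Because `full` is the same for every thread, the main loop's trip count no longer depends on `ix`, which is exactly the uniformity the barrier needs.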

netrunnereve (Collaborator, Author) commented Jan 8, 2025:

Okay I've fixed up Q6_K to handle the early return case, and it's now running at 23.3 t/s with a few extra tweaks. @0cc4m can you try this on Intel to see if it prevents the crash?

jeffbolznv (Collaborator):

I tested the latest Q6_K changes on RTX 4070. For llama-bench with llama-2-7b.Q6_K, the perf is basically unchanged, which is not surprising since it's just memory bandwidth-limited. The directed perf results are more interesting:

before:
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  46860 runs -   107.44 us/run - 117.44 MFLOP/run -   1.09 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  45582 runs -   110.08 us/run - 234.88 MFLOP/run -   2.13 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  39760 runs -   126.70 us/run - 352.32 MFLOP/run -   2.78 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  33654 runs -   149.37 us/run - 469.76 MFLOP/run -   3.15 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  30438 runs -   164.95 us/run - 587.20 MFLOP/run -   3.56 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  22684 runs -   221.28 us/run - 939.52 MFLOP/run -   4.25 TFLOPS

after:
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  45156 runs -   112.21 us/run - 117.44 MFLOP/run -   1.05 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  46860 runs -   106.90 us/run - 234.88 MFLOP/run -   2.20 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  43168 runs -   116.55 us/run - 352.32 MFLOP/run -   3.02 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  44304 runs -   113.42 us/run - 469.76 MFLOP/run -   4.14 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  37962 runs -   132.16 us/run - 587.20 MFLOP/run -   4.44 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   9202 runs -   544.83 us/run - 939.52 MFLOP/run -   1.72 TFLOPS

So there's a nice boost for larger n, but it just falls off a cliff for n=8. I looked into this, and what's happening is the barriers are causing all the loads of the B matrix to be bunched together, and it's using too many registers. I tried moving all the B loads to the start of the function and saving them in local arrays, and that seems to resolve the issue:

with loads at the top:

  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  48564 runs -   104.69 us/run - 117.44 MFLOP/run -   1.12 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  47286 runs -   106.60 us/run - 234.88 MFLOP/run -   2.20 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  40328 runs -   124.44 us/run - 352.32 MFLOP/run -   2.83 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  44091 runs -   113.45 us/run - 469.76 MFLOP/run -   4.14 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  39159 runs -   127.93 us/run - 587.20 MFLOP/run -   4.59 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  22791 runs -   220.12 us/run - 939.52 MFLOP/run -   4.27 TFLOPS
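A CPU-side caricature of that restructuring (invented names, not the shader): fetch everything a thread needs from B into a local array before the barrier-fenced work, so the load phase stays compact and the math afterwards runs purely out of the cache.

```c
#include <assert.h>
#include <stdint.h>

#define NB 32  /* number of B values this thread will consume */

/* Model of the "loads at the top" fix. Inside the barrier-fenced loop
 * the compiler had to bunch every load of the B matrix together, which
 * blew the register budget at n=8; issuing all the loads into a local
 * array up front avoids that bunching. */
static float accumulate(const float *B) {
    float b_cache[NB];
    for (int i = 0; i < NB; ++i)   /* all B loads issued first */
        b_cache[i] = B[i];
    /* on the GPU, the barrier()-fenced superblock work would go here;
     * the arithmetic below touches no global memory */
    float acc = 0.0f;
    for (int i = 0; i < NB; ++i)
        acc += b_cache[i];
    return acc;
}
```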

Labels: Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning)

3 participants