Add support for `i8` dtype, add `--raw_accumulators` flag, add `--target=host_cpu` for easy local testing. #22

bjacob · 2024-10-10T18:26:13Z

A few unrelated things are mixed in this PR, but they are separate commits, in case you'd prefer me to split it into 3 PRs.

Add a --raw_accumulators flag (default False) that drops the truncation of the results. This leads to lower arithmetic intensity (because the result elements are wider, so more bytes are written) and to either higher or lower performance. This is less representative of real workloads, but is sometimes easier to reason about as a microbenchmark.
Add support for i8 dtype accumulating into i32. For now only added to the square problem set. Also added bf16 to that set.
Add a special value for the existing --target flag: "host_cpu" for testing on CPU configured for the host. This was mostly for my own use to be able to develop these changes locally without a GPU.
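
The effect of --raw_accumulators on arithmetic intensity can be sketched as follows. This is a minimal model, not the gemmbench implementation: the helper `arithmetic_intensity` and the table `BYTES_PER_ELEMENT` are hypothetical names, and the byte count assumes A and B are each read once and the result is written once. Under that assumption it reproduces the `arithmetic_intensity` values in the sample results posted in this PR (42.6667 for the 128x128x128 f16 case by default, 32.0 with raw accumulators).

```python
# Hypothetical helper for illustration; not taken from gemmbench.
BYTES_PER_ELEMENT = {"i8": 1, "f16": 2, "bf16": 2, "f32": 4, "i32": 4}


def arithmetic_intensity(m, n, k, operand_dtype, result_dtype):
    """FLOPs per byte, assuming A and B are read once and the result written once."""
    flops = 2 * m * n * k  # one multiply and one add per (m, n, k) triple
    operand_bytes = (m * k + k * n) * BYTES_PER_ELEMENT[operand_dtype]
    result_bytes = m * n * BYTES_PER_ELEMENT[result_dtype]
    return flops / (operand_bytes + result_bytes)


# Default mode: the f32 accumulator is truncated back to f16 before the write.
print(round(arithmetic_intensity(128, 128, 128, "f16", "f16"), 4))  # 42.6667
# Raw accumulators: the f32 accumulator is written as-is, so more bytes move.
print(round(arithmetic_intensity(128, 128, 128, "f16", "f32"), 4))  # 32.0
```

The same model gives 85.3333 and 42.6667 for the 128x128x128 i8 case (i8 result versus i32 result), matching the i8 rows in the sample results.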

gemmbench/gemm_utils.py

gemmbench/problems.py

gemmbench/gemm_bench.py

bjacob · 2024-10-10T18:57:30Z

Sample results on CPU:

benoit @ hocher: ~/iree-kernel-benchmark  raw_accumulators
$ cat results/iree_gemm.csv
index,tag,name,vmfb_hash,M,N,K,dtype,tA,tB,mean_microseconds,arithmetic_intensity,tflops,ok
0,square,gemm_128_128_128_f16_f32_tB,b09382fea54ae5536739e5df4596fee2,128,128,128,f16,N,T,63.0,42.6667,0.0666,True
1,square,gemm_256_256_256_f16_f32_tB,6df37abbc05238fb60a9b969259c131a,256,256,256,f16,N,T,116.0,85.3333,0.2893,True
2,square,gemm_512_512_512_f16_f32_tB,075fd059b12944e4f3ad621cf7082b45,512,512,512,f16,N,T,490.0,170.6667,0.5478,True
3,square,gemm_1024_1024_1024_f16_f32_tB,fe57ad03be60b58e160bd33ed2770a55,1024,1024,1024,f16,N,T,3440.0,341.3333,0.6243,True
4,square,gemm_2048_2048_2048_f16_f32_tB,b41f1f06fabe60fe73d30540b6446589,2048,2048,2048,f16,N,T,26900.0,682.6667,0.6387,True
5,square,gemm_4096_4096_4096_f16_f32_tB,be1ae424c5c36aa6bc581e9407250edf,4096,4096,4096,f16,N,T,219000.0,1365.3333,0.6276,True
6,square,gemm_8192_8192_8192_f16_f32_tB,22d05d920d341a2309111188e4578149,8192,8192,8192,f16,N,T,1743000.0,2730.6667,0.6308,True
7,square,gemm_128_128_128_bf16_f32_tB,5c155dc0035b4ddf883144fe03c6c243,128,128,128,bf16,N,T,59.0,42.6667,0.0711,True
8,square,gemm_256_256_256_bf16_f32_tB,912d26a5940042211334afacfc48cf42,256,256,256,bf16,N,T,79.0,85.3333,0.4247,True
9,square,gemm_512_512_512_bf16_f32_tB,972fea5500a2b7cf5dadaf18d7cb3f47,512,512,512,bf16,N,T,174.0,170.6667,1.5427,True
10,square,gemm_1024_1024_1024_bf16_f32_tB,3da6884a855976ca5ae015b5dda02fc9,1024,1024,1024,bf16,N,T,941.0,341.3333,2.2821,True
11,square,gemm_2048_2048_2048_bf16_f32_tB,93e08e504003e04ff4fa84c0db38098b,2048,2048,2048,bf16,N,T,7930.0,682.6667,2.1664,True
12,square,gemm_4096_4096_4096_bf16_f32_tB,db70d3fbf07a4a316e76ced74d95fcc6,4096,4096,4096,bf16,N,T,66800.0,1365.3333,2.0575,True
13,square,gemm_8192_8192_8192_bf16_f32_tB,d2de3b120dadf4ca63de37489371da6e,8192,8192,8192,bf16,N,T,650000.0,2730.6667,1.6916,True
14,square,gemm_128_128_128_i8_i32_tB,9ca12aded503bffcc03be47c990a7345,128,128,128,i8,N,T,51.0,85.3333,0.0822,True
15,square,gemm_256_256_256_i8_i32_tB,208edce214027caf19bf2d9c775bc2e1,256,256,256,i8,N,T,81.0,170.6667,0.4143,True
16,square,gemm_512_512_512_i8_i32_tB,09e78b7cbe1c28f124baa3cf7dbdbf00,512,512,512,i8,N,T,181.0,341.3333,1.4831,True
17,square,gemm_1024_1024_1024_i8_i32_tB,9eae5fb97ce4f416fd487221078c1450,1024,1024,1024,i8,N,T,952.0,682.6667,2.2558,True
18,square,gemm_2048_2048_2048_i8_i32_tB,f3503347e2b1ac1754f5cd50a5170a4e,2048,2048,2048,i8,N,T,7090.0,1365.3333,2.4231,True
19,square,gemm_4096_4096_4096_i8_i32_tB,a3e70f8a23d081ec359608e23deb41b8,4096,4096,4096,i8,N,T,58300.0,2730.6667,2.3574,True
20,square,gemm_8192_8192_8192_i8_i32_tB,ec6e7fdf34a2a9632653e0b10f0b84e7,8192,8192,8192,i8,N,T,527000.0,5461.3333,2.0864,True

benoit @ hocher: ~/iree-kernel-benchmark  raw_accumulators
$ cat results/iree_gemm_raw_accumulators.csv
index,tag,name,vmfb_hash,M,N,K,dtype,tA,tB,mean_microseconds,arithmetic_intensity,tflops,ok
0,square,gemm_128_128_128_f16_f32_tB,530e2d0d287e2461961c9cb047ed38d4,128,128,128,f16,N,T,56.0,32.0,0.0749,True
1,square,gemm_256_256_256_f16_f32_tB,58a1cd35216af5e482523010f4c76e54,256,256,256,f16,N,T,117.0,64.0,0.2868,True
2,square,gemm_512_512_512_f16_f32_tB,85d830845451a69c7b1d69d1d8207e1c,512,512,512,f16,N,T,490.0,128.0,0.5478,True
3,square,gemm_1024_1024_1024_f16_f32_tB,07f4a19c3f07416577cc4c39cbabc541,1024,1024,1024,f16,N,T,3430.0,256.0,0.6261,True
4,square,gemm_2048_2048_2048_f16_f32_tB,4f4f0f96965ed217c5ae70f70154ed62,2048,2048,2048,f16,N,T,26900.0,512.0,0.6387,True
5,square,gemm_4096_4096_4096_f16_f32_tB,fd15ad1301a38b38cf56ac6f727f3f5d,4096,4096,4096,f16,N,T,220000.0,1024.0,0.6247,True
6,square,gemm_8192_8192_8192_f16_f32_tB,0888464d30b0e1cbcef0d6be1777abb9,8192,8192,8192,f16,N,T,1745000.0,2048.0,0.6301,True
7,square,gemm_128_128_128_bf16_f32_tB,d33431aa67196f1afc3791d1b910f018,128,128,128,bf16,N,T,64.0,32.0,0.0655,True
8,square,gemm_256_256_256_bf16_f32_tB,6140e642a5c1d02029689b6e77682b26,256,256,256,bf16,N,T,82.0,64.0,0.4092,True
9,square,gemm_512_512_512_bf16_f32_tB,dddd7767d6408d75bb258dd1a0a4eff7,512,512,512,bf16,N,T,188.0,128.0,1.4278,True
10,square,gemm_1024_1024_1024_bf16_f32_tB,d991a20720d289cdf181f92d806d4f94,1024,1024,1024,bf16,N,T,972.0,256.0,2.2093,True
11,square,gemm_2048_2048_2048_bf16_f32_tB,4e0559db31352a81f93bec7ab8f1a9a5,2048,2048,2048,bf16,N,T,7410.0,512.0,2.3185,True
12,square,gemm_4096_4096_4096_bf16_f32_tB,a76d157cf2087c588c838e8343ec55c7,4096,4096,4096,bf16,N,T,68800.0,1024.0,1.9977,True
13,square,gemm_8192_8192_8192_bf16_f32_tB,de0a529d82abf0c907c9e2bf03f5dafd,8192,8192,8192,bf16,N,T,1287000.0,2048.0,0.8543,True
14,square,gemm_128_128_128_i8_i32_tB,009529a28240bf142d9004ad08ec44b4,128,128,128,i8,N,T,65.0,42.6667,0.0645,True
15,square,gemm_256_256_256_i8_i32_tB,f723d5d9684ff65be3e5161247d0d5ef,256,256,256,i8,N,T,82.0,85.3333,0.4092,True
16,square,gemm_512_512_512_i8_i32_tB,727afd44598b1e7c895cefa2621ad617,512,512,512,i8,N,T,187.0,170.6667,1.4355,True
17,square,gemm_1024_1024_1024_i8_i32_tB,2e620ec57eb0638da7eb2ce22c6e0c22,1024,1024,1024,i8,N,T,962.0,341.3333,2.2323,True
18,square,gemm_2048_2048_2048_i8_i32_tB,2a5cbc969650f2a73689f90d7f03e9ac,2048,2048,2048,i8,N,T,7650.0,682.6667,2.2457,True
19,square,gemm_4096_4096_4096_i8_i32_tB,2ced5edd993aaf9a6ca3bf474c15372e,4096,4096,4096,i8,N,T,60100.0,1365.3333,2.2868,True
20,square,gemm_8192_8192_8192_i8_i32_tB,0a7e72d4968bba0da2bf66fcbba17809,8192,8192,8192,i8,N,T,468000.0,2730.6667,2.3494,True
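
The `tflops` column in these results is consistent with counting 2·M·N·K floating-point (or integer) operations per matmul and dividing by the mean run time. The sketch below recomputes it for the 8192x8192x8192 bf16 row from each file. The function name `tflops` and the inline `ROWS` table are illustrative assumptions, not gemmbench code; the timings are copied from the two CSVs above.

```python
def tflops(m, n, k, mean_microseconds):
    """Tera-operations per second, counting 2*M*N*K operations per matmul."""
    operations = 2 * m * n * k
    seconds = mean_microseconds * 1e-6
    return operations / seconds / 1e12


# (label, M, N, K, mean_microseconds), copied from row 13 of each CSV above.
ROWS = [
    ("bf16 default", 8192, 8192, 8192, 650000.0),
    ("bf16 raw_accumulators", 8192, 8192, 8192, 1287000.0),
]

for label, m, n, k, micros in ROWS:
    print(label, round(tflops(m, n, k, micros), 4))
```

For this row the raw-accumulators run is roughly half as fast, while other rows (for example the 8192x8192x8192 i8 case) are faster with raw accumulators, which matches the "either higher or lower performance" remark in the PR description.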

saienduri · 2024-10-10T18:58:20Z

gemmbench/problems.py

+        # The raw_accumulators arg means "test configs where the result element
+        # type is different from what it would be in the default mode".
+        # We can't just test for (result_element_type == accumulator_element_type),
+        # as that would cause e.g. f32 matmuls to be omitted in the default mode.


Should we add a warning here that tells the user they are doing a raw_accumulators run where config.operand_element_type == get_default_accumulator_element_type(config.operand_element_type), which we don't run, so they aren't confused?

bjacob · 2024-10-10

Such a warning would print every time the user passes --raw_accumulators, right? I was thinking that since this flag is non-default, it's OK for it to have the slightly surprising semantics of omitting the cases that happen to be already covered by the default mode. I cared more about keeping the default mode unsurprising (including if we add f32 benchmarks in the future) and about avoiding overlap, with the two modes redundantly covering the same cases (which would be wasteful when running both modes one after the other).

saienduri · 2024-10-10

It would only print if they are running --raw_accumulators with f32 or i32 input configs. In that case, it might be worth having a small print or warning to let them know those configs are being skipped because of this, but I'm fine either way.

kuhar · 2024-10-10

I went ahead and merged this because I want to use i8 in my experiments.
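
The selection rule discussed in this thread, together with the optional warning, could look roughly like the sketch below. Everything here is a hypothetical stand-in: `Config`, `DEFAULT_ACCUMULATOR`, and `select_configs` are illustrative names, and the table only mimics what `get_default_accumulator_element_type` is described as returning in the thread (its real implementation is not shown here). The thread does not say whether a warning was actually added.

```python
import warnings
from dataclasses import dataclass

# Assumed mapping from operand type to default accumulator type (illustrative).
DEFAULT_ACCUMULATOR = {
    "f16": "f32", "bf16": "f32", "i8": "i32", "f32": "f32", "i32": "i32",
}


@dataclass(frozen=True)
class Config:
    """Hypothetical, stripped-down problem config."""
    operand_element_type: str


def select_configs(configs, raw_accumulators):
    """Default mode keeps every config; raw mode keeps only configs whose
    result type would differ from the default mode's result type."""
    if not raw_accumulators:
        return list(configs)
    selected = []
    for config in configs:
        accumulator = DEFAULT_ACCUMULATOR[config.operand_element_type]
        if accumulator == config.operand_element_type:
            # Already covered by the default mode, so skip it and say why.
            warnings.warn(
                f"skipping {config.operand_element_type} config: "
                "--raw_accumulators would not change its result type"
            )
            continue
        selected.append(config)
    return selected


configs = [Config("f16"), Config("f32"), Config("i8")]
print(select_configs(configs, raw_accumulators=False))  # all three configs
print(select_configs(configs, raw_accumulators=True))   # f16 and i8 only
```

With this shape the warning fires only for f32 or i32 operands under --raw_accumulators, and the default mode never drops f32 matmuls.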

gemmbench/gemm_bench.py

bjacob added 3 commits October 10, 2024 14:14

add a raw_accumulators flag

d5496ef

Signed-off-by: Benoit Jacob <[email protected]> r

add i8 square matmuls

bf31ed9

Signed-off-by: Benoit Jacob <[email protected]> x

Add --target=host_cpu

51db4b4

Signed-off-by: Benoit Jacob <[email protected]>

bjacob requested review from kuhar and saienduri October 10, 2024 18:26

kuhar reviewed Oct 10, 2024

kuhar reviewed Oct 10, 2024

kuhar reviewed Oct 10, 2024

review comments

0e1204f

bjacob requested a review from kuhar October 10, 2024 18:57

saienduri reviewed Oct 10, 2024

kuhar approved these changes Oct 10, 2024

kuhar merged commit 91f1260 into nod-ai:main Oct 10, 2024
1 check passed