Add support for i8 dtype, add --raw_accumulators flag, add --target=host_cpu for easy local testing. #22

Merged
merged 4 commits into from
Oct 10, 2024

Contributor

@bjacob bjacob commented Oct 10, 2024

A few unrelated things mixed in this PR, but they are separate commits if you'd prefer me to slice it into 3 PRs.

  1. Add a --raw_accumulators flag that drops the truncation of the results (default False). This lowers arithmetic intensity (the untruncated result elements are wider, so more bytes are written) and can either raise or lower performance. It is less representative of real workloads, but is sometimes easier to reason about as a microbenchmark.
  2. Add support for i8 dtype accumulating into i32. For now only added to the square problem set. Also added bf16 to that set.
  3. Add a special value for the existing --target flag: "host_cpu" for testing on CPU configured for the host. This was mostly for my own use to be able to develop these changes locally without a GPU.
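
To make point 1 concrete, here is a minimal sketch of how dropping the truncation changes arithmetic intensity. It assumes arithmetic intensity is 2·M·N·K flops divided by the bytes of A, B and the result, and that the default mode truncates the result back to the operand type. The helper name `arithmetic_intensity` and the `BYTES` table are illustrative, not taken from gemmbench. Under those assumptions the values agree with the arithmetic_intensity column of the sample CSVs posted in this PR.

```python
# Bytes per element for the dtypes discussed in this PR (illustrative table).
BYTES = {"i8": 1, "f16": 2, "bf16": 2, "i32": 4, "f32": 4}


def arithmetic_intensity(m, n, k, operand_dtype, result_dtype):
    """Flops per byte, assuming A, B and the result are each moved once."""
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n) * BYTES[operand_dtype] + m * n * BYTES[result_dtype]
    return flops / bytes_moved


# Default mode (assumed): result truncated back to the operand type.
print(round(arithmetic_intensity(128, 128, 128, "f16", "f16"), 4))  # 42.6667
print(round(arithmetic_intensity(128, 128, 128, "i8", "i8"), 4))    # 85.3333

# --raw_accumulators: result stays in the accumulator type.
print(round(arithmetic_intensity(128, 128, 128, "f16", "f32"), 4))  # 32.0
print(round(arithmetic_intensity(128, 128, 128, "i8", "i32"), 4))   # 42.6667
```

The wider result only affects the M·N output term, so the relative drop is the same at every square size: 25% for f16/bf16 and 50% for i8.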

Signed-off-by: Benoit Jacob <[email protected]>

r
Signed-off-by: Benoit Jacob <[email protected]>

x
Signed-off-by: Benoit Jacob <[email protected]>
@bjacob bjacob requested review from kuhar and saienduri October 10, 2024 18:26
Contributor Author

bjacob commented Oct 10, 2024

Sample results on CPU:

benoit @ hocher: ~/iree-kernel-benchmark  raw_accumulators
$ cat results/iree_gemm.csv                           
index,tag,name,vmfb_hash,M,N,K,dtype,tA,tB,mean_microseconds,arithmetic_intensity,tflops,ok
0,square,gemm_128_128_128_f16_f32_tB,b09382fea54ae5536739e5df4596fee2,128,128,128,f16,N,T,63.0,42.6667,0.0666,True
1,square,gemm_256_256_256_f16_f32_tB,6df37abbc05238fb60a9b969259c131a,256,256,256,f16,N,T,116.0,85.3333,0.2893,True
2,square,gemm_512_512_512_f16_f32_tB,075fd059b12944e4f3ad621cf7082b45,512,512,512,f16,N,T,490.0,170.6667,0.5478,True
3,square,gemm_1024_1024_1024_f16_f32_tB,fe57ad03be60b58e160bd33ed2770a55,1024,1024,1024,f16,N,T,3440.0,341.3333,0.6243,True
4,square,gemm_2048_2048_2048_f16_f32_tB,b41f1f06fabe60fe73d30540b6446589,2048,2048,2048,f16,N,T,26900.0,682.6667,0.6387,True
5,square,gemm_4096_4096_4096_f16_f32_tB,be1ae424c5c36aa6bc581e9407250edf,4096,4096,4096,f16,N,T,219000.0,1365.3333,0.6276,True
6,square,gemm_8192_8192_8192_f16_f32_tB,22d05d920d341a2309111188e4578149,8192,8192,8192,f16,N,T,1743000.0,2730.6667,0.6308,True
7,square,gemm_128_128_128_bf16_f32_tB,5c155dc0035b4ddf883144fe03c6c243,128,128,128,bf16,N,T,59.0,42.6667,0.0711,True
8,square,gemm_256_256_256_bf16_f32_tB,912d26a5940042211334afacfc48cf42,256,256,256,bf16,N,T,79.0,85.3333,0.4247,True
9,square,gemm_512_512_512_bf16_f32_tB,972fea5500a2b7cf5dadaf18d7cb3f47,512,512,512,bf16,N,T,174.0,170.6667,1.5427,True
10,square,gemm_1024_1024_1024_bf16_f32_tB,3da6884a855976ca5ae015b5dda02fc9,1024,1024,1024,bf16,N,T,941.0,341.3333,2.2821,True
11,square,gemm_2048_2048_2048_bf16_f32_tB,93e08e504003e04ff4fa84c0db38098b,2048,2048,2048,bf16,N,T,7930.0,682.6667,2.1664,True
12,square,gemm_4096_4096_4096_bf16_f32_tB,db70d3fbf07a4a316e76ced74d95fcc6,4096,4096,4096,bf16,N,T,66800.0,1365.3333,2.0575,True
13,square,gemm_8192_8192_8192_bf16_f32_tB,d2de3b120dadf4ca63de37489371da6e,8192,8192,8192,bf16,N,T,650000.0,2730.6667,1.6916,True
14,square,gemm_128_128_128_i8_i32_tB,9ca12aded503bffcc03be47c990a7345,128,128,128,i8,N,T,51.0,85.3333,0.0822,True
15,square,gemm_256_256_256_i8_i32_tB,208edce214027caf19bf2d9c775bc2e1,256,256,256,i8,N,T,81.0,170.6667,0.4143,True
16,square,gemm_512_512_512_i8_i32_tB,09e78b7cbe1c28f124baa3cf7dbdbf00,512,512,512,i8,N,T,181.0,341.3333,1.4831,True
17,square,gemm_1024_1024_1024_i8_i32_tB,9eae5fb97ce4f416fd487221078c1450,1024,1024,1024,i8,N,T,952.0,682.6667,2.2558,True
18,square,gemm_2048_2048_2048_i8_i32_tB,f3503347e2b1ac1754f5cd50a5170a4e,2048,2048,2048,i8,N,T,7090.0,1365.3333,2.4231,True
19,square,gemm_4096_4096_4096_i8_i32_tB,a3e70f8a23d081ec359608e23deb41b8,4096,4096,4096,i8,N,T,58300.0,2730.6667,2.3574,True
20,square,gemm_8192_8192_8192_i8_i32_tB,ec6e7fdf34a2a9632653e0b10f0b84e7,8192,8192,8192,i8,N,T,527000.0,5461.3333,2.0864,True

benoit @ hocher: ~/iree-kernel-benchmark  raw_accumulators
$ cat results/iree_gemm_raw_accumulators.csv 
index,tag,name,vmfb_hash,M,N,K,dtype,tA,tB,mean_microseconds,arithmetic_intensity,tflops,ok
0,square,gemm_128_128_128_f16_f32_tB,530e2d0d287e2461961c9cb047ed38d4,128,128,128,f16,N,T,56.0,32.0,0.0749,True
1,square,gemm_256_256_256_f16_f32_tB,58a1cd35216af5e482523010f4c76e54,256,256,256,f16,N,T,117.0,64.0,0.2868,True
2,square,gemm_512_512_512_f16_f32_tB,85d830845451a69c7b1d69d1d8207e1c,512,512,512,f16,N,T,490.0,128.0,0.5478,True
3,square,gemm_1024_1024_1024_f16_f32_tB,07f4a19c3f07416577cc4c39cbabc541,1024,1024,1024,f16,N,T,3430.0,256.0,0.6261,True
4,square,gemm_2048_2048_2048_f16_f32_tB,4f4f0f96965ed217c5ae70f70154ed62,2048,2048,2048,f16,N,T,26900.0,512.0,0.6387,True
5,square,gemm_4096_4096_4096_f16_f32_tB,fd15ad1301a38b38cf56ac6f727f3f5d,4096,4096,4096,f16,N,T,220000.0,1024.0,0.6247,True
6,square,gemm_8192_8192_8192_f16_f32_tB,0888464d30b0e1cbcef0d6be1777abb9,8192,8192,8192,f16,N,T,1745000.0,2048.0,0.6301,True
7,square,gemm_128_128_128_bf16_f32_tB,d33431aa67196f1afc3791d1b910f018,128,128,128,bf16,N,T,64.0,32.0,0.0655,True
8,square,gemm_256_256_256_bf16_f32_tB,6140e642a5c1d02029689b6e77682b26,256,256,256,bf16,N,T,82.0,64.0,0.4092,True
9,square,gemm_512_512_512_bf16_f32_tB,dddd7767d6408d75bb258dd1a0a4eff7,512,512,512,bf16,N,T,188.0,128.0,1.4278,True
10,square,gemm_1024_1024_1024_bf16_f32_tB,d991a20720d289cdf181f92d806d4f94,1024,1024,1024,bf16,N,T,972.0,256.0,2.2093,True
11,square,gemm_2048_2048_2048_bf16_f32_tB,4e0559db31352a81f93bec7ab8f1a9a5,2048,2048,2048,bf16,N,T,7410.0,512.0,2.3185,True
12,square,gemm_4096_4096_4096_bf16_f32_tB,a76d157cf2087c588c838e8343ec55c7,4096,4096,4096,bf16,N,T,68800.0,1024.0,1.9977,True
13,square,gemm_8192_8192_8192_bf16_f32_tB,de0a529d82abf0c907c9e2bf03f5dafd,8192,8192,8192,bf16,N,T,1287000.0,2048.0,0.8543,True
14,square,gemm_128_128_128_i8_i32_tB,009529a28240bf142d9004ad08ec44b4,128,128,128,i8,N,T,65.0,42.6667,0.0645,True
15,square,gemm_256_256_256_i8_i32_tB,f723d5d9684ff65be3e5161247d0d5ef,256,256,256,i8,N,T,82.0,85.3333,0.4092,True
16,square,gemm_512_512_512_i8_i32_tB,727afd44598b1e7c895cefa2621ad617,512,512,512,i8,N,T,187.0,170.6667,1.4355,True
17,square,gemm_1024_1024_1024_i8_i32_tB,2e620ec57eb0638da7eb2ce22c6e0c22,1024,1024,1024,i8,N,T,962.0,341.3333,2.2323,True
18,square,gemm_2048_2048_2048_i8_i32_tB,2a5cbc969650f2a73689f90d7f03e9ac,2048,2048,2048,i8,N,T,7650.0,682.6667,2.2457,True
19,square,gemm_4096_4096_4096_i8_i32_tB,2ced5edd993aaf9a6ca3bf474c15372e,4096,4096,4096,i8,N,T,60100.0,1365.3333,2.2868,True
20,square,gemm_8192_8192_8192_i8_i32_tB,0a7e72d4968bba0da2bf66fcbba17809,8192,8192,8192,i8,N,T,468000.0,2730.6667,2.3494,True
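
For reading the tables above, the tflops column can be reproduced from M, N, K and mean_microseconds. The sketch below assumes the usual 2·M·N·K operation count for a matmul; the function name `tflops` is illustrative and not necessarily how gemmbench computes it.

```python
def tflops(m, n, k, mean_microseconds):
    """Tera-operations per second, assuming 2*M*N*K operations per matmul."""
    ops = 2 * m * n * k
    seconds = mean_microseconds * 1e-6
    return ops / seconds / 1e12


# Row 0 of results/iree_gemm.csv (f16, 128x128x128, 63.0 us).
print(round(tflops(128, 128, 128, 63.0), 4))  # 0.0666

# Row 13 of both files (bf16, 8192x8192x8192): default vs raw accumulators.
print(round(tflops(8192, 8192, 8192, 650000.0), 4))   # 1.6916
print(round(tflops(8192, 8192, 8192, 1287000.0), 4))  # 0.8543
```

The last two lines show the largest difference between the two runs: the bf16 8192 case takes roughly twice as long with raw accumulators, while most other rows are within a few percent of each other.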

@bjacob bjacob requested a review from kuhar October 10, 2024 18:57
# The raw_accumulators arg means "test configs where the result element
# type is different from what it would be in the default mode".
# We can't just test for (result_element_type == accumulator_element_type),
# as that would cause e.g. f32 matmuls to be omitted in the default mode.
Contributor

@saienduri saienduri Oct 10, 2024

Should we add a warning here telling the user that they are doing a raw_accumulators run where config.operand_element_type == get_default_accumulator_element_type(config.operand_element_type), which we don't run, so that they aren't confused?

Contributor Author

Such a warning would print every time the user passes --raw_accumulators, right? I was thinking that since this flag is non-default, it's OK for it to have the slightly surprising semantics of omitting the cases that are already covered by the default mode. I cared more about keeping the default mode unsurprising (including if we add f32 benchmarks in the future) and about avoiding overlap between the two modes, which would be wasteful when running both modes one after the other.

Contributor

It would only print if they are running --raw_accumulators with f32 or i32 input configs. In that case, it might be worth having a small print or warning to let them know those configs are being skipped, but I'm fine either way.
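
As a rough sketch of the behavior discussed in this thread (not the code in this PR): in raw_accumulators mode, a config is kept only if its result type would differ from the default mode, and an optional warning reports the skipped ones. The names `DEFAULT_ACCUMULATOR` and `should_run` are hypothetical, and the sketch assumes the default mode truncates the result to the operand type.

```python
import warnings

# Hypothetical stand-in for get_default_accumulator_element_type.
DEFAULT_ACCUMULATOR = {
    "f16": "f32", "bf16": "f32", "i8": "i32", "f32": "f32", "i32": "i32",
}


def should_run(operand_type, raw_accumulators, warn=False):
    """Decide whether a config with this operand type runs in the given mode."""
    if not raw_accumulators:
        # Default mode runs everything, including f32 matmuls.
        return True
    accumulator_type = DEFAULT_ACCUMULATOR[operand_type]
    # Assumption: the default result type equals the operand type, so the raw
    # accumulator result only differs when the accumulator is a different type.
    differs = accumulator_type != operand_type
    if not differs and warn:
        warnings.warn(
            f"--raw_accumulators: skipping {operand_type} config, "
            "already covered by the default mode"
        )
    return differs


print(should_run("f16", raw_accumulators=True))   # True
print(should_run("f32", raw_accumulators=True))   # False
print(should_run("f32", raw_accumulators=False))  # True
```

With `warn=True` the message is emitted only for f32 or i32 operand configs, which matches the case described above rather than printing on every --raw_accumulators run.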

Member

I went ahead and merged this because I want to use i8 in my experiments

@kuhar kuhar merged commit 91f1260 into nod-ai:main Oct 10, 2024
1 check passed
3 participants