AVX and AVX2 targets perform better with i32x4 width #16

Open
akb825 opened this issue Jul 19, 2022 · 2 comments

@akb825

akb825 commented Jul 19, 2022

The current CMakeLists.txt passes avx and avx2 as targets to ISPC, which default to avx1-i32x8 and avx2-i32x8 respectively. However, when timing the conversion of a very large input image, performance was significantly better with avx1-i32x4 and avx2-i32x4. On a laptop with an i7-8750H, avx1-i32x8 was almost 2x slower than avx1-i32x4, while avx2-i32x8 was ~20% slower than avx2-i32x4. On a desktop with a Ryzen 2700X, i32x8 was almost 2x slower than i32x4 for both avx1 and avx2.

A couple other interesting things of note:

  • On the i7, SSE4 and AVX1 performed identically.
  • On the Ryzen system, AVX2 was actually ~10% slower than AVX1, and AVX1 was very slightly (~2-3%) faster than SSE4. Some research seems to suggest that newer Ryzen models likely fare better with AVX2.

In other words, at least on the hardware I have available, switching to i32x4 for any AVX target gives a significant performance boost across the board. In my own project (which includes bc7enc_rdo as well as ISPCTextureCompressor for BC6H support) I switched to avx2-i32x4 for the AVX2 target and dropped the AVX1 target, since it wasn't different enough from SSE4 to justify the extra compile time and binary size.

Also worth noting: I saw a similar performance degradation on an M1 Mac when using neon-i32x8 vs. neon-i32x4, which is what prompted me to check the difference on x86.
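
For illustration, the change described above could look roughly like the CMake fragment below. This is a sketch only: the variable names `ISPC_TARGETS` and `ISPC_FLAGS` are hypothetical and not taken from the project's CMakeLists.txt, and the `-O2` flag and exact target list are assumptions based on the setup described in this comment (SSE4 kept, AVX1 dropped, AVX2 pinned to the 4-wide variant). Only the `--target` option and the explicit target names are existing ISPC features.

```cmake
# Hypothetical variable names; the project's actual CMakeLists.txt may differ.

# Before (as described above): generic ISA names, which ISPC resolves to
# the 8-wide variants avx1-i32x8 and avx2-i32x8.
# set(ISPC_TARGETS "avx,avx2")

# After: name the 4-wide variants explicitly. AVX1 is omitted here because
# it measured the same as SSE4 on the i7 in the timings above.
set(ISPC_TARGETS "sse4-i32x4,avx2-i32x4")

# Pass the list to the ISPC compiler via its --target option.
# -O2 is an assumed optimization level, not taken from the project.
set(ISPC_FLAGS -O2 "--target=${ISPC_TARGETS}")
```

Whether dropping AVX1 is appropriate depends on the hardware you need to support, so it is worth re-timing on your own machines before adopting this list.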

@richgel999
Owner

Thanks for this info - at the time this was written, SSE4 and AVX1 were the primary optimization targets. I examined the generated ispc code (of the inner loops) and tweaked it for AVX1 specifically. There are many possible ways of writing these inner loops, which could result in better perf for different targets.

@akb825
Author

akb825 commented Jul 19, 2022

I only mentioned AVX2 being slower on the Ryzen since it was unexpected, but I should clarify that on the i7, AVX2 is ~15-20% faster than AVX1. The i7 does follow the pattern that each newer target is measurably faster than the older ones, with the exception of AVX1 being equal to SSE4, at least as long as you use the i32x4 width explicitly for any version of AVX. AVX2 on my Ryzen is the only outlier, but some searching suggests that Zen1 (Ryzen 1000 and 2000 series) requires two cycles for AVX2 operations, while Zen2 (Ryzen 3000 and later) requires only one. In other words, this outlier is a hardware limitation of some older models.

I unfortunately don't own any hardware that supports AVX512, so I can't test any of those targets.
