AVX and AVX2 targets perform better with i32x4 width #16

akb825 · 2022-07-19T05:01:34Z

The current build in CMakeLists.txt passes avx and avx2 as targets to ISPC, which default to avx1-i32x8 and avx2-i32x8 respectively. However, when timing the conversion of a very large input image, performance was significantly better with avx1-i32x4 and avx2-i32x4. On a laptop with an i7-8750H, avx1-i32x8 was almost 2x slower than avx1-i32x4, while avx2-i32x8 was ~20% slower than avx2-i32x4. On a desktop with a Ryzen 2700X, i32x8 was almost 2x slower than i32x4 for both avx1 and avx2.

A couple other interesting things of note:

On the i7, SSE4 and AVX1 performed identically.
On the Ryzen system, AVX2 was actually ~10% slower than AVX1, and AVX1 was very slightly (~2-3%) faster than SSE4. Some research suggests that newer Ryzen models likely fare better with AVX2.

In other words, at least on the hardware I have available, switching to i32x4 for any AVX target gives a significant performance boost across the board. In my own project (which includes bc7enc_rdo as well as ISPCTextureCompressor for BC6H support), I switched to avx2-i32x4 for the AVX2 target and dropped the AVX1 target, since it wasn't different enough from SSE4 to justify the extra compile time and binary size.

It's also worth noting that I saw a similar performance degradation on an M1 Mac when using neon-i32x8 instead of neon-i32x4, which is what prompted me to check the difference on x86.
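For anyone who wants to try the explicit widths outside of CMake, here is a minimal sketch of building the ISPC command line with the 4-wide targets pinned. The helper name `ispc_command`, the `EXPLICIT_WIDTH` mapping, and the file names are hypothetical and not taken from the bc7enc_rdo build; the only assumptions about ISPC itself are the `--target`, `-O2`, `-o`, and `-h` options and the target names mentioned in this thread.

```python
# Hypothetical mapping from the generic target names to the explicit
# 4-wide variants discussed in this issue. Names not listed here
# (for example "sse4") are passed through unchanged.
EXPLICIT_WIDTH = {
    "avx": "avx1-i32x4",
    "avx2": "avx2-i32x4",
    "neon": "neon-i32x4",
}


def ispc_command(source, obj, header, targets):
    """Return an ispc argument list with explicit i32x4 targets."""
    explicit = [EXPLICIT_WIDTH.get(t, t) for t in targets]
    return [
        "ispc", source,
        "-O2",
        "--target=" + ",".join(explicit),  # comma-separated multi-target build
        "-o", obj,
        "-h", header,
    ]


if __name__ == "__main__":
    # File names are placeholders for illustration only.
    cmd = ispc_command("kernel.ispc", "kernel.o", "kernel_ispc.h", ["sse4", "avx2"])
    print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run` as-is. Whether 4-wide is actually faster on a given machine still needs to be measured, as the numbers above only cover the i7-8750H, the Ryzen 2700X, and the M1.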


richgel999 · 2022-07-19T05:05:46Z

Thanks for this info - at the time this was written, SSE4 and AVX1 were the primary optimization targets. I examined the generated ispc code (of the inner loops) and tweaked it for AVX1 specifically. There are many possible ways of writing these inner loops, which could result in better perf for different targets.

akb825 · 2022-07-19T09:07:51Z

I only mentioned AVX2 being slower on the Ryzen since it was unexpected, but I should clarify that on the i7, AVX2 is ~15-20% faster than AVX1. The i7 does follow the pattern that each newer target is measurably faster than the older ones, with the exception of AVX1 being equal to SSE4, at least as long as you use the i32x4 width explicitly for any version of AVX. AVX2 on my Ryzen is the only outlier, but some searching reveals that Zen1 (Ryzen 1000 and 2000 series) requires two cycles for AVX2 operations, while Zen2 (Ryzen 3000 and later) requires only one. In other words, this outlier is a hardware limitation of some older models.

I unfortunately don't own any hardware that supports AVX512, so I can't test any of those targets.
