LibCrypto: Improve GHash / GCM performance #24951

Open · MarekKnapek wants to merge 3 commits into master

Conversation

MarekKnapek (Contributor):

Before:

$ ./Build/lagom/bin/crypto-bench ghash
Benchmarking ghash...
Running benchmark for ghash with size 16 for ~3000ms...3001ms, 1360348 ops, 4.7 MiB/s
Running benchmark for ghash with size 1024 for ~3000ms...3001ms, 38981 ops, 12.3 MiB/s
Running benchmark for ghash with size 16384 for ~3000ms...3001ms, 2267 ops, 11.4 MiB/s
Running benchmark for ghash with size 262144 for ~3000ms...3021ms, 121 ops, 9.5 MiB/s
Running benchmark for ghash with size 1048576 for ~3000ms...3035ms, 33 ops, 10.4 MiB/s
Running benchmark for ghash with size 16777216 for ~3000ms...3888ms, 2 ops, 7.6 MiB/s
Algorithm            Size       Min us/op  Max us/op  Avg us/op  Throughput
ghash                16 B       2          569799     2          4.7 MiB   /s
ghash                1.0 KiB    53         608996     77         12.3 MiB  /s
ghash                16.0 KiB   833        686917     1323       11.4 MiB  /s
ghash                256.0 KiB  13665      792689     24964      9.5 MiB   /s
ghash                1.0 MiB    55907      760158     91949      10.4 MiB  /s
ghash                16.0 MiB   1764273    2122790    1943531    7.6 MiB   /s

After:

$ ./Build/lagom/bin/crypto-bench ghash
Benchmarking ghash...
Running benchmark for ghash with size 16 for ~3000ms...3001ms, 4387171 ops, 12.3 MiB/s
Running benchmark for ghash with size 1024 for ~3000ms...3001ms, 162412 ops, 51.4 MiB/s
Running benchmark for ghash with size 16384 for ~3000ms...3071ms, 5510 ops, 27.6 MiB/s
Running benchmark for ghash with size 262144 for ~3000ms...3003ms, 215 ops, 17.1 MiB/s
Running benchmark for ghash with size 1048576 for ~3000ms...3008ms, 105 ops, 34.3 MiB/s
Running benchmark for ghash with size 16777216 for ~3000ms...3337ms, 7 ops, 33.3 MiB/s
Algorithm            Size       Min us/op  Max us/op  Avg us/op  Throughput
ghash                16 B       1          322042     1          12.3 MiB  /s
ghash                1.0 KiB    13         287760     18         51.4 MiB  /s
ghash                16.0 KiB   194        750283     557        27.6 MiB  /s
ghash                256.0 KiB  3246       762895     13966      17.1 MiB  /s
ghash                1.0 MiB    13144      776244     28645      34.3 MiB  /s
ghash                16.0 MiB   237858     1063187    476582     33.3 MiB  /s

@MarekKnapek MarekKnapek requested a review from alimpfard as a code owner August 21, 2024 14:43
@github-actions github-actions bot added the 👀 pr-needs-review PR needs review from a maintainer or community member label Aug 21, 2024
MarekKnapek (Contributor, Author):

Before:

$ ./Build/lagom/bin/crypto-bench aes_128_gcm
Benchmarking aes_128_gcm...
Running benchmark for aes_128_gcm with size 16 for ~3000ms...4396ms, 3 ops, 0 B/s
Running benchmark for aes_128_gcm with size 1024 for ~3000ms...4058ms, 3 ops, 0 B/s
Running benchmark for aes_128_gcm with size 16384 for ~3000ms...4135ms, 3 ops, 0 B/s
Running benchmark for aes_128_gcm with size 262144 for ~3000ms...3598ms, 3 ops, 0 B/s
Running benchmark for aes_128_gcm with size 1048576 for ~3000ms...4366ms, 3 ops, 0 B/s
Running benchmark for aes_128_gcm with size 16777216 for ~3000ms...3808ms, 2 ops, 7.6 MiB/s
Algorithm            Size       Min us/op  Max us/op  Avg us/op  Throughput
aes_128_gcm          16 B       978709     1773534    1465106    0 B       /s
aes_128_gcm          1.0 KiB    1017568    1578794    1352377    0 B       /s
aes_128_gcm          16.0 KiB   931093     1655663    1378016    0 B       /s
aes_128_gcm          256.0 KiB  977305     1600469    1199293    0 B       /s
aes_128_gcm          1.0 MiB    1384175    1584318    1455297    0 B       /s
aes_128_gcm          16.0 MiB   1463492    2343490    1903491    7.6 MiB   /s

After:

$ ./Build/lagom/bin/crypto-bench aes_128_gcm
Benchmarking aes_128_gcm...
Running benchmark for aes_128_gcm with size 16 for ~3000ms...3259ms, 11 ops, 0 B/s
Running benchmark for aes_128_gcm with size 1024 for ~3000ms...3043ms, 12 ops, 0 B/s
Running benchmark for aes_128_gcm with size 16384 for ~3000ms...3119ms, 8 ops, 0 B/s
Running benchmark for aes_128_gcm with size 262144 for ~3000ms...3067ms, 10 ops, 0 B/s
Running benchmark for aes_128_gcm with size 1048576 for ~3000ms...3169ms, 7 ops, 1.9 MiB/s
Running benchmark for aes_128_gcm with size 16777216 for ~3000ms...3054ms, 8 ops, 41.0 MiB/s
Algorithm            Size       Min us/op  Max us/op  Avg us/op  Throughput
aes_128_gcm          16 B       219750     556014     296251     0 B       /s
aes_128_gcm          1.0 KiB    211237     336883     253507     0 B       /s
aes_128_gcm          16.0 KiB   222038     741356     389813     0 B       /s
aes_128_gcm          256.0 KiB  218191     824766     306629     0 B       /s
aes_128_gcm          1.0 MiB    220812     1020914    452577     1.9 MiB   /s
aes_128_gcm          16.0 MiB   268202     1003844    381701     41.0 MiB  /s

alimpfard (Member) left a comment:

I would really, really appreciate some comments on how this blob functions :)

Nice work!

Hendiadyoin1 (Contributor):

I agree with Ali; it is currently a bit hard to read and understand what's going on.

It generally looks a lot like it's based on a vector algorithm. Here's a rough draft of what I was able to come up with for the clmul part: https://godbolt.org/z/G46T8KGr7

MarekKnapek (Contributor, Author):

> I would really, really appreciate some comments on how this blob functions :)
>
> Nice work!

Added some comments.

Hendiadyoin1 (Contributor) left a comment:

Some comments,
some rambling

Comment on lines +110 to +114
struct MakeVectorImpl {
    using Type __attribute__((vector_size(sizeof(T) * element_count))) = T;
};
Hendiadyoin1 (Contributor):

Are you sure this works on GCC? I had issues with dependent vector sizes there.

MarekKnapek (Contributor, Author):

Yes, this works on my machine, Ubuntu with GCC. Source: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88600#c1
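
For reference, a minimal sketch of the dependent vector_size pattern under discussion; the quoted diff only shows the struct body, so the template head and the MakeVector alias here are assumptions:

```cpp
#include <cstddef>

// Sketch only: a vector type whose width depends on template parameters.
// Per the linked GCC bugzilla comment, GCC accepts a dependent vector_size
// in this position; the template head is assumed, not quoted from the PR.
template<typename T, std::size_t element_count>
struct MakeVectorImpl {
    using Type __attribute__((vector_size(sizeof(T) * element_count))) = T;
};

template<typename T, std::size_t element_count>
using MakeVector = typename MakeVectorImpl<T, element_count>::Type;

using u32x4 = MakeVector<unsigned, 4>; // 128-bit vector of four 32-bit lanes
```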

template<SIMDVector T, size_t... Idx>
ALWAYS_INLINE static ElementOf<T> reduce_or_impl(T const& a, IndexSequence<Idx...> const&)
{
static_assert(is_power_of_two(vector_length<T>));
Hendiadyoin1 (Contributor):

Technically not a requirement, just a limitation of the generic implementation; Clang officially uses an even-odd pattern for their builtin.

MarekKnapek (Contributor, Author) commented Aug 22, 2024:

Yes, you are correct. But in practice the length is always a power of two, and I did not feel like implementing the general case. Might add a /* TODO */.

if constexpr (N == 1) {
return a[0] | a[1];
} else {
return reduce_or_impl(MakeVector<E, N> { (a[Idx])... }, MakeIndexSequence<N / 2>()) | reduce_or_impl(MakeVector<E, N> { (a[N + Idx])... }, MakeIndexSequence<N / 2>());
Hendiadyoin1 (Contributor):

Shouldn't (a[Idx] | ...) or similar work as well?

MarekKnapek (Contributor, Author):

Aaaaa, that would save me a lot of effort. Thank you, I will test it out.
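
For what it's worth, a minimal sketch of the fold-expression variant suggested above; helper names such as SIMDVector, ElementOf, vector_length, IndexSequence and MakeIndexSequence are taken from the quoted diff, their exact definitions are assumed:

```cpp
// Sketch only: replace the recursive halving with a unary fold over all lanes.
// Note the index sequence now spans every lane, unlike the halved sequence
// used by the recursive version, and no power-of-two restriction is needed.
template<SIMDVector T, size_t... Idx>
ALWAYS_INLINE static ElementOf<T> reduce_or_impl(T const& a, IndexSequence<Idx...> const&)
{
    return (a[Idx] | ...);
}

template<SIMDVector T>
ALWAYS_INLINE static ElementOf<T> reduce_or(T const& a)
{
    return reduce_or_impl(a, MakeIndexSequence<vector_length<T>>());
}
```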

}

template<SIMDVector T, size_t... Idx>
ALWAYS_INLINE static ElementOf<T> reduce_xor_impl(T const& a, IndexSequence<Idx...> const&)
Hendiadyoin1 (Contributor):

See both comments above

Comment on lines +330 to +334
# if __has_builtin(__builtin_reduce_or)
if (true) {
return __builtin_reduce_or(a);
} else
# endif
Hendiadyoin1 (Contributor):

Weird pattern; shouldn't if constexpr (...) work as well and be nicer?

Hendiadyoin1 (Contributor):

Also, in which case is __has_builtin not defined? We don't really care about MSVC.

MarekKnapek (Contributor, Author):

I was trying to fix this problem: https://github.com/SerenityOS/serenity/actions/runs/10498331182/job/29083055403#step:10:2402. It seems that the compiler doesn't like the if constexpr pattern in that specific case.

Hendiadyoin1 (Contributor):

Ah sure, then go through the preprocessor, but the #if defined __has_builtin should be redundant.

Hendiadyoin1 (Contributor):

Oh, scratch that: making it __has_builtin(__builtin_reduce_or) && DependentTrue<T> should make it work (there are weird rules around when a false constexpr branch is still checked).

*/
using namespace AK::SIMD;

static auto const rotate_left = [](u32x4 const& x) -> u32x4 {
Hendiadyoin1 (Contributor):

That's a right rotation, isn't it?

MarekKnapek (Contributor, Author):

I thought that 0x1234 << 4 == 0x2340 is a left shift and 0x1234 >> 4 == 0x0123 is a right shift. Here you write the digits in "big endian" order. In the case of u32x4 vec { 1, 2, 3, 4 }, you write the vector elements in "little endian" order. So the rotation is "right" on screen, but "left" with respect to the bits... I guess.

Hendiadyoin1 (Contributor):

Yeah, probably a point-of-view thing; maybe rename the helper to express the scope it's working on.
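
To illustrate the point-of-view issue with a concrete lane rotation (the helper's actual body is not shown in this thread, so this is an assumption):

```cpp
// Sketch only: rotate the four 32-bit lanes by one position.
// { 1, 2, 3, 4 } becomes { 2, 3, 4, 1 }: elements move toward index 0,
// which reads as "left" in element order but looks like "right" when the
// vector is printed lowest-index first, hence the naming confusion.
static auto const rotate_lanes = [](u32x4 const& x) -> u32x4 {
    return u32x4 { x[1], x[2], x[3], x[0] };
};
```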

Comment on lines +107 to +109
#if defined __has_builtin
# if __has_builtin(__builtin_ia32_pmuludq128)
if (true) {
Hendiadyoin1 (Contributor):

see above

Comment on lines +116 to +117
r1 = u64x2 { static_cast<u64>(a[0]) * static_cast<u64>(b[0]), static_cast<u64>(a[1]) * static_cast<u64>(b[1]) };
r2 = u64x2 { static_cast<u64>(a[2]) * static_cast<u64>(b[2]), static_cast<u64>(a[3]) * static_cast<u64>(b[3]) };
Hendiadyoin1 (Contributor):

Turns out return to<u64x4>(a) * to<u64x4>(b) has slightly better codegen (and emits pmuludq in my tests).

MarekKnapek (Contributor, Author):

This is copied and pasted from your suggestion. But OK, I will improve it.

Hendiadyoin1 (Contributor):

Yeah, on Discord I corrected myself; the link posted here has the other version. Also s/to/simd_cast/, I can't remember names.
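
A small sketch contrasting the two variants discussed in this thread; simd_cast is the AK::SIMD conversion helper named in the correction, while the wrapper function and its name are just for illustration:

```cpp
// Sketch only: widen u32 lanes to u64 and multiply lane-wise.
static u64x4 widening_mul(u32x4 const& a, u32x4 const& b)
{
    // Per-lane version from the quoted diff (two u64x2 halves):
    //   u64x2 r1 { u64(a[0]) * u64(b[0]), u64(a[1]) * u64(b[1]) };
    //   u64x2 r2 { u64(a[2]) * u64(b[2]), u64(a[3]) * u64(b[3]) };

    // Whole-vector version suggested above; reportedly lets the compiler
    // emit pmuludq on x86-64.
    return simd_cast<u64x4>(a) * simd_cast<u64x4>(b);
}
```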

Comment on lines +166 to +201
u32 aa[4];
u32 bb[4];
u32 ta[9];
u32 tb[9];
u32 tc[4];
u32 tu32[4];
u32 td[4];
u32 te[4];
u32 z[8];

aa[3] = _x[0];
aa[2] = _x[1];
aa[1] = _x[2];
aa[0] = _x[3];
bb[3] = _y[0];
bb[2] = _y[1];
bb[1] = _y[2];
bb[0] = _y[3];
ta[0] = aa[0];
ta[1] = aa[1];
ta[2] = ta[0] ^ ta[1];
ta[3] = aa[2];
ta[4] = aa[3];
ta[5] = ta[3] ^ ta[4];
ta[6] = ta[0] ^ ta[3];
ta[7] = ta[1] ^ ta[4];
ta[8] = ta[6] ^ ta[7];
tb[0] = bb[0];
tb[1] = bb[1];
tb[2] = tb[0] ^ tb[1];
tb[3] = bb[2];
tb[4] = bb[3];
tb[5] = tb[3] ^ tb[4];
tb[6] = tb[0] ^ tb[3];
tb[7] = tb[1] ^ tb[4];
tb[8] = tb[6] ^ tb[7];
Hendiadyoin1 (Contributor):

This part still feels odd; it looks like it might be vectors, but then the ta/tb fall out of place. Is there any neater way this can be described?

Hendiadyoin1 (Contributor):

Also, comments on how this works would be nice.

Hendiadyoin1 (Contributor):

I think you could also do Karatsuba with the clmul, if that isn't what's happening here. That's what the Intel white paper does with 128-bit-wide pclmulqdq, so two/four rounds of it should get us there.

MarekKnapek (Contributor, Author):

This is Karatsuba inspired by BearSSL.
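
For context, a minimal sketch of the Karatsuba identity over GF(2)[x] that the ta/tb precomputation above serves; clmul32 here is a deliberately naive reference multiply, not the constant-time approach BearSSL or this PR use:

```cpp
#include <stdint.h>

// Naive 32x32 -> 64 carryless (polynomial) multiply, for illustration only.
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; ++i)
        if ((b >> i) & 1)
            r ^= (uint64_t)a << i;
    return r;
}

// One Karatsuba level over GF(2)[x]: three sub-multiplies instead of four,
// because addition is XOR. The ta/tb arrays in the quoted diff precompute
// exactly these XORed half-operands, applied recursively over the four
// 32-bit words of each 128-bit operand (hence nine entries per side).
static void clmul64(uint32_t a_hi, uint32_t a_lo, uint32_t b_hi, uint32_t b_lo,
                    uint64_t& out_hi, uint64_t& out_lo)
{
    uint64_t lo  = clmul32(a_lo, b_lo);
    uint64_t hi  = clmul32(a_hi, b_hi);
    uint64_t mid = clmul32(a_lo ^ a_hi, b_lo ^ b_hi) ^ lo ^ hi;
    out_lo = lo ^ (mid << 32);
    out_hi = hi ^ (mid >> 32);
}
```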

Comment on lines +259 to +261
static_assert(is_power_of_two(vector_length<T>));
static_assert(vector_length<T> == sizeof...(Idx) * 2);

Hendiadyoin1 (Contributor):

There should also be requires clauses

MarekKnapek (Contributor, Author):

OK
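
Something along these lines, presumably; this is a declaration-only sketch of whichever reduce helper those static_asserts guard, with names taken from the quoted snippets:

```cpp
// Sketch only: express the preconditions as a requires clause instead of
// static_asserts inside the function body.
template<SIMDVector T, size_t... Idx>
requires(is_power_of_two(vector_length<T>) && vector_length<T> == sizeof...(Idx) * 2)
ALWAYS_INLINE static ElementOf<T> reduce_xor_impl(T const& a, IndexSequence<Idx...> const&);
```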

Comment on lines +107 to +109
#if defined __has_builtin
# if __has_builtin(__builtin_ia32_pmuludq128)
if (true) {
Hendiadyoin1 (Contributor):

Ah, forgot to mention it: this needs an x86 guard.
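
A minimal sketch of what such an architecture guard could look like; the type aliases mimic AK::SIMD but are defined locally here, and the function and its fallback are illustrative rather than the PR's actual code:

```cpp
// Sketch only: guard the x86-specific builtin behind architecture checks.
using u32x4 = unsigned int __attribute__((vector_size(16)));
using u64x2 = unsigned long long __attribute__((vector_size(16)));
using i32x4 = int __attribute__((vector_size(16)));

static u64x2 mul_even_lanes(u32x4 a, u32x4 b)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__SSE2__)
#    if __has_builtin(__builtin_ia32_pmuludq128)
    // pmuludq: widening multiply of lanes 0 and 2 into two 64-bit lanes.
    return (u64x2)__builtin_ia32_pmuludq128((i32x4)a, (i32x4)b);
#    endif
#endif
    // Portable fallback computing the same even-lane products.
    return u64x2 { (unsigned long long)a[0] * b[0], (unsigned long long)a[2] * b[2] };
}
```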

MarekKnapek (Contributor, Author):

OK

stale bot commented on Sep 18, Nov 5, Nov 27, and Dec 22, 2024:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions!

@stale stale bot added the stale label Dec 22, 2024