LibCrypto: Improve GHash / GCM performance #24951

Open · MarekKnapek wants to merge 3 commits into master

Conversation

MarekKnapek (Contributor):

Before:

$ ./Build/lagom/bin/crypto-bench ghash
Benchmarking ghash...
Running benchmark for ghash with size 16 for ~3000ms...3001ms, 1360348 ops, 4.7 MiB/s
Running benchmark for ghash with size 1024 for ~3000ms...3001ms, 38981 ops, 12.3 MiB/s
Running benchmark for ghash with size 16384 for ~3000ms...3001ms, 2267 ops, 11.4 MiB/s
Running benchmark for ghash with size 262144 for ~3000ms...3021ms, 121 ops, 9.5 MiB/s
Running benchmark for ghash with size 1048576 for ~3000ms...3035ms, 33 ops, 10.4 MiB/s
Running benchmark for ghash with size 16777216 for ~3000ms...3888ms, 2 ops, 7.6 MiB/s
Algorithm            Size       Min us/op  Max us/op  Avg us/op  Throughput
ghash                16 B       2          569799     2          4.7 MiB   /s
ghash                1.0 KiB    53         608996     77         12.3 MiB  /s
ghash                16.0 KiB   833        686917     1323       11.4 MiB  /s
ghash                256.0 KiB  13665      792689     24964      9.5 MiB   /s
ghash                1.0 MiB    55907      760158     91949      10.4 MiB  /s
ghash                16.0 MiB   1764273    2122790    1943531    7.6 MiB   /s

After:

$ ./Build/lagom/bin/crypto-bench ghash
Benchmarking ghash...
Running benchmark for ghash with size 16 for ~3000ms...3001ms, 4387171 ops, 12.3 MiB/s
Running benchmark for ghash with size 1024 for ~3000ms...3001ms, 162412 ops, 51.4 MiB/s
Running benchmark for ghash with size 16384 for ~3000ms...3071ms, 5510 ops, 27.6 MiB/s
Running benchmark for ghash with size 262144 for ~3000ms...3003ms, 215 ops, 17.1 MiB/s
Running benchmark for ghash with size 1048576 for ~3000ms...3008ms, 105 ops, 34.3 MiB/s
Running benchmark for ghash with size 16777216 for ~3000ms...3337ms, 7 ops, 33.3 MiB/s
Algorithm            Size       Min us/op  Max us/op  Avg us/op  Throughput
ghash                16 B       1          322042     1          12.3 MiB  /s
ghash                1.0 KiB    13         287760     18         51.4 MiB  /s
ghash                16.0 KiB   194        750283     557        27.6 MiB  /s
ghash                256.0 KiB  3246       762895     13966      17.1 MiB  /s
ghash                1.0 MiB    13144      776244     28645      34.3 MiB  /s
ghash                16.0 MiB   237858     1063187    476582     33.3 MiB  /s

@MarekKnapek MarekKnapek requested a review from alimpfard as a code owner August 21, 2024 14:43
@github-actions github-actions bot added the 👀 pr-needs-review PR needs review from a maintainer or community member label Aug 21, 2024
MarekKnapek (Contributor, Author):

Before:

$ ./Build/lagom/bin/crypto-bench aes_128_gcm
Benchmarking aes_128_gcm...
Running benchmark for aes_128_gcm with size 16 for ~3000ms...4396ms, 3 ops, 0 B/s
Running benchmark for aes_128_gcm with size 1024 for ~3000ms...4058ms, 3 ops, 0 B/s
Running benchmark for aes_128_gcm with size 16384 for ~3000ms...4135ms, 3 ops, 0 B/s
Running benchmark for aes_128_gcm with size 262144 for ~3000ms...3598ms, 3 ops, 0 B/s
Running benchmark for aes_128_gcm with size 1048576 for ~3000ms...4366ms, 3 ops, 0 B/s
Running benchmark for aes_128_gcm with size 16777216 for ~3000ms...3808ms, 2 ops, 7.6 MiB/s
Algorithm            Size       Min us/op  Max us/op  Avg us/op  Throughput
aes_128_gcm          16 B       978709     1773534    1465106    0 B       /s
aes_128_gcm          1.0 KiB    1017568    1578794    1352377    0 B       /s
aes_128_gcm          16.0 KiB   931093     1655663    1378016    0 B       /s
aes_128_gcm          256.0 KiB  977305     1600469    1199293    0 B       /s
aes_128_gcm          1.0 MiB    1384175    1584318    1455297    0 B       /s
aes_128_gcm          16.0 MiB   1463492    2343490    1903491    7.6 MiB   /s

After:

$ ./Build/lagom/bin/crypto-bench aes_128_gcm
Benchmarking aes_128_gcm...
Running benchmark for aes_128_gcm with size 16 for ~3000ms...3259ms, 11 ops, 0 B/s
Running benchmark for aes_128_gcm with size 1024 for ~3000ms...3043ms, 12 ops, 0 B/s
Running benchmark for aes_128_gcm with size 16384 for ~3000ms...3119ms, 8 ops, 0 B/s
Running benchmark for aes_128_gcm with size 262144 for ~3000ms...3067ms, 10 ops, 0 B/s
Running benchmark for aes_128_gcm with size 1048576 for ~3000ms...3169ms, 7 ops, 1.9 MiB/s
Running benchmark for aes_128_gcm with size 16777216 for ~3000ms...3054ms, 8 ops, 41.0 MiB/s
Algorithm            Size       Min us/op  Max us/op  Avg us/op  Throughput
aes_128_gcm          16 B       219750     556014     296251     0 B       /s
aes_128_gcm          1.0 KiB    211237     336883     253507     0 B       /s
aes_128_gcm          16.0 KiB   222038     741356     389813     0 B       /s
aes_128_gcm          256.0 KiB  218191     824766     306629     0 B       /s
aes_128_gcm          1.0 MiB    220812     1020914    452577     1.9 MiB   /s
aes_128_gcm          16.0 MiB   268202     1003844    381701     41.0 MiB  /s

alimpfard (Member) left a comment:

I would really, really appreciate some comments on how this blob functions :)

Nice work!

Hendiadyoin1 (Contributor):

I agree with Ali; it is currently a bit hard to read and understand what's going on.

It generally looks a lot like it's based on a vector algorithm. Here's a rough draft of what I was able to come up with for the clmul part: https://godbolt.org/z/G46T8KGr7

MarekKnapek (Contributor, Author):

> I would really, really appreciate some comments on how this blob functions :)
>
> Nice work!

Added some comments.

Hendiadyoin1 (Contributor) left a comment:

Some comments,
some rambling

Comment on lines +110 to +114
struct MakeVectorImpl {
    using Type __attribute__((vector_size(sizeof(T) * element_count))) = T;
};
Hendiadyoin1 (Contributor):

Are you sure this works on GCC? I had issues with dependent vector sizes there.

MarekKnapek (Contributor, Author):

Yes, this works on my machine, Ubuntu with GCC. Source: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88600#c1
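
For reference, a minimal sketch of the dependent vector_size pattern under discussion; the quoted diff only shows the struct body, so the template head and the MakeVector alias here are assumptions:

```cpp
#include <cstddef>

// Sketch only: a vector type whose width depends on template parameters.
// Per the linked GCC bugzilla comment, GCC accepts a dependent vector_size
// in this position; the template head is assumed, not quoted from the PR.
template<typename T, std::size_t element_count>
struct MakeVectorImpl {
    using Type __attribute__((vector_size(sizeof(T) * element_count))) = T;
};

template<typename T, std::size_t element_count>
using MakeVector = typename MakeVectorImpl<T, element_count>::Type;

using u32x4 = MakeVector<unsigned, 4>; // 128-bit vector of four 32-bit lanes
```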

template<SIMDVector T, size_t... Idx>
ALWAYS_INLINE static ElementOf<T> reduce_or_impl(T const& a, IndexSequence<Idx...> const&)
{
static_assert(is_power_of_two(vector_length<T>));
Hendiadyoin1 (Contributor):

Technically not a requirement, just a limitation of the generic implementation; Clang officially uses an even-odd pattern for their builtin.

MarekKnapek (Contributor, Author) commented Aug 22, 2024:

Yes, you are correct. But in practice the length is always a power of two, and I did not feel like implementing the general case. Might add a /* TODO */.

if constexpr (N == 1) {
return a[0] | a[1];
} else {
return reduce_or_impl(MakeVector<E, N> { (a[Idx])... }, MakeIndexSequence<N / 2>()) | reduce_or_impl(MakeVector<E, N> { (a[N + Idx])... }, MakeIndexSequence<N / 2>());
Hendiadyoin1 (Contributor):

Shouldn't (a[Idx] | ...) or similar work as well?

MarekKnapek (Contributor, Author):

Aaaaa, that would save me a lot of effort. Thank you, I will test it out.
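
For what it's worth, a minimal sketch of the fold-expression variant suggested above; helper names such as SIMDVector, ElementOf, vector_length, IndexSequence and MakeIndexSequence are taken from the quoted diff, their exact definitions are assumed:

```cpp
// Sketch only: replace the recursive halving with a unary fold over all lanes.
// Note the index sequence now spans every lane, unlike the halved sequence
// used by the recursive version, and no power-of-two restriction is needed.
template<SIMDVector T, size_t... Idx>
ALWAYS_INLINE static ElementOf<T> reduce_or_impl(T const& a, IndexSequence<Idx...> const&)
{
    return (a[Idx] | ...);
}

template<SIMDVector T>
ALWAYS_INLINE static ElementOf<T> reduce_or(T const& a)
{
    return reduce_or_impl(a, MakeIndexSequence<vector_length<T>>());
}
```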

}

template<SIMDVector T, size_t... Idx>
ALWAYS_INLINE static ElementOf<T> reduce_xor_impl(T const& a, IndexSequence<Idx...> const&)
Hendiadyoin1 (Contributor):

See both comments above

Comment on lines +330 to +334
# if __has_builtin(__builtin_reduce_or)
if (true) {
return __builtin_reduce_or(a);
} else
# endif
Hendiadyoin1 (Contributor):

Weird pattern; shouldn't if constexpr (...) work as well and be nicer?

Hendiadyoin1 (Contributor):

Also, in which case is __has_builtin not defined? We don't really care about MSVC.

MarekKnapek (Contributor, Author):

I was trying to fix this problem: https://github.com/SerenityOS/serenity/actions/runs/10498331182/job/29083055403#step:10:2402. It seems that the compiler doesn't like the if constexpr pattern in that specific case.

Hendiadyoin1 (Contributor):

Ah sure, then go through the preprocessor, but the #if defined __has_builtin should be redundant.

Hendiadyoin1 (Contributor):

Oh, scratch that: making it __has_builtin(__builtin_reduce_or) && DependentTrue<T> should make it work (there are weird rules around when a false constexpr branch is still checked).

*/
using namespace AK::SIMD;

static auto const rotate_left = [](u32x4 const& x) -> u32x4 {
Hendiadyoin1 (Contributor):

That's a right rotation, isn't it?

MarekKnapek (Contributor, Author):

I thought that 0x1234 << 4 == 0x2340 is a left shift and 0x1234 >> 4 == 0x0123 is a right shift. Here you write the digits in "big endian" order. In the case of u32x4 vec { 1, 2, 3, 4 }, you write the vector elements in "little endian" order. So the rotation is "right" on screen, but "left" with respect to the bits... I guess.

Hendiadyoin1 (Contributor):

Yeah, probably a point-of-view thing; maybe rename the helper to express the scope it's working on.
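
To illustrate the point-of-view issue with a concrete lane rotation (the helper's actual body is not shown in this thread, so this is an assumption):

```cpp
// Sketch only: rotate the four 32-bit lanes by one position.
// { 1, 2, 3, 4 } becomes { 2, 3, 4, 1 }: elements move toward index 0,
// which reads as "left" in element order but looks like "right" when the
// vector is printed lowest-index first, hence the naming confusion.
static auto const rotate_lanes = [](u32x4 const& x) -> u32x4 {
    return u32x4 { x[1], x[2], x[3], x[0] };
};
```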

Comment on lines +107 to +109
#if defined __has_builtin
# if __has_builtin(__builtin_ia32_pmuludq128)
if (true) {
Hendiadyoin1 (Contributor):

see above

Comment on lines +116 to +117
r1 = u64x2 { static_cast<u64>(a[0]) * static_cast<u64>(b[0]), static_cast<u64>(a[1]) * static_cast<u64>(b[1]) };
r2 = u64x2 { static_cast<u64>(a[2]) * static_cast<u64>(b[2]), static_cast<u64>(a[3]) * static_cast<u64>(b[3]) };
Hendiadyoin1 (Contributor):

Turns out return to<u64x4>(a) * to<u64x4>(b) has slightly better codegen (and emits pmuludq in my tests).

MarekKnapek (Contributor, Author):

This is copied and pasted from your suggestion. But OK, I will improve it.

Hendiadyoin1 (Contributor):

Yeah, on Discord I corrected myself; the link posted here has the other version. Also s/to/simd_cast/, I can't remember names.
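
A small sketch contrasting the two variants discussed in this thread; simd_cast is the AK::SIMD conversion helper named in the correction, while the wrapper function and its name are just for illustration:

```cpp
// Sketch only: widen u32 lanes to u64 and multiply lane-wise.
static u64x4 widening_mul(u32x4 const& a, u32x4 const& b)
{
    // Per-lane version from the quoted diff (two u64x2 halves):
    //   u64x2 r1 { u64(a[0]) * u64(b[0]), u64(a[1]) * u64(b[1]) };
    //   u64x2 r2 { u64(a[2]) * u64(b[2]), u64(a[3]) * u64(b[3]) };

    // Whole-vector version suggested above; reportedly lets the compiler
    // emit pmuludq on x86-64.
    return simd_cast<u64x4>(a) * simd_cast<u64x4>(b);
}
```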

Comment on lines +166 to +201
u32 aa[4];
u32 bb[4];
u32 ta[9];
u32 tb[9];
u32 tc[4];
u32 tu32[4];
u32 td[4];
u32 te[4];
u32 z[8];

aa[3] = _x[0];
aa[2] = _x[1];
aa[1] = _x[2];
aa[0] = _x[3];
bb[3] = _y[0];
bb[2] = _y[1];
bb[1] = _y[2];
bb[0] = _y[3];
ta[0] = aa[0];
ta[1] = aa[1];
ta[2] = ta[0] ^ ta[1];
ta[3] = aa[2];
ta[4] = aa[3];
ta[5] = ta[3] ^ ta[4];
ta[6] = ta[0] ^ ta[3];
ta[7] = ta[1] ^ ta[4];
ta[8] = ta[6] ^ ta[7];
tb[0] = bb[0];
tb[1] = bb[1];
tb[2] = tb[0] ^ tb[1];
tb[3] = bb[2];
tb[4] = bb[3];
tb[5] = tb[3] ^ tb[4];
tb[6] = tb[0] ^ tb[3];
tb[7] = tb[1] ^ tb[4];
tb[8] = tb[6] ^ tb[7];
Hendiadyoin1 (Contributor):

This part still feels odd; it looks like it might be vectors, but then the ta/tb fall out of place. Is there any neater way this can be described?

Hendiadyoin1 (Contributor):

Also, comments on how this works would be nice.

Hendiadyoin1 (Contributor):

I think you could also do Karatsuba with the clmul, if that isn't what's happening here. That's what the Intel white paper does with 128-bit-wide pclmulqdq, so two/four rounds of it should get us there.

MarekKnapek (Contributor, Author):

This is Karatsuba inspired by BearSSL.
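
For context, a minimal sketch of the Karatsuba identity over GF(2)[x] that the ta/tb precomputation above serves; clmul32 here is a deliberately naive reference multiply, not the constant-time approach BearSSL or this PR use:

```cpp
#include <stdint.h>

// Naive 32x32 -> 64 carryless (polynomial) multiply, for illustration only.
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; ++i)
        if ((b >> i) & 1)
            r ^= (uint64_t)a << i;
    return r;
}

// One Karatsuba level over GF(2)[x]: three sub-multiplies instead of four,
// because addition is XOR. The ta/tb arrays in the quoted diff precompute
// exactly these XORed half-operands, applied recursively over the four
// 32-bit words of each 128-bit operand (hence nine entries per side).
static void clmul64(uint32_t a_hi, uint32_t a_lo, uint32_t b_hi, uint32_t b_lo,
                    uint64_t& out_hi, uint64_t& out_lo)
{
    uint64_t lo  = clmul32(a_lo, b_lo);
    uint64_t hi  = clmul32(a_hi, b_hi);
    uint64_t mid = clmul32(a_lo ^ a_hi, b_lo ^ b_hi) ^ lo ^ hi;
    out_lo = lo ^ (mid << 32);
    out_hi = hi ^ (mid >> 32);
}
```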

Comment on lines +259 to +261
static_assert(is_power_of_two(vector_length<T>));
static_assert(vector_length<T> == sizeof...(Idx) * 2);

Hendiadyoin1 (Contributor):

There should also be requires clauses

MarekKnapek (Contributor, Author):

OK
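
Something along these lines, presumably; this is a declaration-only sketch of whichever reduce helper those static_asserts guard, with names taken from the quoted snippets:

```cpp
// Sketch only: express the preconditions as a requires clause instead of
// static_asserts inside the function body.
template<SIMDVector T, size_t... Idx>
requires(is_power_of_two(vector_length<T>) && vector_length<T> == sizeof...(Idx) * 2)
ALWAYS_INLINE static ElementOf<T> reduce_xor_impl(T const& a, IndexSequence<Idx...> const&);
```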

Comment on lines +107 to +109
#if defined __has_builtin
# if __has_builtin(__builtin_ia32_pmuludq128)
if (true) {
Hendiadyoin1 (Contributor):

Ah, forgot to mention it: this needs an x86 guard.
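
A minimal sketch of what such an architecture guard could look like; the type aliases mimic AK::SIMD but are defined locally here, and the function and its fallback are illustrative rather than the PR's actual code:

```cpp
// Sketch only: guard the x86-specific builtin behind architecture checks.
using u32x4 = unsigned int __attribute__((vector_size(16)));
using u64x2 = unsigned long long __attribute__((vector_size(16)));
using i32x4 = int __attribute__((vector_size(16)));

static u64x2 mul_even_lanes(u32x4 a, u32x4 b)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__SSE2__)
#    if __has_builtin(__builtin_ia32_pmuludq128)
    // pmuludq: widening multiply of lanes 0 and 2 into two 64-bit lanes.
    return (u64x2)__builtin_ia32_pmuludq128((i32x4)a, (i32x4)b);
#    endif
#endif
    // Portable fallback computing the same even-lane products.
    return u64x2 { (unsigned long long)a[0] * b[0], (unsigned long long)a[2] * b[2] };
}
```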

MarekKnapek (Contributor, Author):

OK

stale bot commented on Sep 18, Nov 5, Nov 27, and Dec 22, 2024:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions!

@stale stale bot added the stale label Dec 22, 2024