-
-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LibCrypto: Improve GHash / GCM performance #24951
base: master
Are you sure you want to change the base?
Conversation
Before:
After:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would really, really appreciate some comments on how this blob functions :)
Nice work!
I agree with Ali, It generally looks a lot like it's based on a vector algorithm, |
Added some comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments,
some rambling
struct MakeVectorImpl { | ||
using Type __attribute__((vector_size(sizeof(T) * element_count))) = T; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You sure this works on GCC?
I had issues with dependent vector sizes on there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this works on my machine, Ubuntu with GCC. Source: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88600#c1
template<SIMDVector T, size_t... Idx> | ||
ALWAYS_INLINE static ElementOf<T> reduce_or_impl(T const& a, IndexSequence<Idx...> const&) | ||
{ | ||
static_assert(is_power_of_two(vector_length<T>)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically not a requirement, just a limitation of the generic impl,
Clang officially uses an even-odd pattern for their builtin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are correct. But I in practice there is always power of two. And I was not feeling like implementing general case. Might add /*todo*/
.
if constexpr (N == 1) { | ||
return a[0] | a[1]; | ||
} else { | ||
return reduce_or_impl(MakeVector<E, N> { (a[Idx])... }, MakeIndexSequence<N / 2>()) | reduce_or_impl(MakeVector<E, N> { (a[N + Idx])... }, MakeIndexSequence<N / 2>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldnt (a[Idx]|...)
or similar work as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aaaaa, that would save me much effort. Thank you, will test it out.
} | ||
|
||
template<SIMDVector T, size_t... Idx> | ||
ALWAYS_INLINE static ElementOf<T> reduce_xor_impl(T const& a, IndexSequence<Idx...> const&) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See both comments above
# if __has_builtin(__builtin_reduce_or) | ||
if (true) { | ||
return __builtin_reduce_or(a); | ||
} else | ||
# endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird pattern, shouldnt if constexpr(...)
work as well and be nicer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also in which case is __has_builtin
not defined, we dont really care for MSVC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to fix this problem: https://github.com/SerenityOS/serenity/actions/runs/10498331182/job/29083055403#step:10:2402 it seems that compiler doesn't like the if constexpr
pattern in that specific case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sure, then go through the pre-processor, but the #if defined __has_builtin
should be redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh cross that, making it
__has_builtin(__builtin_reduce_or) && DependentTrue<T>
should make it work
(weird rules around when a false constexpr branch is checked)
*/ | ||
using namespace AK::SIMD; | ||
|
||
static auto const rotate_left = [](u32x4 const& x) -> u32x4 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats a rotation right, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that 0x1234 << 4 == 0x2340
is left shift and 0x1234 >> 4 == 0x0123
is right shift. Here you write the digits in "big endian" order. In case of u32x4 vec{ 1, 2, 3, 4 }
, you write the vector elements in "little endian" order. So the rotation is "right" on screen, but "left" regarding the bits...I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, probably a point of view thing,
maybe rename the helper to express the scope its working on
#if defined __has_builtin | ||
# if __has_builtin(__builtin_ia32_pmuludq128) | ||
if (true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above
r1 = u64x2 { static_cast<u64>(a[0]) * static_cast<u64>(b[0]), static_cast<u64>(a[1]) * static_cast<u64>(b[1]) }; | ||
r2 = u64x2 { static_cast<u64>(a[2]) * static_cast<u64>(b[2]), static_cast<u64>(a[3]) * static_cast<u64>(b[3]) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out return to<u64x4>(a)*to<u64x4>(b)
has slightly better codegen (and emits pmuluqd
in my tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is copy & pasta from your suggestion. But OK, will improve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, on discord I corrected my self, the link posted here has the other version
also s/to/simd_cast/
, I can't remember names
u32 aa[4]; | ||
u32 bb[4]; | ||
u32 ta[9]; | ||
u32 tb[9]; | ||
u32 tc[4]; | ||
u32 tu32[4]; | ||
u32 td[4]; | ||
u32 te[4]; | ||
u32 z[8]; | ||
|
||
aa[3] = _x[0]; | ||
aa[2] = _x[1]; | ||
aa[1] = _x[2]; | ||
aa[0] = _x[3]; | ||
bb[3] = _y[0]; | ||
bb[2] = _y[1]; | ||
bb[1] = _y[2]; | ||
bb[0] = _y[3]; | ||
ta[0] = aa[0]; | ||
ta[1] = aa[1]; | ||
ta[2] = ta[0] ^ ta[1]; | ||
ta[3] = aa[2]; | ||
ta[4] = aa[3]; | ||
ta[5] = ta[3] ^ ta[4]; | ||
ta[6] = ta[0] ^ ta[3]; | ||
ta[7] = ta[1] ^ ta[4]; | ||
ta[8] = ta[6] ^ ta[7]; | ||
tb[0] = bb[0]; | ||
tb[1] = bb[1]; | ||
tb[2] = tb[0] ^ tb[1]; | ||
tb[3] = bb[2]; | ||
tb[4] = bb[3]; | ||
tb[5] = tb[3] ^ tb[4]; | ||
tb[6] = tb[0] ^ tb[3]; | ||
tb[7] = tb[1] ^ tb[4]; | ||
tb[8] = tb[6] ^ tb[7]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part still feels odd, it looks like it might be vectors but then the ta/tb fall out of place,
any neater way this can be described?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also comments, on how this works would be nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could also do Karatsuba with the cmul, if that isn't whats happening here
Thats what the intel white paper does with 128 bit width pcmulqdq
, so two/four rounds of it should get us there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is Karatsuba inspired by BearSSL.
static_assert(is_power_of_two(vector_length<T>)); | ||
static_assert(vector_length<T> == sizeof...(Idx) * 2); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should also be requires
clauses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
#if defined __has_builtin | ||
# if __has_builtin(__builtin_ia32_pmuludq128) | ||
if (true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah forgot to mention it, this needs an x86 guard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! |
Before:
After: