Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More performance improvements with nssdb #137

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

simo5
Copy link
Member

@simo5 simo5 commented Dec 20, 2024

Using valgrind's callgrind tool I am analyzing the behavior of with the nssdb storage backend which shows particularly bad performance in pkcs11-provider's CI.

The first easy pick was to remove the Zeroize crate, which single-handedly accounted for a 50% of the time spent in pbkdf2 operations.

The second easy pick was to rewrite the HAM initialization to use fixed slices instead of vectors for the internal state.

The nssdb is still much slower than the others, due to higher use of the pbkdf2 function, yet nssdb uses the same and it is much faster.

Some playing around shows the "native" version is about 20/30 % slower than the OpenSSL one we use in fips mode ... but there is definitely more work to do to improve performance here.

@simo5 simo5 marked this pull request as draft December 20, 2024 01:37
src/misc.rs Show resolved Hide resolved
@simo5 simo5 marked this pull request as ready for review December 20, 2024 16:31
@simo5 simo5 requested a review from Jakuje December 20, 2024 16:31
@simo5
Copy link
Member Author

simo5 commented Dec 20, 2024

Ok I nailed the second biggest performance issue.
Amazing how much just the "native" HMAC initialization impacted performance.
Switching to slices shaved off another 50% of performance drop in the HMAc function.
Tested the code with a pkcs11-provider CI run and now kryoptic.nss tests are at most 4 times slower than NSS ones in the worst case, and generally close or at most twice as slow, including setup.
The rest of performance issues will probably need to be found in the actual storage layer, and will be handle in a future PR.

@simo5
Copy link
Member Author

simo5 commented Dec 20, 2024

Eh, sounds like I broke something :-D
Hold on

@simo5
Copy link
Member Author

simo5 commented Dec 20, 2024

Ok found a copy/paste bug in the reinit() function that caused the failures.
Btw I tested with a release build, and the performance in pkcs11-provider CI is very close to that of NSS, in the worst case I observed only 30% slower tests, but generally performance is on par and within margin of error of measurement.

This will allow to measure performance with some reliability.

Signed-off-by: Simo Sorce <[email protected]>
The Zeroize crate performance is crazy slow.
The search for a performance issue with the nssdb storage via
valgrind's callgrind revelead that 50% of the time in the HMAC code was
spent on zeroization because of the way the Zeroize crate is built.

Replace it everywhere with a simpler zeromem function that underneath
simply calls OPENSSL_cleanse() on the mutable slice passed in.

With this change alone the new test that performs 100 pbkdf2 calls went
from ~9s to ~4.5s on my laptop.

Signed-off-by: Simo Sorce <[email protected]>
The allocations and zip/iterations were causing another huge performance
hit (at least in debug mode).
Switch HMAC code to use slices instead, and recode the iterations in the
init code to be simpler.

Signed-off-by: Simo Sorce <[email protected]>
Copy link
Contributor

@Jakuje Jakuje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just nits.


use constant_time_eq::constant_time_eq;
use zeroize::Zeroize;

/* max algo right now is SHA3_224 with 144 byts blocksize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/* max algo right now is SHA3_224 with 144 byts blocksize,
/* max algo right now is SHA3_224 with 144 bytes blocksize,

/* max algo right now is SHA3_224 with 144 byts blocksize,
* use slightly larger for good measure (and alignment) */
const IPAD_INIT: [u8; 160] = [0x36; 160];
const OPAD_INIT: [u8; 160] = [0x5c; 160];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make the number 160 constant instead of using it on at least 5 different places here and below?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants