More performance improvements with nssdb #137

simo5 · 2024-12-20T01:36:48Z

Using valgrind's callgrind tool, I am analyzing the behavior of the nssdb storage backend, which shows particularly bad performance in pkcs11-provider's CI.

The first easy pick was to remove the Zeroize crate, which single-handedly accounted for 50% of the time spent in pbkdf2 operations.

The second easy pick was to rewrite the HMAC initialization to use fixed slices instead of vectors for the internal state.

The nssdb backend is still much slower than the others due to its heavier use of the pbkdf2 function, yet NSS uses the same function and is much faster.

Some playing around shows the "native" version is about 20-30% slower than the OpenSSL one we use in FIPS mode, so there is definitely more work to do to improve performance here.

simo5 · 2024-12-20T16:33:52Z

Ok, I nailed the second biggest performance issue.
Amazing how much just the "native" HMAC initialization impacted performance.
Switching to slices cut the time spent in the HMAC function by another 50%.
Tested the code with a pkcs11-provider CI run, and now kryoptic.nss tests are at most 4 times slower than the NSS ones in the worst case, and generally close or at most twice as slow, including setup.
The rest of the performance issues will probably need to be found in the actual storage layer, and will be handled in a future PR.

simo5 · 2024-12-20T16:46:16Z

Eh, sounds like I broke something :-D
Hold on

simo5 · 2024-12-20T18:09:27Z

Ok, found a copy/paste bug in the reinit() function that caused the failures.
Btw, I tested with a release build, and the performance in pkcs11-provider CI is very close to that of NSS: in the worst case I observed tests only 30% slower, but generally performance is on par and within the margin of measurement error.

This will allow measuring performance with some reliability. Signed-off-by: Simo Sorce <[email protected]>
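For illustration, a minimal timing harness in the spirit of the "100 key derivations" test could look like the sketch below. The name `time_iterations` and the closure-based shape are assumptions made for this example; the actual test in the PR is not shown in this conversation, and the operation being timed (a pbkdf2 call) is left to the caller.

```rust
use std::time::{Duration, Instant};

/// Run `op` exactly `iterations` times and return the total wall-clock time.
/// Hypothetical helper: not taken from the PR, only a sketch of the idea of
/// repeating one derivation many times so the measurement is less noisy.
pub fn time_iterations<F: FnMut()>(iterations: usize, mut op: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        op();
    }
    start.elapsed()
}
```

A caller would pass a closure that performs one key derivation, for example `time_iterations(100, || { /* one pbkdf2 call */ })`, and compare the returned duration before and after a change.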

The Zeroize crate performance is crazy slow. The search for a performance issue with the nssdb storage via valgrind's callgrind revealed that 50% of the time in the HMAC code was spent on zeroization because of the way the Zeroize crate is built. Replace it everywhere with a simpler zeromem function that underneath simply calls OPENSSL_cleanse() on the mutable slice passed in. With this change alone the new test that performs 100 pbkdf2 calls went from ~9s to ~4.5s on my laptop. Signed-off-by: Simo Sorce <[email protected]>
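As a rough sketch of what a slice-based wipe helper can look like: the commit describes `zeromem` as a thin wrapper over OPENSSL_cleanse(), but to keep this example free of OpenSSL bindings it substitutes volatile writes from the standard library. That substitution is an assumption of this example, not the PR's implementation.

```rust
use std::ptr;
use std::sync::atomic::{compiler_fence, Ordering};

/// Overwrite every byte of `buf` with zero.
/// Stand-in only: the PR's zeromem is described as calling
/// OPENSSL_cleanse(); this sketch uses volatile writes instead so the
/// compiler cannot drop the stores as dead code.
pub fn zeromem(buf: &mut [u8]) {
    for b in buf.iter_mut() {
        // SAFETY: `b` is a valid, aligned, exclusive reference to a u8.
        unsafe { ptr::write_volatile(b as *mut u8, 0) };
    }
    // Keep later memory accesses from being reordered before the wipe.
    compiler_fence(Ordering::SeqCst);
}
```

The caller passes any mutable byte slice, for example `zeromem(&mut key_material)`, where `key_material` is a hypothetical buffer holding secret data.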

The allocations and zip/iterations were causing another huge performance hit (at least in debug mode). Switch HMAC code to use slices instead, and recode the iterations in the init code to be simpler. Signed-off-by: Simo Sorce <[email protected]>
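The idea of replacing vectors and zipped iterators with fixed buffers can be sketched as follows. The `IPAD_INIT` and `OPAD_INIT` constants match the ones quoted in the review of src/native/hmac.rs; the function `init_pads`, its signature, and its error handling are hypothetical and only illustrate the technique. It assumes the key has already been reduced to at most one block, as HMAC requires for longer keys.

```rust
const IPAD_INIT: [u8; 160] = [0x36; 160];
const OPAD_INIT: [u8; 160] = [0x5c; 160];

/// Build the HMAC inner and outer pads in fixed-size stack buffers.
/// Only the first `block_size` bytes of each returned array are meaningful.
/// Hypothetical helper: returns None if the block size exceeds the buffer
/// or the key is longer than one block (the caller is assumed to hash
/// long keys beforehand).
pub fn init_pads(key: &[u8], block_size: usize) -> Option<([u8; 160], [u8; 160])> {
    if block_size > IPAD_INIT.len() || key.len() > block_size {
        return None;
    }
    // Copying the constants is a plain memcpy, no heap allocation involved.
    let mut ipad = IPAD_INIT;
    let mut opad = OPAD_INIT;
    // A simple indexed loop replaces the zip/iterator chain.
    for i in 0..key.len() {
        ipad[i] ^= key[i];
        opad[i] ^= key[i];
    }
    Some((ipad, opad))
}
```

Bytes past the end of the key keep the plain 0x36 and 0x5c values, which is the same result as XOR-ing against a zero-padded key.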

Jakuje

just nits.

Jakuje · 2024-12-21T21:02:12Z

src/native/hmac.rs


 use constant_time_eq::constant_time_eq;
-use zeroize::Zeroize;
+
+/* max algo right now is SHA3_224 with 144 byts blocksize,


Suggested change

-/* max algo right now is SHA3_224 with 144 byts blocksize,
+/* max algo right now is SHA3_224 with 144 bytes blocksize,

Jakuje · 2024-12-21T21:06:25Z

src/native/hmac.rs

+/* max algo right now is SHA3_224 with 144 byts blocksize,
+ * use slightly larger for good measure (and alignment) */
+const IPAD_INIT: [u8; 160] = [0x36; 160];
+const OPAD_INIT: [u8; 160] = [0x5c; 160];


Should we make the number 160 a constant instead of using it in at least 5 different places here and below?
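One way to act on this suggestion is sketched below. The constant name `MAX_BLOCK_SIZE` and the `PadState` struct are hypothetical and chosen only for this example; the conversation does not show whether or how the PR adopted the change.

```rust
/// Largest supported block size plus some slack, per the comment in the
/// diff (SHA3_224 uses a 144-byte block). Hypothetical name.
const MAX_BLOCK_SIZE: usize = 160;

const IPAD_INIT: [u8; MAX_BLOCK_SIZE] = [0x36; MAX_BLOCK_SIZE];
const OPAD_INIT: [u8; MAX_BLOCK_SIZE] = [0x5c; MAX_BLOCK_SIZE];

/// Hypothetical state struct: every buffer is sized by the same constant,
/// so changing the bound requires touching a single line.
pub struct PadState {
    pub ipad: [u8; MAX_BLOCK_SIZE],
    pub opad: [u8; MAX_BLOCK_SIZE],
}

impl PadState {
    pub fn new() -> Self {
        PadState {
            ipad: IPAD_INIT,
            opad: OPAD_INIT,
        }
    }
}
```

With a single named bound, supporting an algorithm with a larger block size later would mean updating one definition instead of every array type and initializer.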

simo5 marked this pull request as draft December 20, 2024 01:37

Jakuje reviewed Dec 20, 2024

simo5 marked this pull request as ready for review December 20, 2024 16:31

simo5 requested a review from Jakuje December 20, 2024 16:31

simo5 force-pushed the nssperf2 branch from 4ae5e1f to 1764158 Compare December 20, 2024 16:37

simo5 force-pushed the nssperf2 branch from 1764158 to 6dab1a0 Compare December 20, 2024 18:02

simo5 added 3 commits December 20, 2024 13:09

Add Test to calculate 100 key derivations

c3c4c87

This will allow measuring performance with some reliability. Signed-off-by: Simo Sorce <[email protected]>

simo5 force-pushed the nssperf2 branch from 6dab1a0 to 02cbc5a Compare December 20, 2024 18:09

Jakuje approved these changes Dec 21, 2024
