
ci: calibrate performance alert threshold #5070

Merged
merged 14 commits into main from edr/ci/keep-errors-in-benchmark on Apr 10, 2024

Conversation

agostbiro
Member

This PR contains the following changes:

  • Filter out eth_getBlockByNumber for synthetix, as it was causing GC pressure and a high max/min ratio (138%). The max/min ratio for synthetix is now down to 101%.
  • Increase the alert threshold to 115%. This is needed due to higher variance than previously measured, now that fewer iterations are used for small scenarios.
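For illustration, here is a minimal sketch of the kind of request filtering applied to the synthetix scenario. The file paths, the scenario format (a JSON array of JSON-RPC requests), and the function name are assumptions for the example, not the actual tooling:

```typescript
// Hypothetical sketch: drop eth_getBlockByNumber requests from a recorded
// scenario before replaying it. Assumes the scenario file is a JSON array of
// JSON-RPC request objects; the real scenario format may differ.
import { readFileSync, writeFileSync } from "node:fs";

interface JsonRpcRequest {
  jsonrpc: string;
  id: number | string;
  method: string;
  params?: unknown[];
}

function filterScenario(inputPath: string, outputPath: string): void {
  const requests: JsonRpcRequest[] = JSON.parse(readFileSync(inputPath, "utf8"));
  // Keep every request except the ones that were causing GC pressure.
  const filtered = requests.filter((req) => req.method !== "eth_getBlockByNumber");
  writeFileSync(outputPath, JSON.stringify(filtered));
}

filterScenario("scenarios/synthetix.json", "scenarios/synthetix-filtered.json");
```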

It is possible that there will be false positives even with a threshold of 115% for two reasons:

  • The max/min ratio on safe-contracts is 116.7%, but ignoring the slowest and fastest runs it's 112.2%; it's below 110% for all other scenarios except neptune-mutual.
  • The max/min ratio on neptune-mutual is 137.5%. This is very surprising, as on a previous variance measurement run of 30 iterations the max/min ratio on neptune-mutual was <105%. Ignoring the slowest and fastest runs, the max/min ratio of neptune-mutual is 102.8%, so the 137.5% ratio is a wild outlier.

I didn't want to set an alert threshold higher than 115% to accommodate the two cases above, since that would render the threshold useless for the other scenarios. So if we get a regression alert, we should keep this in mind and consult the benchmark-variance notebook to decide whether there is an actual regression.
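As a rough sketch of the kind of check described above (this is not the benchmark-variance notebook itself; the function name and the timing numbers are illustrative):

```typescript
// Illustrative only: compute the max/min ratio of benchmark timings, with an
// option to drop the single slowest and fastest run, mirroring the analysis
// described above.
function maxMinRatio(timingsMs: number[], trimExtremes = false): number {
  const sorted = [...timingsMs].sort((a, b) => a - b);
  const runs = trimExtremes ? sorted.slice(1, -1) : sorted;
  return runs[runs.length - 1] / runs[0];
}

// Example: a single wild outlier inflates the ratio unless the extremes are trimmed.
const neptuneMutual = [100, 101, 102, 102.8, 103, 75]; // hypothetical timings
console.log(maxMinRatio(neptuneMutual));       // ~1.37
console.log(maxMinRatio(neptuneMutual, true)); // ~1.03
```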

@agostbiro agostbiro added no changeset needed This PR doesn't require a changeset area:edr labels Apr 3, 2024
@agostbiro agostbiro requested review from fvictorio and Wodann April 3, 2024 17:59
@agostbiro agostbiro self-assigned this Apr 3, 2024

changeset-bot bot commented Apr 3, 2024

⚠️ No Changeset found

Latest commit: 1230e4d

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

vercel bot commented Apr 3, 2024

hardhat: ✅ Ready (preview deployment, updated Apr 10, 2024)

@agostbiro agostbiro had a problem deploying to github-action-benchmark April 3, 2024 17:59 — with GitHub Actions Failure
@github-actions github-actions bot added the status:ready This issue is ready to be worked on label Apr 3, 2024
@agostbiro agostbiro linked an issue Apr 3, 2024 that may be closed by this pull request
@Wodann
Member

Wodann commented Apr 3, 2024

I've asked this before but I'm not sure whether you actually ran the experiment: did you ever try running scenarios in Rust by using the Provider API of edr_napi? Was the variance still an issue in that case?

@agostbiro
Member Author

I've asked this before but I'm not sure whether you actually ran the experiment: did you ever try running scenarios in Rust by using the Provider API of edr_napi? Was the variance still an issue in that case?

No, how do you think that'd help?

@agostbiro
Member Author

Regarding the neptune-mutual variance, it's also useful to know that it has a varying number of failures between runs. While the fast outlier mentioned above had approximately the same number of failures as the slower runs, it's possible that the particular transactions it skipped take up more time.

@Wodann
Member

Wodann commented Apr 4, 2024

I've asked this before but I'm not sure whether you actually ran the experiment: did you ever try running scenarios in Rust by using the Provider API of edr_napi? Was the variance still an issue in that case?

No, how do you think that'd help?

Your hypothesis for variability was the JS garbage collector. We tried to mitigate that using configuration variables, but were unable to remove variability completely. By directly using the Provider API of edr_napi, you completely circumvent the GC (and JS), instead running the tests in pure Rust. The only thing we'd not incorporate in the benchmarks is the marshalling of strings, but if it removes our reliability problems it seems like a trade-off to consider.

@agostbiro
Member Author

Your hypothesis for variability was the JS garbage collector. We tried to mitigate that using configuration variables, but were unable to remove variability completely.

We've confirmed that for synthetix the cause of the variance is GC, and we have the max/min ratio down to 101% now for that scenario. We have two scenarios left with an observed max/min ratio over 110%:

  1. safe-contracts, which takes a very short time (~3.5s), so higher variance is to be expected.
  2. neptune-mutual, where we observed one outlier that was significantly faster than the rest. Ignoring this outlier, the max/min ratio was 102.8%. We know that there is variability in which RPC calls fail for neptune-mutual, so I think the most likely reason for the outlier is that significant work was skipped due to a rare failure.

By directly using the Provider API of edr_napi, you completely circumvent the GC (and JS), instead running the tests in pure Rust. The only thing we'd not incorporate in the benchmarks is the marshalling of strings, but if it removes our reliability problems it seems like a trade-off to consider.

I don't think it would solve the two issues above, and imo it's important to run the benchmarks the same way users do. If we want to bring the max/min ratio under 110% and eliminate false positives, we could:

  1. Run very short scenarios multiple times for each benchmark run and take the median.
  2. Exclude neptune-mutual and openzeppelin, where there is variability in which RPC calls fail.

I think the first fix is low impact, so I'm not inclined to do it until it gets too annoying. The second fix would be higher impact, but otoh neptune-mutual and openzeppelin are very useful for testing certain performance aspects (large state, deep reverts), so I'm reluctant to remove them.

@Wodann
Member

Wodann commented Apr 4, 2024

For me there are two fundamental flaws that need to be solved:

  1. "We know that there is variability in which RPC calls fail for neptune-mutual, so I think the most likely reason for the outlier is that significant work was skipped due to a rare failure."

     This signals that the benchmark is non-deterministic, which it shouldn't be. It either signals a bug in EDR or I'm missing something in my mental model that would rule out deterministic execution.

  2. Ramping up the tolerance to 115% or excluding (parts of) scenarios seems like a hack / cherry-picking that reduces the usefulness of the benchmarks. 115% allows large regressions to go unnoticed, and excluding scenarios makes us completely blind to their performance.

I like your proposed solution for reducing the variance in safe-contracts:

Run very short scenarios multiple times for each benchmark run and take the median.

For short benchmarks, you always run them N times and just store the aggregate to account for variance.

I'd be against the other proposal, because of the 2nd flaw that I listed above:

Exclude neptune-mutual and openzeppelin, where there is variability in which RPC calls fail.

I'd be in favour of figuring out the underlying cause and fixing the bug / correcting my mental model about provider calls being deterministic (except for timestamps). Even with timestamps affecting the block hashes, I'd expect the success/failure rates to be the same after a scenario has been recorded.

@agostbiro
Member Author

Re the openzeppelin / neptune-mutual failure variability issue: it was investigated in this PR and a follow-up task was created, so I'd say that's an issue independent of this PR.

@agostbiro
Member Author

Re the variance in short scenarios: I added a quick and dirty fix to run short scenarios repeatedly and report the median (rather than the sum, to keep results comparable). As a result, I reduced the alert threshold to 110%, which is conservative.
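A minimal sketch of the repeat-and-take-the-median idea, assuming a hypothetical runScenario callback and iteration count; the actual benchmark harness differs:

```typescript
// Hypothetical sketch of the "repeat short scenarios and report the median"
// approach; the real benchmark harness and scenario runner look different.
async function benchmarkShortScenario(
  runScenario: () => Promise<void>,
  iterations = 5
): Promise<number> {
  const timingsMs: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = process.hrtime.bigint();
    await runScenario();
    const end = process.hrtime.bigint();
    timingsMs.push(Number(end - start) / 1e6);
  }
  // Report the median instead of the sum so the numbers stay comparable with
  // runs that execute the scenario only once.
  const sorted = [...timingsMs].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}
```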

@Wodann
Member

Wodann commented Apr 4, 2024

Re the openzeppelin / neptune-mutual failure variability issue: it was investigated in this PR and a follow-up task was created, so I'd say that's an issue independent of this PR.

AFAICT, this is the last remaining issue to calibrate/fix the variance in benchmarks - based on the tracking issue. Which issue are you referring to?

Member

@Wodann Wodann left a comment

As a result I reduced the threshold to 110% which is conservative.

Nice 🎉

To me that's an acceptable threshold until we fix / figure out the determinism issue. So if we have a high-priority ticket to investigate that, this looks good to me.

Also, I just wanted to share my general thinking on this: I'd rather have some false-positive regression reports that require a manual look at the benchmark results than allow false-negative regressions to slip through.

@Wodann
Member

Wodann commented Apr 4, 2024

FYI, the snapshot testing is still failing on synthetix

@agostbiro
Member Author

Re the openzeppelin / neptune-mutual failure variability issue: it was investigated in this PR and a follow-up task was created, so I'd say that's an issue independent of this PR.

AFAICT, this is the last remaining issue to calibrate/fix the variance in benchmarks - based on the tracking issue. Which issue are you referring to?

The issue referenced in the linked PR.


@agostbiro agostbiro merged commit d8ec4d1 into main Apr 10, 2024
43 checks passed
@agostbiro agostbiro deleted the edr/ci/keep-errors-in-benchmark branch April 10, 2024 13:14
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 14, 2024
Labels
area:edr, no changeset needed (This PR doesn't require a changeset), status:ready (This issue is ready to be worked on)
Development

Successfully merging this pull request may close these issues.

Calibrate performance alert threshold
3 participants