
ci: calibrate performance alert threshold #5070

Merged
merged 14 commits into main from edr/ci/keep-errors-in-benchmark on Apr 10, 2024

Conversation

agostbiro
Member

This PR contains the following changes:

  • Filter out eth_getBlockByNumber for synthetix, as it was causing GC pressure and a high max/min ratio (138%). The max/min ratio for synthetix is now down to 101%.
  • Increase the alert threshold to 115%. This is needed due to higher variance than previously measured, now that fewer iterations are used for small scenarios.
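For illustration, here is a minimal sketch of the kind of request filtering applied to the synthetix scenario. The file paths, the scenario format (a JSON array of JSON-RPC requests), and the function name are assumptions for the example, not the actual tooling:

```typescript
// Hypothetical sketch: drop eth_getBlockByNumber requests from a recorded
// scenario before replaying it. Assumes the scenario file is a JSON array of
// JSON-RPC request objects; the real scenario format may differ.
import { readFileSync, writeFileSync } from "node:fs";

interface JsonRpcRequest {
  jsonrpc: string;
  id: number | string;
  method: string;
  params?: unknown[];
}

function filterScenario(inputPath: string, outputPath: string): void {
  const requests: JsonRpcRequest[] = JSON.parse(readFileSync(inputPath, "utf8"));
  // Keep every request except the ones that were causing GC pressure.
  const filtered = requests.filter((req) => req.method !== "eth_getBlockByNumber");
  writeFileSync(outputPath, JSON.stringify(filtered));
}

filterScenario("scenarios/synthetix.json", "scenarios/synthetix-filtered.json");
```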

It is possible that there will be false positives even with a threshold of 115% for two reasons:

  • The max/min ratio on safe-contracts is 116.7%, but ignoring the slowest and fastest runs it's 112.2%; it's below 110% for all other scenarios except neptune-mutual.
  • The max/min ratio on neptune-mutual is 137.5%. This is very surprising, as on a previous variance measurement run of 30 iterations the max/min ratio on neptune-mutual was <105%. Ignoring the slowest and fastest runs, the max/min ratio of neptune-mutual is 102.8%, so the 137.5% ratio is a wild outlier.

I didn't want to set an alert threshold higher than 115% to accommodate the two cases above, since that would render the threshold useless for the other scenarios. So if we get a regression alert, we should keep this in mind and consult the benchmark-variance notebook to decide whether there is an actual regression.
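As a rough sketch of the kind of check described above (this is not the benchmark-variance notebook itself; the function name and the timing numbers are illustrative):

```typescript
// Illustrative only: compute the max/min ratio of benchmark timings, with an
// option to drop the single slowest and fastest run, mirroring the analysis
// described above.
function maxMinRatio(timingsMs: number[], trimExtremes = false): number {
  const sorted = [...timingsMs].sort((a, b) => a - b);
  const runs = trimExtremes ? sorted.slice(1, -1) : sorted;
  return runs[runs.length - 1] / runs[0];
}

// Example: a single wild outlier inflates the ratio unless the extremes are trimmed.
const neptuneMutual = [100, 101, 102, 102.8, 103, 75]; // hypothetical timings
console.log(maxMinRatio(neptuneMutual));       // ~1.37
console.log(maxMinRatio(neptuneMutual, true)); // ~1.03
```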

@agostbiro agostbiro added no changeset needed This PR doesn't require a changeset area:edr labels Apr 3, 2024
@agostbiro agostbiro requested review from fvictorio and Wodann April 3, 2024 17:59
@agostbiro agostbiro self-assigned this Apr 3, 2024

changeset-bot bot commented Apr 3, 2024

⚠️ No Changeset found

Latest commit: 1230e4d

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

vercel bot commented Apr 3, 2024

hardhat: ✅ Ready (preview deployment, updated Apr 10, 2024)

@agostbiro agostbiro had a problem deploying to github-action-benchmark April 3, 2024 17:59 — with GitHub Actions Failure
@github-actions github-actions bot added the status:ready This issue is ready to be worked on label Apr 3, 2024
@agostbiro agostbiro linked an issue Apr 3, 2024 that may be closed by this pull request
@Wodann
Member

Wodann commented Apr 3, 2024

I've asked this before but I'm not sure whether you actually ran the experiment: did you ever try running scenarios in Rust by using the Provider API of edr_napi? Was the variance still an issue in that case?

@agostbiro
Member Author

I've asked this before but I'm not sure whether you actually ran the experiment: did you ever try running scenarios in Rust by using the Provider API of edr_napi? Was the variance still an issue in that case?

No, how do you think that'd help?

@agostbiro
Member Author

Regarding the neptune-mutual variance, it's also useful to know that it has a varying number of failures between runs. While the fast outlier mentioned above had approximately the same number of failures as the slower runs, it's possible that the particular transactions it skipped take up more time.

@Wodann
Member

Wodann commented Apr 4, 2024

I've asked this before but I'm not sure whether you actually ran the experiment: did you ever try running scenarios in Rust by using the Provider API of edr_napi? Was the variance still an issue in that case?

No, how do you think that'd help?

Your hypothesis for variability was the JS garbage collector. We tried to mitigate that using configuration variables, but were unable to remove variability completely. By directly using the Provider API of edr_napi, you completely circumvent the GC (and JS), instead running the tests in pure Rust. The only thing we'd not incorporate in the benchmarks is the marshalling of strings, but if it removes our reliability problems it seems like a trade-off to consider.

@agostbiro
Member Author

Your hypothesis for variability was the JS garbage collector. We tried to mitigate that using configuration variables, but were unable to remove variability completely.

We've confirmed that for synthetix the cause of the variance is GC, and we have the max/min ratio down to 101% now for that scenario. We have two scenarios left with an observed max/min ratio over 110%:

  1. safe-contracts, which takes a very short time (~3.5s), so higher variance is to be expected.
  2. neptune-mutual, where we observed one outlier that was significantly faster than the rest. Ignoring this outlier, the max/min ratio was 102.8%. We know that there is variability in which RPC calls fail for neptune-mutual, so I think the most likely reason for the outlier is that significant work was skipped due to a rare failure.

By directly using the Provider API of edr_napi, you completely circumvent the GC (and JS), instead running the tests in pure Rust. The only thing we'd not incorporate in the benchmarks is the marshalling of strings, but if it removes our reliability problems it seems like a trade-off to consider.

I don't think it would solve the two issues above, and imo it's important to run the benchmarks the same way users do. If we want to bring the max/min ratio under 110% and eliminate false positives, we could:

  1. Run very short scenarios multiple times for each benchmark run and take the median.
  2. Exclude neptune-mutual and openzeppelin, where there is variability in which RPC calls fail.

I think the first fix is low impact, so I'm not inclined to do it until it gets too annoying. The second fix would be higher impact, but otoh neptune-mutual and openzeppelin are very useful for testing certain performance aspects (large state, deep reverts), so I'm reluctant to remove them.

@Wodann
Member

Wodann commented Apr 4, 2024

For me there are two fundamental flaws that need to be solved:

  1. "We know that there is variability in which RPC calls fail for neptune-mutual, so I think the most likely reason for the outlier is that significant work was skipped due to a rare failure."

     This signals that the benchmark is non-deterministic, which it shouldn't be. It either signals a bug in EDR or I'm missing something in my mental model that would rule out deterministic execution.

  2. Ramping up the tolerance to 115% or excluding (parts of) scenarios seems like a hack / cherry-picking that reduces the usefulness of the benchmarks. 115% allows large regressions to go unnoticed, and excluding scenarios makes us completely blind to their performance.

I like your proposed solution for reducing the variance in safe-contracts:

Run very short scenarios multiple times for each benchmark run and take the median.

For short benchmarks, you always run them N times and just store the aggregate to account for variance.

I'd be against the other proposal, because of the 2nd flaw that I listed above:

Exclude neptune-mutual and openzeppelin, where there is variability in which RPC calls fail.

I'd be in favour of figuring out the underlying cause and fixing the bug / correcting my mental model about provider calls being deterministic (except for timestamps). Even with timestamps affecting the block hashes, I'd expect the success/failure rates to be the same after a scenario has been recorded.

@agostbiro
Member Author

Re the openzeppelin / neptune-mutual failure variability issue: it was investigated in this PR and a follow-up task was created, so I'd say that's an issue independent of this PR.

@agostbiro
Member Author

Re the variance in short scenarios: I added a quick and dirty fix to run short scenarios repeatedly and report the median (rather than the sum, to keep results comparable). As a result, I reduced the alert threshold to 110%, which is conservative.
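A minimal sketch of the repeat-and-take-the-median idea, assuming a hypothetical runScenario callback and iteration count; the actual benchmark harness differs:

```typescript
// Hypothetical sketch of the "repeat short scenarios and report the median"
// approach; the real benchmark harness and scenario runner look different.
async function benchmarkShortScenario(
  runScenario: () => Promise<void>,
  iterations = 5
): Promise<number> {
  const timingsMs: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = process.hrtime.bigint();
    await runScenario();
    const end = process.hrtime.bigint();
    timingsMs.push(Number(end - start) / 1e6);
  }
  // Report the median instead of the sum so the numbers stay comparable with
  // runs that execute the scenario only once.
  const sorted = [...timingsMs].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}
```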

@Wodann
Member

Wodann commented Apr 4, 2024

Re the openzeppelin / neptune-mutual failure variability issue: it was investigated in this PR and a follow-up task was created, so I'd say that's an issue independent of this PR.

AFAICT, this is the last remaining issue to calibrate/fix the variance in benchmarks - based on the tracking issue. Which issue are you referring to?

Member

@Wodann Wodann left a comment

As a result I reduced the threshold to 110% which is conservative.

Nice 🎉

To me that's an acceptable threshold until we fix / figure out the determinism issue. So if we have a high-priority ticket to investigate that, this looks good to me.

Also, I just wanted to share my general thinking on this: I'd rather have some false-positive regression reports that require a manual look at the benchmark results than allow false-negative regressions to slip through.

@Wodann
Member

Wodann commented Apr 4, 2024

FYI, the snapshot testing is still failing on synthetix

@agostbiro
Member Author

Re the openzeppelin / neptune-mutual failure variability issue: it was investigated in this PR and a follow-up task was created, so I'd say that's an issue independent of this PR.

AFAICT, this is the last remaining issue to calibrate/fix the variance in benchmarks - based on the tracking issue. Which issue are you referring to?

The issue referenced in the linked PR.


@agostbiro agostbiro merged commit d8ec4d1 into main Apr 10, 2024
43 checks passed
@agostbiro agostbiro deleted the edr/ci/keep-errors-in-benchmark branch April 10, 2024 13:14
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 14, 2024
Labels
area:edr, no changeset needed (This PR doesn't require a changeset), status:ready (This issue is ready to be worked on)
Development

Successfully merging this pull request may close these issues.

Calibrate performance alert threshold
3 participants