ci: calibrate performance alert threshold #5070
Conversation
I've asked this before, but I'm not sure whether you actually ran the experiment: did you ever try running scenarios in Rust by using the …
No, how do you think that'd help?
Regarding the `neptune-mutual` variance, it's also useful to know that it has a varying number of failures between runs. While the fast outlier mentioned above had approximately the same number of failures as the slower runs, it's possible that the particular transactions skipped take up more time.
Your hypothesis for the variability was the JS garbage collector. We tried to mitigate that using configuration variables, but were unable to remove the variability completely. By directly using the …
We've confirmed that for …
I don't think it would solve the two issues above, and imo it's important to run the benchmarks the same way users do. If we want to bring the max/min ratio under 110% and eliminate false positives, we could:
I think the first fix is low impact, so I'm not inclined to do it until it gets too annoying. The second fix would be higher impact, but otoh …
For me there are two fundamental flaws that need to be solved:
This signals that the benchmark is non-deterministic, which it shouldn't be. It either indicates a bug in EDR, or I'm missing something in my mental model that would rule out deterministic execution.
I like your proposed solution for reducing the variance in …
For short benchmarks, you always run them N times and just store the aggregate to account for variance. I'd be against the other proposal because of the 2nd flaw that I listed above:
I'd be in favour of figuring out the underlying cause and fixing the bug / correcting my mental model about provider calls being deterministic (except for timestamps). Even with timestamps affecting the block hashes, I'd expect the success/failure rates to be the same after a scenario has been recorded.
Re the …
Re the variance in short scenarios: I added a quick and dirty fix to run short scenarios repeatedly and report the median. It's the median rather than the sum, to keep results comparable. As a result, I reduced the threshold to 110%, which is conservative.
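For illustration, here is a minimal TypeScript sketch of that repeat-and-take-median idea; `runScenario` and the repeat count are hypothetical stand-ins, not the actual implementation in this PR:

```ts
// Sketch only: `runScenario` is a hypothetical stand-in for the real
// benchmark runner, and the repeat count is illustrative.
async function benchmarkShortScenario(
  runScenario: () => Promise<void>,
  repeats = 5
): Promise<number> {
  const samples: number[] = [];
  for (let i = 0; i < repeats; i++) {
    const start = process.hrtime.bigint();
    await runScenario();
    const end = process.hrtime.bigint();
    samples.push(Number(end - start) / 1e6); // elapsed milliseconds
  }
  // Report the median rather than the sum so the numbers stay comparable
  // with single-run measurements and outliers (e.g. GC pauses) are dampened.
  samples.sort((a, b) => a - b);
  const mid = Math.floor(samples.length / 2);
  return samples.length % 2 === 0
    ? (samples[mid - 1] + samples[mid]) / 2
    : samples[mid];
}
```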
AFAICT, this is the last remaining issue to calibrate/fix the variance in benchmarks - based on the tracking issue. Which issue are you referring to?
> As a result, I reduced the threshold to 110%, which is conservative.
Nice 🎉
To me, that's an acceptable threshold until we fix/figure out the determinism issue. So if we have a high-priority ticket to investigate that, this looks good to me.
Also, I just wanted to share my general thinking on this: I'd rather have some false-positive regression reports that require a manual look at the benchmark results than allow actual regressions to slip through as false negatives.
FYI, the snapshot testing is still failing on …
The issue referenced in the linked PR.
Merging this with two follow-ups: …
This PR contains the following changes:

- Removed `eth_getBlockByNumber` for `synthetix`, as it was causing GC pressure and a high max/min ratio (138%). The max/min ratio on `synthetix` is now down to 101%.

It is possible that there will be false positives even with a threshold of 115%, for two reasons:

- `safe-contracts` is 116.7%, but ignoring the slowest and fastest runs it's 112.2%, and it's below 110% for all other scenarios except `neptune-mutual`.
- `neptune-mutual` is 137.5%. This is very surprising, as on a previous variance measurement run of 30 iterations the max/min ratio on `neptune-mutual` was <105%. Ignoring the slowest and fastest runs, the max/min ratio of `neptune-mutual` is 102.8%, so the 137.5% ratio is a wild outlier.

I didn't want to set an alert threshold higher than 115% to accommodate the two cases above, since that would render the alert threshold useless for other scenarios. So if we get a regression alert, we should keep this in mind and consult the benchmark-variance notebook to decide if there is an actual regression.
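As a reference for reading the numbers above, here is a small TypeScript sketch of the max/min ratio and its "ignoring the slowest and fastest runs" variant; the sample values are made up, and the benchmark-variance notebook may compute this differently:

```ts
// Ratio of the slowest run to the fastest run (1.10 == 110%).
function maxMinRatio(samples: number[]): number {
  return Math.max(...samples) / Math.min(...samples);
}

// Same ratio after dropping the single slowest and single fastest run,
// corresponding to "ignoring the slowest and fastest runs" above.
function trimmedMaxMinRatio(samples: number[]): number {
  const trimmed = [...samples].sort((a, b) => a - b).slice(1, -1);
  return Math.max(...trimmed) / Math.min(...trimmed);
}

// Made-up example: a single wild outlier inflates the raw ratio
// but barely moves the trimmed one.
const runs = [100, 101, 102, 103, 137.5];
console.log(maxMinRatio(runs));        // 1.375
console.log(trimmedMaxMinRatio(runs)); // ~1.02
```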