-
Notifications
You must be signed in to change notification settings - Fork 543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rule Concurrency: Prevent flapping of concurrency #10189
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good given it fixes the immediately problem -- However, I'd prefer if we have 2 fields for the runtime - one without concurrency and one with concurrency that we'd only update once the group finishes evaluating.
I think this will make the runtimes more deterministic and it'll also help us see the difference between expected runtime vs actual runtime (if we expose it as a metric).
The complex part is that we need to do that in Prometheus.
We also discussed offline, and we're going to implement this change directly in Prometheus. We agreed that we should two things:
|
3c5ae76
to
7b03ca1
Compare
Can you please add a changelog entry and make sure you map the newly introduced metric in https://github.com/grafana/mimir/blob/main/pkg/ruler/manager_metrics.go? |
636224f
to
4253538
Compare
Done! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Approving so that you don't need another review from me, please make sure you update mimir prometheus before merging this PR.
5a418ba
to
f76a168
Compare
Iterates on #8146 The `isGroupAtRisk` function only uses the group's last evaluation time as a metric However, if the concurrency of the group causes the group's eval time to lower to less than the threshold, this will flap between enabling concurrency and disabling it on every run In this PR, a condition is added to also sum up the last evaluation time of each rule to compare against the threshold
Co-authored-by: gotjosh <[email protected]>
f76a168
to
1228428
Compare
* Rule Concurrency: Prevent flapping of concurrency Iterates on #8146 The `isGroupAtRisk` function only uses the group's last evaluation time as a metric However, if the concurrency of the group causes the group's eval time to lower to less than the threshold, this will flap between enabling concurrency and disabling it on every run In this PR, a condition is added to also sum up the last evaluation time of each rule to compare against the threshold * Linting * Use the new `evaluationRuleTimeSum` field from the group * Linting * Add changelog + metric * Apply suggestions from code review Co-authored-by: gotjosh <[email protected]> * Unrevert crypto * Fix typo in changelog --------- Co-authored-by: gotjosh <[email protected]>
* Rule Concurrency: Prevent flapping of concurrency Iterates on #8146 The `isGroupAtRisk` function only uses the group's last evaluation time as a metric However, if the concurrency of the group causes the group's eval time to lower to less than the threshold, this will flap between enabling concurrency and disabling it on every run In this PR, a condition is added to also sum up the last evaluation time of each rule to compare against the threshold * Linting * Use the new `evaluationRuleTimeSum` field from the group * Linting * Add changelog + metric * Apply suggestions from code review Co-authored-by: gotjosh <[email protected]> * Unrevert crypto * Fix typo in changelog --------- Co-authored-by: gotjosh <[email protected]>
* Update mimir-prometheus weekly-r321 This includes: - grafana/mimir-prometheus#807 - grafana/mimir-prometheus#806 Signed-off-by: Oleg Zaytsev <[email protected]> * Rule Concurrency: Prevent flapping of concurrency (#10189) * Rule Concurrency: Prevent flapping of concurrency Iterates on #8146 The `isGroupAtRisk` function only uses the group's last evaluation time as a metric However, if the concurrency of the group causes the group's eval time to lower to less than the threshold, this will flap between enabling concurrency and disabling it on every run In this PR, a condition is added to also sum up the last evaluation time of each rule to compare against the threshold * Linting * Use the new `evaluationRuleTimeSum` field from the group * Linting * Add changelog + metric * Apply suggestions from code review Co-authored-by: gotjosh <[email protected]> * Unrevert crypto * Fix typo in changelog --------- Co-authored-by: gotjosh <[email protected]> --------- Signed-off-by: Oleg Zaytsev <[email protected]> Co-authored-by: Julien Duchesne <[email protected]> Co-authored-by: gotjosh <[email protected]>
What this PR does
Iterates on #8146
The
isGroupAtRisk
function only uses the group's last evaluation time as a metricHowever, if the concurrency of the group causes the group's eval time to lower to less than the threshold, this will flap between enabling concurrency and disabling it on every run
In this PR, a condition is added to also sum up the last evaluation time of each rule to compare against the threshold
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]
.about-versioning.md
updated with experimental features.