Rule Concurrency: Prevent flapping of concurrency #10189

Merged: 8 commits, Dec 17, 2024


@julienduchesne julienduchesne commented Dec 9, 2024

What this PR does

Iterates on #8146

The isGroupAtRisk function only uses the group's last evaluation time as its metric.
However, if running the group concurrently lowers its evaluation time below the threshold, the group flaps between enabling and disabling concurrency on every run.

This PR adds a condition that also sums the last evaluation time of each rule and compares that sum against the threshold.

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.


@gotjosh gotjosh left a comment

This looks good given it fixes the immediate problem. However, I'd prefer that we have two fields for the runtime: one without concurrency and one with concurrency, which we'd only update once the group finishes evaluating.

I think this will make the runtimes more deterministic and it'll also help us see the difference between expected runtime vs actual runtime (if we expose it as a metric).

The complex part is that we need to do that in Prometheus.


gotjosh commented Dec 13, 2024

We also discussed this offline, and we're going to implement this change directly in Prometheus. We agreed that we should do two things:

  1. Keep an atomic with the total runtime of the rules of the group (not how long the group actually took to evaluate). We can then use that directly to determine whether this group is eligible for concurrency or not.
  2. Add a gauge similar to prometheus_rule_group_last_duration_seconds so that we can actually tell when a rule is taking advantage of concurrency.

@julienduchesne julienduchesne force-pushed the julienduchesne/concurrency-no-flapping branch from 3c5ae76 to 7b03ca1 Compare December 13, 2024 17:35

gotjosh commented Dec 13, 2024

Can you please add a changelog entry and make sure you map the newly introduced metric in https://github.com/grafana/mimir/blob/main/pkg/ruler/manager_metrics.go?

@julienduchesne julienduchesne force-pushed the julienduchesne/concurrency-no-flapping branch from 636224f to 4253538 Compare December 13, 2024 21:06
julienduchesne commented

> Can you please add a changelog entry and make sure you map the newly introduced metric in https://github.com/grafana/mimir/blob/main/pkg/ruler/manager_metrics.go?

Done!


@gotjosh gotjosh left a comment

LGTM

Approving so that you don't need another review from me. Please make sure you update mimir-prometheus before merging this PR.

@julienduchesne julienduchesne force-pushed the julienduchesne/concurrency-no-flapping branch from 5a418ba to f76a168 Compare December 16, 2024 15:44
julienduchesne and others added 6 commits December 17, 2024 08:55
Iterates on #8146

The `isGroupAtRisk` function only uses the group's last evaluation time as a metric
However, if the concurrency of the group causes the group's eval time to lower to less than the threshold, this will flap between enabling concurrency and disabling it on every run

In this PR, a condition is added to also sum up the last evaluation time of each rule to compare against the threshold
@julienduchesne julienduchesne force-pushed the julienduchesne/concurrency-no-flapping branch from f76a168 to 1228428 Compare December 17, 2024 13:57
@julienduchesne julienduchesne enabled auto-merge (squash) December 17, 2024 14:19
@julienduchesne julienduchesne merged commit 3dc5104 into main Dec 17, 2024
29 checks passed
@julienduchesne julienduchesne deleted the julienduchesne/concurrency-no-flapping branch December 17, 2024 14:35
julienduchesne added a commit that referenced this pull request Dec 17, 2024
* Rule Concurrency: Prevent flapping of concurrency

Iterates on #8146

The `isGroupAtRisk` function only uses the group's last evaluation time as a metric
However, if the concurrency of the group causes the group's eval time to lower to less than the threshold, this will flap between enabling concurrency and disabling it on every run

In this PR, a condition is added to also sum up the last evaluation time of each rule to compare against the threshold

* Linting

* Use the new `evaluationRuleTimeSum` field from the group

* Linting

* Add changelog + metric

* Apply suggestions from code review

Co-authored-by: gotjosh <[email protected]>

* Unrevert crypto

* Fix typo in changelog

---------

Co-authored-by: gotjosh <[email protected]>
julienduchesne added a commit that referenced this pull request Dec 17, 2024
* Update mimir-prometheus weekly-r321

This includes:
- grafana/mimir-prometheus#807
- grafana/mimir-prometheus#806

Signed-off-by: Oleg Zaytsev <[email protected]>

* Rule Concurrency: Prevent flapping of concurrency (#10189)

* Rule Concurrency: Prevent flapping of concurrency

Iterates on #8146

The `isGroupAtRisk` function only uses the group's last evaluation time as a metric
However, if the concurrency of the group causes the group's eval time to lower to less than the threshold, this will flap between enabling concurrency and disabling it on every run

In this PR, a condition is added to also sum up the last evaluation time of each rule to compare against the threshold

* Linting

* Use the new `evaluationRuleTimeSum` field from the group

* Linting

* Add changelog + metric

* Apply suggestions from code review

Co-authored-by: gotjosh <[email protected]>

* Unrevert crypto

* Fix typo in changelog

---------

Co-authored-by: gotjosh <[email protected]>

---------

Signed-off-by: Oleg Zaytsev <[email protected]>
Co-authored-by: Julien Duchesne <[email protected]>
Co-authored-by: gotjosh <[email protected]>