Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ruler: cap the number of remote eval retries #10375

Merged
merged 7 commits into from
Jan 9, 2025

Conversation

dimitarvdimitrov
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov commented Jan 8, 2025

Problem

The retries happen more aggressively than actual evaluations. With the current setup an error spike results in 3x the query rate - initial query, and two retries fairly quickly 100ms & 200ms after that.

What this PR does

Introduce a soft rate limit for the retries. The default is 170 retries/sec, which is half the average rate of rule evaluations in clusters at GL. If a retry is above the rate limit, we'd wait (time.Sleep) until it is within the limit, while also not exceeding a 2s backoff. The idea is not to overextend into the next evaluation.

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

The retries happen more aggressively than actual evaluations. With the current setup an error spike results in 3x the query rate - initial query, and two retries fairly quickly 100ms & 200ms after that.

This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average evals/sec is 340 per ruler. This would retry about half of the rules on average. _On average_ that should increase query load by 50%.

Signed-off-by: Dimitar Dimitrov <[email protected]>
Copy link
Contributor

github-actions bot commented Jan 8, 2025

💻 Deploy preview deleted.

Copy link
Contributor

@tacole02 tacole02 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks, @dimitarvdimitrov !

@dimitarvdimitrov dimitarvdimitrov marked this pull request as draft January 9, 2025 08:27
@dimitarvdimitrov dimitarvdimitrov force-pushed the dimitar/ruler/bound-eval-retry-rate branch from daa45c5 to 93cf6a1 Compare January 9, 2025 11:17
Copy link
Contributor

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR title and description seem to apply to a circuit-breaker version.

level.Warn(logger).Log("msg", "failed to remotely evaluate query expression, will retry", "err", err, "retry_delay", retryDelay)

select {
case <-time.After(retryDelay):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing note: I checked up on this:

// As of Go 1.23, the garbage collector can recover unreferenced,
// unstopped timers. 

@dimitarvdimitrov dimitarvdimitrov marked this pull request as ready for review January 9, 2025 13:54
@dimitarvdimitrov dimitarvdimitrov enabled auto-merge (squash) January 9, 2025 13:58
@dimitarvdimitrov dimitarvdimitrov merged commit ffee57d into main Jan 9, 2025
31 checks passed
@dimitarvdimitrov dimitarvdimitrov deleted the dimitar/ruler/bound-eval-retry-rate branch January 9, 2025 14:12
@grafanabot
Copy link
Contributor

The backport to r316 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-10375-to-r316 origin/r316
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x ffee57de406dd651dccf104db25ee93ec46a3c0e
# Push it to GitHub
git push --set-upstream origin backport-10375-to-r316
git switch main
# Remove the local backport branch
git branch -D backport-10375-to-r316

Then, create a pull request where the base branch is r316 and the compare/head branch is backport-10375-to-r316.

@grafanabot
Copy link
Contributor

The backport to r320 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-10375-to-r320 origin/r320
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x ffee57de406dd651dccf104db25ee93ec46a3c0e
# Push it to GitHub
git push --set-upstream origin backport-10375-to-r320
git switch main
# Remove the local backport branch
git branch -D backport-10375-to-r320

Then, create a pull request where the base branch is r320 and the compare/head branch is backport-10375-to-r320.

@grafanabot
Copy link
Contributor

The backport to r321 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-10375-to-r321 origin/r321
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x ffee57de406dd651dccf104db25ee93ec46a3c0e
# Push it to GitHub
git push --set-upstream origin backport-10375-to-r321
git switch main
# Remove the local backport branch
git branch -D backport-10375-to-r321

Then, create a pull request where the base branch is r321 and the compare/head branch is backport-10375-to-r321.

@grafanabot
Copy link
Contributor

The backport to r319 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-10375-to-r319 origin/r319
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x ffee57de406dd651dccf104db25ee93ec46a3c0e
# Push it to GitHub
git push --set-upstream origin backport-10375-to-r319
git switch main
# Remove the local backport branch
git branch -D backport-10375-to-r319

Then, create a pull request where the base branch is r319 and the compare/head branch is backport-10375-to-r319.

dimitarvdimitrov added a commit that referenced this pull request Jan 9, 2025
* ruler: cap the number of remote eval retries

The retries happen more aggressively than actual evaluations. With the current setup an error spike results in 3x the query rate - initial query, and two retries fairly quickly 100ms & 200ms after that.

This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average evals/sec is 340 per ruler. This would retry about half of the rules on average. _On average_ that should increase query load by 50%.

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Add CHANGELOG.md entry

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Fix a totally arbitrary stupid linter rule

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Use a CB instead of a rate limtier

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Revert "Use a CB instead of a rate limtier"

This reverts commit b07366f.

* Don't abort retries if we're over the rate limit

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Cancel reservation when context expires

Signed-off-by: Dimitar Dimitrov <[email protected]>

---------

Signed-off-by: Dimitar Dimitrov <[email protected]>
(cherry picked from commit ffee57d)
dimitarvdimitrov added a commit that referenced this pull request Jan 9, 2025
* ruler: cap the number of remote eval retries

The retries happen more aggressively than actual evaluations. With the current setup an error spike results in 3x the query rate - initial query, and two retries fairly quickly 100ms & 200ms after that.

This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average evals/sec is 340 per ruler. This would retry about half of the rules on average. _On average_ that should increase query load by 50%.

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Add CHANGELOG.md entry

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Fix a totally arbitrary stupid linter rule

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Use a CB instead of a rate limtier

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Revert "Use a CB instead of a rate limtier"

This reverts commit b07366f.

* Don't abort retries if we're over the rate limit

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Cancel reservation when context expires

Signed-off-by: Dimitar Dimitrov <[email protected]>

---------

Signed-off-by: Dimitar Dimitrov <[email protected]>
(cherry picked from commit ffee57d)
dimitarvdimitrov added a commit that referenced this pull request Jan 9, 2025
* ruler: cap the number of remote eval retries

The retries happen more aggressively than actual evaluations. With the current setup an error spike results in 3x the query rate - initial query, and two retries fairly quickly 100ms & 200ms after that.

This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average evals/sec is 340 per ruler. This would retry about half of the rules on average. _On average_ that should increase query load by 50%.

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Add CHANGELOG.md entry

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Fix a totally arbitrary stupid linter rule

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Use a CB instead of a rate limtier

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Revert "Use a CB instead of a rate limtier"

This reverts commit b07366f.

* Don't abort retries if we're over the rate limit

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Cancel reservation when context expires

Signed-off-by: Dimitar Dimitrov <[email protected]>

---------

Signed-off-by: Dimitar Dimitrov <[email protected]>
(cherry picked from commit ffee57d)
dimitarvdimitrov added a commit that referenced this pull request Jan 9, 2025
* ruler: cap the number of remote eval retries

The retries happen more aggressively than actual evaluations. With the current setup an error spike results in 3x the query rate - initial query, and two retries fairly quickly 100ms & 200ms after that.

This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average evals/sec is 340 per ruler. This would retry about half of the rules on average. _On average_ that should increase query load by 50%.

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Add CHANGELOG.md entry

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Fix a totally arbitrary stupid linter rule

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Use a CB instead of a rate limtier

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Revert "Use a CB instead of a rate limtier"

This reverts commit b07366f.

* Don't abort retries if we're over the rate limit

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Cancel reservation when context expires

Signed-off-by: Dimitar Dimitrov <[email protected]>

---------

Signed-off-by: Dimitar Dimitrov <[email protected]>
(cherry picked from commit ffee57d)
dimitarvdimitrov added a commit that referenced this pull request Jan 9, 2025
* ruler: cap the number of remote eval retries

The retries happen more aggressively than actual evaluations. With the current setup an error spike results in 3x the query rate - initial query, and two retries fairly quickly 100ms & 200ms after that.

This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average evals/sec is 340 per ruler. This would retry about half of the rules on average. _On average_ that should increase query load by 50%.

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Add CHANGELOG.md entry

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Fix a totally arbitrary stupid linter rule

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Use a CB instead of a rate limtier

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Revert "Use a CB instead of a rate limtier"

This reverts commit b07366f.

* Don't abort retries if we're over the rate limit

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Cancel reservation when context expires

Signed-off-by: Dimitar Dimitrov <[email protected]>

---------

Signed-off-by: Dimitar Dimitrov <[email protected]>
(cherry picked from commit ffee57d)

Co-authored-by: Dimitar Dimitrov <[email protected]>
dimitarvdimitrov added a commit that referenced this pull request Jan 9, 2025
* ruler: cap the number of remote eval retries

The retries happen more aggressively than actual evaluations. With the current setup an error spike results in 3x the query rate - initial query, and two retries fairly quickly 100ms & 200ms after that.

This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average evals/sec is 340 per ruler. This would retry about half of the rules on average. _On average_ that should increase query load by 50%.

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Add CHANGELOG.md entry

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Fix a totally arbitrary stupid linter rule

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Use a CB instead of a rate limtier

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Revert "Use a CB instead of a rate limtier"

This reverts commit b07366f.

* Don't abort retries if we're over the rate limit

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Cancel reservation when context expires

Signed-off-by: Dimitar Dimitrov <[email protected]>

---------

Signed-off-by: Dimitar Dimitrov <[email protected]>
(cherry picked from commit ffee57d)
dimitarvdimitrov added a commit that referenced this pull request Jan 10, 2025
* ruler: cap the number of remote eval retries (#10375)

* ruler: cap the number of remote eval retries

The retries happen more aggressively than actual evaluations. With the current setup an error spike results in 3x the query rate - initial query, and two retries fairly quickly 100ms & 200ms after that.

This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average evals/sec is 340 per ruler. This would retry about half of the rules on average. _On average_ that should increase query load by 50%.

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Add CHANGELOG.md entry

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Fix a totally arbitrary stupid linter rule

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Use a CB instead of a rate limtier

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Revert "Use a CB instead of a rate limtier"

This reverts commit b07366f.

* Don't abort retries if we're over the rate limit

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Cancel reservation when context expires

Signed-off-by: Dimitar Dimitrov <[email protected]>

---------

Signed-off-by: Dimitar Dimitrov <[email protected]>
(cherry picked from commit ffee57d)

* Fix flag

Signed-off-by: Dimitar Dimitrov <[email protected]>

---------

Signed-off-by: Dimitar Dimitrov <[email protected]>
dimitarvdimitrov added a commit that referenced this pull request Jan 10, 2025
* ruler: cap the number of remote eval retries (#10375)

* ruler: cap the number of remote eval retries

The retries happen more aggressively than actual evaluations. With the current setup an error spike results in 3x the query rate - initial query, and two retries fairly quickly 100ms & 200ms after that.

This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average evals/sec is 340 per ruler. This would retry about half of the rules on average. _On average_ that should increase query load by 50%.

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Add CHANGELOG.md entry

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Fix a totally arbitrary stupid linter rule

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Use a CB instead of a rate limtier

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Revert "Use a CB instead of a rate limtier"

This reverts commit b07366f.

* Don't abort retries if we're over the rate limit

Signed-off-by: Dimitar Dimitrov <[email protected]>

* Cancel reservation when context expires

Signed-off-by: Dimitar Dimitrov <[email protected]>

---------

Signed-off-by: Dimitar Dimitrov <[email protected]>
(cherry picked from commit ffee57d)

* Fix flag

Signed-off-by: Dimitar Dimitrov <[email protected]>

---------

Signed-off-by: Dimitar Dimitrov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants