ruler: cap the number of remote eval retries #10375

dimitarvdimitrov · 2025-01-08T13:00:25Z

Problem

Retries happen more aggressively than the evaluations themselves. With the current setup, an error spike results in 3x the query rate: the initial query, plus two retries in quick succession, 100ms and 200ms after that.

What this PR does

Introduce a soft rate limit for retries. The default is 170 retries/sec, which is half the average rate of rule evaluations per ruler in clusters at GL. If a retry would exceed the rate limit, we wait (time.Sleep) until it is within the limit, but never longer than a 2s backoff. The idea is not to overextend into the next evaluation.

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

Tests updated.
Documentation added.
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
about-versioning.md updated with experimental features.

The retries happen more aggressively than actual evaluations. With the current setup an error spike results in 3x the query rate - initial query, and two retries fairly quickly 100ms & 200ms after that. This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average evals/sec is 340 per ruler. This would retry about half of the rules on average. _On average_ that should increase query load by 50%.

Signed-off-by: Dimitar Dimitrov <[email protected]>
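The 3x and 50% figures above can be checked with a few lines of arithmetic. This is only a worked example under the worst-case assumption that every evaluation fails; `peakLoadUncapped` and `peakLoadCapped` are hypothetical helper names, not part of the ruler.

```go
package main

import "fmt"

// peakLoadUncapped returns queries/sec when every failed evaluation is
// retried the given number of times shortly after the initial query.
func peakLoadUncapped(evalsPerSec float64, retries int) float64 {
	return evalsPerSec * float64(1+retries)
}

// peakLoadCapped returns queries/sec when retries across the whole process
// are limited to retryLimit per second.
func peakLoadCapped(evalsPerSec, retryLimit float64) float64 {
	return evalsPerSec + retryLimit
}

func main() {
	const evals, limit = 340.0, 170.0 // figures quoted in the commit message

	fmt.Println(peakLoadUncapped(evals, 2))           // 1020 queries/sec, i.e. 3x
	fmt.Println(peakLoadCapped(evals, limit))         // 510 queries/sec
	fmt.Println(peakLoadCapped(evals, limit) / evals) // 1.5, i.e. +50%
}
```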

github-actions · 2025-01-08T13:02:03Z

💻 Deploy preview deleted.

tacole02

Looks great! Thanks, @dimitarvdimitrov !

bboreham

The PR title and description seem to apply to a circuit-breaker version.

bboreham · 2025-01-09T13:09:17Z

pkg/ruler/remotequerier.go

+		level.Warn(logger).Log("msg", "failed to remotely evaluate query expression, will retry", "err", err, "retry_delay", retryDelay)
+
+		select {
+		case <-time.After(retryDelay):


Reviewing note: I checked up on this:

// As of Go 1.23, the garbage collector can recover unreferenced,
// unstopped timers.

grafanabot · 2025-01-09T14:16:52Z

The backport to r316 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-10375-to-r316 origin/r316
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x ffee57de406dd651dccf104db25ee93ec46a3c0e
# Push it to GitHub
git push --set-upstream origin backport-10375-to-r316
git switch main
# Remove the local backport branch
git branch -D backport-10375-to-r316

Then, create a pull request where the base branch is r316 and the compare/head branch is backport-10375-to-r316.

grafanabot · 2025-01-09T14:21:51Z

The backport to r320 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-10375-to-r320 origin/r320
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x ffee57de406dd651dccf104db25ee93ec46a3c0e
# Push it to GitHub
git push --set-upstream origin backport-10375-to-r320
git switch main
# Remove the local backport branch
git branch -D backport-10375-to-r320

Then, create a pull request where the base branch is r320 and the compare/head branch is backport-10375-to-r320.

grafanabot · 2025-01-09T14:21:52Z

The backport to r321 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-10375-to-r321 origin/r321
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x ffee57de406dd651dccf104db25ee93ec46a3c0e
# Push it to GitHub
git push --set-upstream origin backport-10375-to-r321
git switch main
# Remove the local backport branch
git branch -D backport-10375-to-r321

Then, create a pull request where the base branch is r321 and the compare/head branch is backport-10375-to-r321.

grafanabot · 2025-01-09T14:21:56Z

The backport to r319 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-10375-to-r319 origin/r319
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x ffee57de406dd651dccf104db25ee93ec46a3c0e
# Push it to GitHub
git push --set-upstream origin backport-10375-to-r319
git switch main
# Remove the local backport branch
git branch -D backport-10375-to-r319

Then, create a pull request where the base branch is r319 and the compare/head branch is backport-10375-to-r319.

* ruler: cap the number of remote eval retries

  The retries happen more aggressively than actual evaluations. With the current setup an error spike results in 3x the query rate - initial query, and two retries fairly quickly 100ms & 200ms after that.

  This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average evals/sec is 340 per ruler. This would retry about half of the rules on average. _On average_ that should increase query load by 50%.

  Signed-off-by: Dimitar Dimitrov <[email protected]>

* Add CHANGELOG.md entry

  Signed-off-by: Dimitar Dimitrov <[email protected]>

* Fix a totally arbitrary stupid linter rule

  Signed-off-by: Dimitar Dimitrov <[email protected]>

* Use a CB instead of a rate limtier

  Signed-off-by: Dimitar Dimitrov <[email protected]>

* Revert "Use a CB instead of a rate limtier"

  This reverts commit b07366f.

* Don't abort retries if we're over the rate limit

  Signed-off-by: Dimitar Dimitrov <[email protected]>

* Cancel reservation when context expires

  Signed-off-by: Dimitar Dimitrov <[email protected]>

---------

Signed-off-by: Dimitar Dimitrov <[email protected]>
(cherry picked from commit ffee57d)

* ruler: cap the number of remote eval retries (#10375)

  * ruler: cap the number of remote eval retries

    The retries happen more aggressively than actual evaluations. With the current setup an error spike results in 3x the query rate - initial query, and two retries fairly quickly 100ms & 200ms after that.

    This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average evals/sec is 340 per ruler. This would retry about half of the rules on average. _On average_ that should increase query load by 50%.

    Signed-off-by: Dimitar Dimitrov <[email protected]>

  * Add CHANGELOG.md entry

    Signed-off-by: Dimitar Dimitrov <[email protected]>

  * Fix a totally arbitrary stupid linter rule

    Signed-off-by: Dimitar Dimitrov <[email protected]>

  * Use a CB instead of a rate limtier

    Signed-off-by: Dimitar Dimitrov <[email protected]>

  * Revert "Use a CB instead of a rate limtier"

    This reverts commit b07366f.

  * Don't abort retries if we're over the rate limit

    Signed-off-by: Dimitar Dimitrov <[email protected]>

  * Cancel reservation when context expires

    Signed-off-by: Dimitar Dimitrov <[email protected]>

  ---------

  Signed-off-by: Dimitar Dimitrov <[email protected]>
  (cherry picked from commit ffee57d)

* Fix flag

  Signed-off-by: Dimitar Dimitrov <[email protected]>

---------

Signed-off-by: Dimitar Dimitrov <[email protected]>

dimitarvdimitrov requested review from a team and tacole02 as code owners January 8, 2025 13:00

dimitarvdimitrov added the component/ruler label Jan 8, 2025

dimitarvdimitrov added 2 commits January 8, 2025 15:02

Add CHANGELOG.md entry

59e2c49

Signed-off-by: Dimitar Dimitrov <[email protected]>

Fix a totally arbitrary stupid linter rule

b020a8f

Signed-off-by: Dimitar Dimitrov <[email protected]>

tacole02 approved these changes Jan 8, 2025

dimitarvdimitrov marked this pull request as draft January 9, 2025 08:27

dimitarvdimitrov added 3 commits January 9, 2025 12:56

Use a CB instead of a rate limtier

b07366f

Signed-off-by: Dimitar Dimitrov <[email protected]>

Revert "Use a CB instead of a rate limtier"

f19496b

This reverts commit b07366f.

Don't abort retries if we're over the rate limit

93cf6a1

Signed-off-by: Dimitar Dimitrov <[email protected]>

dimitarvdimitrov force-pushed the dimitar/ruler/bound-eval-retry-rate branch from daa45c5 to 93cf6a1 Compare January 9, 2025 11:17

Cancel reservation when context expires

2064f52

Signed-off-by: Dimitar Dimitrov <[email protected]>

bboreham approved these changes Jan 9, 2025

dimitarvdimitrov marked this pull request as ready for review January 9, 2025 13:54

dimitarvdimitrov enabled auto-merge (squash) January 9, 2025 13:58

dimitarvdimitrov merged commit ffee57d into main Jan 9, 2025
31 checks passed

dimitarvdimitrov deleted the dimitar/ruler/bound-eval-retry-rate branch January 9, 2025 14:12

dimitarvdimitrov added backport r324 backport r316 labels Jan 9, 2025

grafanabot mentioned this pull request Jan 9, 2025

[r324] ruler: cap the number of remote eval retries #10386

Merged

grafanabot added the backport-failed label Jan 9, 2025

dimitarvdimitrov added backport r319 backport r321 backport r320 labels Jan 9, 2025

dimitarvdimitrov mentioned this pull request Jan 9, 2025

r316: ruler: cap the number of remote eval retries #10387

Merged

dimitarvdimitrov mentioned this pull request Jan 9, 2025

r319: ruler: cap the number of remote eval retries #10388

Closed

dimitarvdimitrov mentioned this pull request Jan 9, 2025

r320: ruler: cap the number of remote eval retries #10389

Merged

dimitarvdimitrov mentioned this pull request Jan 9, 2025

r321: ruler: cap the number of remote eval retries #10390

Merged

dimitarvdimitrov mentioned this pull request Jan 9, 2025

ruler: use max_retries_rate flag #10393

Merged

dimitarvdimitrov mentioned this pull request Jan 10, 2025

ruler: increase retries backoff limit to 1m #10403

Merged
