ruler: cap the number of remote eval retries #10375
Conversation
The retries happen more aggressively than the actual evaluations. With the current setup, an error spike results in 3x the query rate: the initial query plus two retries in fairly quick succession, 100ms and 200ms after it. This PR changes that so that the whole process doesn't retry more than a fixed number of queries/sec. I chose 170 because at GL the average rate is 340 evals/sec per ruler, so this would retry about half of the rules on average. _On average_ that should increase query load by 50%.

Signed-off-by: Dimitar Dimitrov <[email protected]>
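To make the mechanism concrete, here is a minimal sketch of a process-wide retry limiter, assuming golang.org/x/time/rate; evalWithRetries and retryLimiter are illustrative names, not Mimir's actual code:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// One limiter shared by every rule evaluation in the process, so the total
// retry rate is capped at 170/sec no matter how many rules fail at once.
// The numbers mirror the PR description; everything else is illustrative.
var retryLimiter = rate.NewLimiter(rate.Limit(170), 170)

func evalWithRetries(ctx context.Context, eval func(context.Context) error) error {
	var err error
	for attempt := 1; ; attempt++ {
		if err = eval(ctx); err == nil {
			return nil
		}
		if attempt == 3 {
			return err // initial query plus two retries, as before
		}
		// Reserve a slot instead of aborting when over the limit; the
		// reservation says how long to wait until the retry is allowed.
		res := retryLimiter.Reserve()
		select {
		case <-time.After(res.Delay()):
		case <-ctx.Done():
			res.Cancel() // give the unused slot back to the limiter
			return errors.Join(err, ctx.Err())
		}
	}
}

func main() {
	calls := 0
	err := evalWithRetries(context.Background(), func(context.Context) error {
		calls++
		if calls < 3 {
			return fmt.Errorf("transient failure %d", calls)
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err) // calls: 3 err: <nil>
}

Sharing a single limiter is what caps the process as a whole rather than each rule individually.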
Looks great! Thanks, @dimitarvdimitrov!
The PR title and description seem to apply to a circuit-breaker version.
level.Warn(logger).Log("msg", "failed to remotely evaluate query expression, will retry", "err", err, "retry_delay", retryDelay)

select {
case <-time.After(retryDelay):
Reviewing note: I checked up on this:
// As of Go 1.23, the garbage collector can recover unreferenced,
// unstopped timers.
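For context, a standalone sketch of that pattern; waitRetry is an illustrative name, not the function in this diff:

package main

import (
	"context"
	"fmt"
	"time"
)

// waitRetry blocks for retryDelay but returns early if ctx is cancelled.
// On the cancellation path the time.After timer is simply abandoned;
// since Go 1.23 the garbage collector reclaims such unstopped timers,
// so no explicit Stop is needed here.
func waitRetry(ctx context.Context, retryDelay time.Duration) error {
	select {
	case <-time.After(retryDelay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	fmt.Println(waitRetry(ctx, time.Second)) // prints: context deadline exceeded
}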
The backport to r316 failed.

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-10375-to-r316 origin/r316
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x ffee57de406dd651dccf104db25ee93ec46a3c0e
# Push it to GitHub
git push --set-upstream origin backport-10375-to-r316
git switch main
# Remove the local backport branch
git branch -D backport-10375-to-r316

Then, create a pull request where the base branch is r316 and the compare branch is backport-10375-to-r316.
The backport to r320 failed.

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-10375-to-r320 origin/r320
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x ffee57de406dd651dccf104db25ee93ec46a3c0e
# Push it to GitHub
git push --set-upstream origin backport-10375-to-r320
git switch main
# Remove the local backport branch
git branch -D backport-10375-to-r320

Then, create a pull request where the base branch is r320 and the compare branch is backport-10375-to-r320.
The backport to r321 failed.

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-10375-to-r321 origin/r321
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x ffee57de406dd651dccf104db25ee93ec46a3c0e
# Push it to GitHub
git push --set-upstream origin backport-10375-to-r321
git switch main
# Remove the local backport branch
git branch -D backport-10375-to-r321

Then, create a pull request where the base branch is r321 and the compare branch is backport-10375-to-r321.
The backport to r319 failed.

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-10375-to-r319 origin/r319
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x ffee57de406dd651dccf104db25ee93ec46a3c0e
# Push it to GitHub
git push --set-upstream origin backport-10375-to-r319
git switch main
# Remove the local backport branch
git branch -D backport-10375-to-r319

Then, create a pull request where the base branch is r319 and the compare branch is backport-10375-to-r319.
* ruler: cap the number of remote eval retries
* Add CHANGELOG.md entry
* Fix a totally arbitrary stupid linter rule
* Use a CB instead of a rate limiter
* Revert "Use a CB instead of a rate limiter" (reverts commit b07366f)
* Don't abort retries if we're over the rate limit
* Cancel reservation when context expires

Signed-off-by: Dimitar Dimitrov <[email protected]>
(cherry picked from commit ffee57d)
* ruler: cap the number of remote eval retries (#10375)
  (cherry picked from commit ffee57d)
* Fix flag

Signed-off-by: Dimitar Dimitrov <[email protected]>
Problem
The retries happen more aggressively than the actual evaluations. With the current setup, an error spike results in 3x the query rate: the initial query plus two retries in fairly quick succession, 100ms and 200ms after it.
What this PR does
Introduce a soft rate limit for the retries. The default is 170 retries/sec, which is half the average rate of rule evaluations in clusters at GL. If a retry would exceed the rate limit, we wait (via time.Sleep) until it is within the limit, while also not exceeding a 2s backoff. The idea is not to overextend into the next evaluation.
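As a sketch of how that wait could be computed, assuming golang.org/x/time/rate; nextRetryDelay and the exact interplay with the backoff schedule are illustrative, not the actual implementation:

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// nextRetryDelay combines the per-attempt backoff (100ms, then 200ms) with
// the wait imposed by the shared 170/sec limiter, capped at 2s so a retry
// doesn't overextend into the next evaluation. Illustrative sketch only.
func nextRetryDelay(backoffDelay time.Duration, lim *rate.Limiter) time.Duration {
	delay := lim.Reserve().Delay() // wait until the limiter admits this retry
	if delay < backoffDelay {
		delay = backoffDelay // never retry sooner than the backoff schedule
	}
	if delay > 2*time.Second {
		delay = 2 * time.Second // soft limit: proceed anyway after 2s
	}
	return delay
}

func main() {
	lim := rate.NewLimiter(rate.Limit(170), 170)
	for _, backoff := range []time.Duration{100 * time.Millisecond, 200 * time.Millisecond} {
		time.Sleep(nextRetryDelay(backoff, lim))
		fmt.Println("retrying after backoff", backoff)
	}
}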
Which issue(s) this PR fixes or relates to

Fixes #
Checklist

- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
- about-versioning.md updated with experimental features.