PIPE-631 Backoff exponentially when job acquisition fails #3153
Conversation
core/client.go
Outdated
roko.WithStrategy(roko.Constant(3*time.Second)),
roko.WithMaxAttempts(9),
roko.WithStrategy(roko.Exponential(2*time.Second, 0)),
roko.WithJitterRange(-time.Second, 5*time.Second),
source: i made it up. happy to change this to something more considered.
defer cancel()

// Acquire the job using the ID we were provided. We'll retry as best we can on non 422 error.
// Except for 423 errors, in which we exponentially back off under the direction of the API
// setting the Retry-After header
r := roko.NewRetrier(
	roko.WithMaxAttempts(10),
	roko.WithStrategy(roko.Constant(3*time.Second)),
	roko.WithMaxAttempts(7),
Drive-by question: do we want any runtime operational control of this number? i.e. maybe something that's returned in the agent registration endpoint?
I know we could build that gear — but the real question is: do we want it?
side note: this comment made me realise that that comment is now a lie. i'll fix that.
to your actual question, for this particular value (the max retry count for acquire job), i don't really think so. acquire mode is generally intended for customers running their own custom schedulers, and if their acquire fails after 7 (in this case) attempts, they can probably call buildkite-agent start --acquire-job xxx again, so i think the utility of being able to change it serverside would be limited.
in a more general sense, being able to remote-customise certain agent behaviour would be super useful - doing stuff like dynamically slowing down ping intervals or setting an agent-wide rate limit ("you in particular are only allowed to make 1 request per second now", or something) would be pretty cool - i think we'd need a pretty useful initial thing to do that for though.
LGTM 👍🏼
I think we could look at having further customisation options in the future, perhaps not for right now though.
Splitting out more of the response codes and handling them more pedantically is a great addition here, especially with the comments that provide guidance around why each one is handled the way it is.
Description
During recent incidents, we've run into situations where, in acquire-job mode, the agent can be a bit of a thundering herd: we retry every 3 seconds and don't jitter. This means that if a lot of jobs are acquired at once while Buildkite is having a bad time, it'll tell all of those agents to go away at once, and they'll all come back 3 seconds later, concentrating the load into the same instant.
This PR updates the retry strategy the agent uses when acquiring jobs, making the following changes:
- The basic strategy used in roko has been updated from roko.Constant(3*time.Second) to roko.Exponential(2*time.Second, 0). Effectively, this means that we'll make fewer requests over a longer period of time, with the delay growing exponentially with each successive attempt. Incidentally, the retrier was already documented to use this approach, but it wasn't actually using it. (A sketch of the resulting configuration follows this list.)
- We've added jitter to the retry strategy using roko's new WithJitterRange feature - we jitter on a range of [-1s, 5s]. This means that the wait between each outgoing request will be adjusted by a random amount between -1 and 5 seconds.
- The retry strategy will now respect server-defined backoff periods, defined in the Retry-After response header. These headers will only be respected for the status codes 423 Locked, 429 Too Many Requests, and any 5xx status. If the Retry-After header is absent from the response, the agent will retry with the default exponential backoff strategy.
- The agent will no longer retry responses with a 4xx status code that isn't 423 Locked or 429 Too Many Requests.
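To make the new behaviour concrete, here is a minimal sketch of a retrier configured this way, based on the roko options visible in the diffs above. The acquireJob helper and the status-code handling inside the callback are illustrative assumptions, not the exact code in this PR.

```go
// Sketch only: acquireJob and the status-code handling are illustrative
// placeholders, not the exact code from this PR.
package example

import (
	"context"
	"errors"
	"net/http"
	"time"

	"github.com/buildkite/roko"
)

// acquireJob is a hypothetical helper that performs a single acquisition
// attempt and returns the HTTP response so the caller can inspect the status.
func acquireJob(ctx context.Context, jobID string) (*http.Response, error) {
	// ... call the Buildkite API here ...
	return nil, errors.New("not implemented")
}

func acquireWithBackoff(ctx context.Context, jobID string) error {
	// Exponential backoff starting at 2s, jittered by a random amount in
	// [-1s, 5s], for up to 7 attempts - mirroring the options in this PR.
	r := roko.NewRetrier(
		roko.WithMaxAttempts(7),
		roko.WithStrategy(roko.Exponential(2*time.Second, 0)),
		roko.WithJitterRange(-time.Second, 5*time.Second),
	)

	return r.DoWithContext(ctx, func(rtr *roko.Retrier) error {
		resp, err := acquireJob(ctx, jobID)
		if err == nil {
			return nil
		}

		if resp != nil {
			switch {
			case resp.StatusCode == http.StatusLocked, resp.StatusCode == http.StatusTooManyRequests:
				// Retryable. A real implementation would also read the
				// Retry-After header here and use it to stretch the next wait.
			case resp.StatusCode >= 400 && resp.StatusCode < 500:
				// Other 4xx responses won't succeed on retry, so give up now.
				rtr.Break()
			}
		}
		return err
	})
}
```

The rtr.Break() call is what expresses "don't retry other 4xx responses": roko stops after the current attempt instead of sleeping and trying again.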
As an ancillary change, the agent wasn't respecting SIGTERMs while in the backoff loop attempting to acquire a job. This is a bug that's unlikely to occur in customer environments, but it was sure annoying when testing these changes locally, so i've added a little listener that cancels the context if it detects that the AgentWorker is stopping (sketched below).
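For illustration, here is a rough sketch of that cancel-on-stop listener pattern. The stop channel and the AgentWorkerish type are assumptions for the example, not the agent's actual internals.

```go
// Sketch only: the stop channel and AgentWorkerish type are assumptions,
// not the agent's real internals.
package example

import "context"

// AgentWorkerish stands in for the real AgentWorker; all the sketch needs is
// a channel that is closed when the worker is told to stop.
type AgentWorkerish struct {
	stop chan struct{}
}

// cancelWhenStopping returns a context that is cancelled as soon as the worker
// begins stopping, so a backoff loop using it exits promptly on SIGTERM
// instead of sleeping out its next retry interval.
func cancelWhenStopping(ctx context.Context, w *AgentWorkerish) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(ctx)
	go func() {
		select {
		case <-w.stop:
			cancel()
		case <-ctx.Done():
			// Acquisition finished or was cancelled elsewhere; nothing to do.
		}
	}()
	return ctx, cancel
}
```

Running the acquisition retrier with the returned context means the backoff loop can be interrupted as soon as the agent begins stopping, rather than only between attempts.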
Context
PIPE-631
Testing
- Tests have run locally (go test ./...). Buildkite employees may check this if the pipeline has run automatically.
- Code is formatted (go fmt ./...)