PIPE-631 Backoff exponentially when job acquisition fails #3153

Merged: moskyb merged 4 commits into main from expo-backoff-for-acquire on Jan 9, 2025

Conversation

moskyb (Contributor) commented on Jan 9, 2025

Description

During recent incidents, we've run into situations where, in acquire-job mode, the agent can act like a bit of a thundering herd: we retry every 3 seconds, and we don't jitter. If a lot of jobs are acquired at once while Buildkite is having a bad time, it tells them all to go away at the same moment, and they all come back exactly 3 seconds later, so the load arrives in synchronised spikes.

This PR updates the retry strategy the agent uses when acquiring jobs, making the following changes:

  • The basic strategy used in roko has been updated from

    roko.WithMaxAttempts(9)
    roko.WithStrategy(roko.Constant(3*time.Second))

    to

    roko.WithMaxAttempts(7)
    roko.WithStrategy(roko.Exponential(2*time.Second, 0))

    Effectively, this means that we'll make fewer requests over a longer period of time, with the delay growing exponentially with each successive attempt. Incidentally, the retrier was already documented as using this approach, but it wasn't actually doing so.

  • We've added jitter to the retry strategy using roko's new WithJitterRange feature - we jitter on a range of [-1s, 5s]. This means that the wait between each outgoing request will be adjusted by a random amount between -1 and 5 seconds.

  • The retry strategy will now respect server-defined backoff periods, supplied in the Retry-After response header. This header is only respected for status codes 423 Locked, 429 Too Many Requests, and any 5xx status. If the Retry-After header is absent from the response, the agent falls back to the default exponential backoff strategy (there's a sketch of this flow just after this list).

  • The agent will no longer retry responses with a 4xx status code other than 423 Locked or 429 Too Many Requests.
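
To make the shape of this concrete, here's a minimal sketch of the new acquisition flow. It is not the PR's actual code: fakeAcquire and acquireWithBackoff are made-up placeholders for the agent's real API call, and it assumes roko's SetNextInterval and Break helpers are available for the Retry-After override and the give-up-on-other-4xx path.

    package main

    import (
        "context"
        "errors"
        "fmt"
        "net/http"
        "strconv"
        "time"

        "github.com/buildkite/roko"
    )

    // fakeAcquire stands in for the agent's real acquire-job API call; it's a
    // placeholder so the retry shape below is self-contained.
    func fakeAcquire(ctx context.Context) (*http.Response, error) {
        return nil, errors.New("not implemented")
    }

    func acquireWithBackoff(ctx context.Context) error {
        // Same retrier shape as the PR: up to 7 attempts, delays growing
        // exponentially from a 2s base, each wait nudged by a random amount
        // in the range [-1s, 5s].
        r := roko.NewRetrier(
            roko.WithMaxAttempts(7),
            roko.WithStrategy(roko.Exponential(2*time.Second, 0)),
            roko.WithJitterRange(-time.Second, 5*time.Second),
        )

        return r.DoWithContext(ctx, func(rtr *roko.Retrier) error {
            resp, err := fakeAcquire(ctx)
            if err == nil {
                return nil
            }

            if resp != nil {
                switch {
                case resp.StatusCode == http.StatusLocked,
                    resp.StatusCode == http.StatusTooManyRequests,
                    resp.StatusCode >= 500:
                    // Honour a server-supplied Retry-After (whole seconds) if
                    // present; otherwise fall back to the exponential wait.
                    if secs, convErr := strconv.Atoi(resp.Header.Get("Retry-After")); convErr == nil {
                        rtr.SetNextInterval(time.Duration(secs) * time.Second)
                    }
                case resp.StatusCode >= 400:
                    // Any other 4xx is treated as permanent: stop retrying.
                    rtr.Break()
                }
            }

            return fmt.Errorf("acquiring job: %w", err)
        })
    }

    func main() {
        if err := acquireWithBackoff(context.Background()); err != nil {
            fmt.Println("job acquisition failed:", err)
        }
    }

The jitter is the key piece for the thundering-herd problem: it spreads retries from many agents across a window instead of letting them all fire in lockstep.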

As an ancillary change, the agent wasn't respecting SIGTERM while in the backoff loop attempting to acquire a job. This is a bug that's unlikely to occur in customer environments, but it was sure annoying when testing these changes locally, so I've added a little listener that cancels the context if it detects that the AgentWorker is stopping.
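
Here's a minimal sketch of that kind of listener, assuming a hypothetical stopping channel that is closed when the worker is told to stop; the real AgentWorker wiring differs, but the idea is the same: cancel the context so the retrier's backoff sleep returns promptly.

    package worker

    import "context"

    // cancelOnStop returns a child context that is cancelled when either the
    // parent context ends or the stopping channel closes, so a retrier blocked
    // in its backoff sleep can bail out promptly instead of ignoring SIGTERM.
    func cancelOnStop(parent context.Context, stopping <-chan struct{}) (context.Context, context.CancelFunc) {
        ctx, cancel := context.WithCancel(parent)

        go func() {
            select {
            case <-stopping: // the worker is stopping (e.g. SIGTERM received)
                cancel()
            case <-ctx.Done(): // acquisition finished on its own
            }
        }()

        return ctx, cancel
    }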

Context

PIPE-631

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go fmt ./...)

core/client.go (outdated diff)
roko.WithStrategy(roko.Constant(3*time.Second)),
roko.WithMaxAttempts(9),
roko.WithStrategy(roko.Exponential(2*time.Second, 0)),
roko.WithJitterRange(-time.Second, 5*time.Second),
moskyb (Contributor, Author) commented:

Source: I made it up. Happy to change this to something more considered.

moskyb force-pushed the expo-backoff-for-acquire branch 2 times, most recently from 5ae65d7 to 3972a3a, on January 9, 2025 at 04:08
moskyb changed the title from "Expo backoff for acquire" to "Backoff exponentially when job acquisition fails" on January 9, 2025
moskyb force-pushed the expo-backoff-for-acquire branch from 3972a3a to e2fc113 on January 9, 2025 at 06:41
moskyb marked this pull request as ready for review on January 9, 2025 at 06:44
moskyb changed the title from "Backoff exponentially when job acquisition fails" to "PIPE-631 Backoff exponentially when job acquisition fails" on January 9, 2025
defer cancel()

// Acquire the job using the ID we were provided. We'll retry as best we can on non 422 error.
// Except for 423 errors, in which we exponentially back off under the direction of the API
// setting the Retry-After header
r := roko.NewRetrier(
roko.WithMaxAttempts(10),
roko.WithStrategy(roko.Constant(3*time.Second)),
roko.WithMaxAttempts(7),
A Member commented:

Drive-by question: do we want any runtime operational control of this number? i.e. maybe something that's returned from the agent registration endpoint?

I know we could build that gear, but the real question is: do we want it?

moskyb (Contributor, Author) replied:

Side note: your comment made me realise that the code comment above is now a lie. I'll fix that.

To your actual question: for this particular value (the max retry count for acquire-job), I don't really think so. Acquire mode is generally intended for customers running their own custom schedulers, and if their acquire fails after 7 (in this case) attempts, they can probably just call buildkite-agent start --acquire-job xxx again, so I think the utility of being able to change it server-side would be limited.

In a more general sense, being able to remote-customise certain agent behaviour would be super useful. Doing stuff like dynamically slowing down ping intervals, or setting an agent-wide rate limit ("you in particular are only allowed to make 1 request per second now", or something), would be pretty cool. I think we'd need a pretty compelling initial use case to build that for, though.

moskyb force-pushed the expo-backoff-for-acquire branch from e2fc113 to 2d6caf7 on January 9, 2025 at 07:32
@CerealBoy (Contributor) left a comment:

LGTM 👍🏼

I think we could look at adding further customisation options in the future, though perhaps not right now.

Splitting out more of the response codes and handling them more pedantically is a great addition here, especially with the comments that provide guidance on why these responses are handled the way they are.

moskyb force-pushed the expo-backoff-for-acquire branch from 2d6caf7 to d52a6d0 on January 9, 2025 at 23:07
moskyb merged commit d850b29 into main on January 9, 2025
1 check passed
moskyb deleted the expo-backoff-for-acquire branch on January 9, 2025 at 23:28