PIPE-631 Backoff exponentially when job acquisition fails #3153

Merged: moskyb merged 4 commits into main from expo-backoff-for-acquire on Jan 9, 2025

Conversation

moskyb (Contributor) commented on Jan 9, 2025

Description

During recent incidents, we've run into situations where, in acquire-job mode, the agent can act like a bit of a thundering herd: we retry every 3 seconds, and we don't jitter. If a lot of jobs are acquired at once while Buildkite is having a bad time, it tells them all to go away at the same moment, and they all come back exactly 3 seconds later, so the load arrives in synchronised spikes.

This PR updates the retry strategy the agent uses when acquiring jobs, making the following changes:

  • The basic strategy used in roko has been updated from

    roko.WithMaxAttempts(9)
    roko.WithStrategy(roko.Constant(3*time.Second))

    to

    roko.WithMaxAttempts(7)
    roko.WithStrategy(roko.Exponential(2*time.Second, 0))

    Effectively, this means that we'll make fewer requests over a longer period of time, with the delay growing exponentially with each successive attempt. Incidentally, the retrier was already documented as using this approach, but it wasn't actually doing so.

  • We've added jitter to the retry strategy using roko's new WithJitterRange feature - we jitter on a range of [-1s, 5s]. This means that the wait between each outgoing request will be adjusted by a random amount between -1 and 5 seconds.

  • The retry strategy will now respect server-defined backoff periods, supplied in the Retry-After response header. This header is only respected for status codes 423 Locked, 429 Too Many Requests, and any 5xx status. If the Retry-After header is absent from the response, the agent falls back to the default exponential backoff strategy (there's a sketch of this flow just after this list).

  • The agent will no longer retry responses with a 4xx status code other than 423 Locked or 429 Too Many Requests.
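
To make the shape of this concrete, here's a minimal sketch of the new acquisition flow. It is not the PR's actual code: fakeAcquire and acquireWithBackoff are made-up placeholders for the agent's real API call, and it assumes roko's SetNextInterval and Break helpers are available for the Retry-After override and the give-up-on-other-4xx path.

    package main

    import (
        "context"
        "errors"
        "fmt"
        "net/http"
        "strconv"
        "time"

        "github.com/buildkite/roko"
    )

    // fakeAcquire stands in for the agent's real acquire-job API call; it's a
    // placeholder so the retry shape below is self-contained.
    func fakeAcquire(ctx context.Context) (*http.Response, error) {
        return nil, errors.New("not implemented")
    }

    func acquireWithBackoff(ctx context.Context) error {
        // Same retrier shape as the PR: up to 7 attempts, delays growing
        // exponentially from a 2s base, each wait nudged by a random amount
        // in the range [-1s, 5s].
        r := roko.NewRetrier(
            roko.WithMaxAttempts(7),
            roko.WithStrategy(roko.Exponential(2*time.Second, 0)),
            roko.WithJitterRange(-time.Second, 5*time.Second),
        )

        return r.DoWithContext(ctx, func(rtr *roko.Retrier) error {
            resp, err := fakeAcquire(ctx)
            if err == nil {
                return nil
            }

            if resp != nil {
                switch {
                case resp.StatusCode == http.StatusLocked,
                    resp.StatusCode == http.StatusTooManyRequests,
                    resp.StatusCode >= 500:
                    // Honour a server-supplied Retry-After (whole seconds) if
                    // present; otherwise fall back to the exponential wait.
                    if secs, convErr := strconv.Atoi(resp.Header.Get("Retry-After")); convErr == nil {
                        rtr.SetNextInterval(time.Duration(secs) * time.Second)
                    }
                case resp.StatusCode >= 400:
                    // Any other 4xx is treated as permanent: stop retrying.
                    rtr.Break()
                }
            }

            return fmt.Errorf("acquiring job: %w", err)
        })
    }

    func main() {
        if err := acquireWithBackoff(context.Background()); err != nil {
            fmt.Println("job acquisition failed:", err)
        }
    }

The jitter is the key piece for the thundering-herd problem: it spreads retries from many agents across a window instead of letting them all fire in lockstep.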

As an ancillary change, the agent wasn't respecting SIGTERM while in the backoff loop attempting to acquire a job. This is a bug that's unlikely to occur in customer environments, but it was sure annoying when testing these changes locally, so I've added a little listener that cancels the context if it detects that the AgentWorker is stopping.
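
Here's a minimal sketch of that kind of listener, assuming a hypothetical stopping channel that is closed when the worker is told to stop; the real AgentWorker wiring differs, but the idea is the same: cancel the context so the retrier's backoff sleep returns promptly.

    package worker

    import "context"

    // cancelOnStop returns a child context that is cancelled when either the
    // parent context ends or the stopping channel closes, so a retrier blocked
    // in its backoff sleep can bail out promptly instead of ignoring SIGTERM.
    func cancelOnStop(parent context.Context, stopping <-chan struct{}) (context.Context, context.CancelFunc) {
        ctx, cancel := context.WithCancel(parent)

        go func() {
            select {
            case <-stopping: // the worker is stopping (e.g. SIGTERM received)
                cancel()
            case <-ctx.Done(): // acquisition finished on its own
            }
        }()

        return ctx, cancel
    }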

Context

PIPE-631

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go fmt ./...)

core/client.go (outdated diff)
roko.WithStrategy(roko.Constant(3*time.Second)),
roko.WithMaxAttempts(9),
roko.WithStrategy(roko.Exponential(2*time.Second, 0)),
roko.WithJitterRange(-time.Second, 5*time.Second),
moskyb (Contributor, Author) commented:

Source: I made it up. Happy to change this to something more considered.

moskyb force-pushed the expo-backoff-for-acquire branch 2 times, most recently from 5ae65d7 to 3972a3a, on January 9, 2025 at 04:08
moskyb changed the title from "Expo backoff for acquire" to "Backoff exponentially when job acquisition fails" on January 9, 2025
moskyb force-pushed the expo-backoff-for-acquire branch from 3972a3a to e2fc113 on January 9, 2025 at 06:41
moskyb marked this pull request as ready for review on January 9, 2025 at 06:44
moskyb changed the title from "Backoff exponentially when job acquisition fails" to "PIPE-631 Backoff exponentially when job acquisition fails" on January 9, 2025
defer cancel()

// Acquire the job using the ID we were provided. We'll retry as best we can on non 422 error.
// Except for 423 errors, in which we exponentially back off under the direction of the API
// setting the Retry-After header
r := roko.NewRetrier(
roko.WithMaxAttempts(10),
roko.WithStrategy(roko.Constant(3*time.Second)),
roko.WithMaxAttempts(7),
A Member commented:

Drive-by question: do we want any runtime operational control of this number? i.e. maybe something that's returned from the agent registration endpoint?

I know we could build that gear, but the real question is: do we want it?

moskyb (Contributor, Author) replied:

Side note: your comment made me realise that the code comment above is now a lie. I'll fix that.

To your actual question: for this particular value (the max retry count for acquire-job), I don't really think so. Acquire mode is generally intended for customers running their own custom schedulers, and if their acquire fails after 7 (in this case) attempts, they can probably just call buildkite-agent start --acquire-job xxx again, so I think the utility of being able to change it server-side would be limited.

In a more general sense, being able to remote-customise certain agent behaviour would be super useful. Doing stuff like dynamically slowing down ping intervals, or setting an agent-wide rate limit ("you in particular are only allowed to make 1 request per second now", or something), would be pretty cool. I think we'd need a pretty compelling initial use case to build that for, though.

moskyb force-pushed the expo-backoff-for-acquire branch from e2fc113 to 2d6caf7 on January 9, 2025 at 07:32
@CerealBoy (Contributor) left a comment:

LGTM 👍🏼

I think we could look at adding further customisation options in the future, though perhaps not right now.

Splitting out more of the response codes and handling them more pedantically is a great addition here, especially with the comments that provide guidance on why these responses are handled the way they are.

moskyb force-pushed the expo-backoff-for-acquire branch from 2d6caf7 to d52a6d0 on January 9, 2025 at 23:07
moskyb merged commit d850b29 into main on January 9, 2025
1 check passed
moskyb deleted the expo-backoff-for-acquire branch on January 9, 2025 at 23:28