-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add configurable timeout #36
base: main
Are you sure you want to change the base?
Conversation
f8e8fb0
to
4807429
Compare
4807429
to
a53729e
Compare
src/plugin/task-runner.go
Outdated
if err != nil { | ||
_ = buildKiteAgent.Annotate(ctx, fmt.Sprintf("Task did not complete successfully within timeout %v", result.Failures[0]), "error", "ecs-task-runner") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm discarding this error because we're already erroring here. It's a bit late in the afternoon and I don't know how to handle this in an idiomatic way with my frazzled brain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave it for tomorrow/next week in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this: the two messages may not be accurate together.
Some ideas/thoughts:
- Do we know if a timeout is considered a failure/error?
- If it's a failure, can we check the Failure contents and return a
fmt.Errorf
using the newly-added message, or the generic. - If timeouts are not a failure, would a concurrent waiter function be suitable here that returns a
fmt.Error
? (I don't personally like this idea at all, besides adding complexity, it just sounds like dookie design) - This could all be answered with: What happens when a timeout occurs in this
WaitForCompletion
function?
- If it's a failure, can we check the Failure contents and return a
- Change up the messaging entirely (coward's way out 😤)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this: the two messages may not be accurate together.
Some ideas/thoughts:
Do we know if a timeout is considered a failure/error?
If it's a failure, can we check the Failure contents and return a fmt.Errorf using the newly->added message, or the generic.
If timeouts are not a failure, would a concurrent waiter function be suitable here that >returns a fmt.Error? (I don't personally like this idea at all, besides adding complexity, it >just sounds like dookie design)
This could all be answered with: What happens when a timeout occurs in this WaitForCompletion >function?
Change up the messaging entirely (coward's way out 😤)
Yeah, so a few things that ultimately change how we need to approach this:
- Timeouts do not return any actual ECS error messages. Since the type that interacts with them is actually divorced from the infra side of things we only get error details in scenario that something goes wrong upstream - a task is unable to start due to capacity provider config, for instance.
- Instead we get an error returned:
fmt.Errorf("exceeded max wait time for TasksStopped waiter")
(ref:go/pkg/mod/github.com/aws/aws-sdk-go-v2/service/[email protected]/api_op_DescribeTasks.go:569
). - In fact, you cannot expect a
*DescribeTasksOutput
in any circumstance where the wait method returns an error. So, we should assume that errors are actually going to be issues encountered in the method itself, rather than anything upstream. The inverse is only partially true, we are going to have cases where the task succeeds end-to-end, and others where aDescribeTasksOutput.Failure
array is populated. - I'm thinking we can use this as a sentinel error to annotate the build in this specific case, and we should annotate with any failure messages that are returned in the case where the return is successful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plugin.yml will need to be updated so the property is available through the plugin config.
It felt like I had forgotten something. |
src/plugin/task-runner.go
Outdated
if err != nil { | ||
_ = buildKiteAgent.Annotate(ctx, fmt.Sprintf("Task did not complete successfully within timeout %v", result.Failures[0]), "error", "ecs-task-runner") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this: the two messages may not be accurate together.
Some ideas/thoughts:
- Do we know if a timeout is considered a failure/error?
- If it's a failure, can we check the Failure contents and return a
fmt.Errorf
using the newly-added message, or the generic. - If timeouts are not a failure, would a concurrent waiter function be suitable here that returns a
fmt.Error
? (I don't personally like this idea at all, besides adding complexity, it just sounds like dookie design) - This could all be answered with: What happens when a timeout occurs in this
WaitForCompletion
function?
- If it's a failure, can we check the Failure contents and return a
- Change up the messaging entirely (coward's way out 😤)
60f4f80
to
c6dd545
Compare
dfe100e
to
b6d57d1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job! It should be a matter of adding a test to ensure behaviour of the timeout interaction and we're good to close.
src/plugin/task-runner.go
Outdated
err := buildKiteAgent.Annotate(ctx, fmt.Sprintf("Task did not complete successfully within timeout (%d seconds)", config.TimeOut), "error", "ecs-task-runner") | ||
if err != nil { | ||
return fmt.Errorf("failed to annotate buildkite with task timeout failure: %w", err) | ||
} | ||
} | ||
bkerr := buildKiteAgent.Annotate(ctx, fmt.Sprintf("failed to wait for task completion: %v\n", err), "error", "ecs-task-runner") | ||
if bkerr != nil { | ||
return fmt.Errorf("failed to annotate buildkite with task wait failure: %w, annotation error: %w", err, bkerr) | ||
} | ||
} else if len(result.Failures) > 0 { | ||
// There is still a scenario where the task could return failures but this isn't handled by the waiter | ||
// This is due to the waiter only returning errors in scenarios where there are issues querying the task | ||
// or scheduling the task. For a list of the Failures that can be returned in this case, see: | ||
// https://docs.aws.amazon.com/AmazonECS/latest/developerguide/api_failures_messages.html | ||
// specifically, under the `DescribeTasks` API. | ||
err := buildKiteAgent.Annotate(ctx, fmt.Sprintf("Task did not complete successfully: %v", result.Failures[0]), "error", "ecs-task-runner") | ||
if err != nil { | ||
return fmt.Errorf("failed to annotate buildkite with task failure: %w", err) | ||
} | ||
return fmt.Errorf("task did not complete successfully: %v", result.Failures[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is MUCH better 🙌 . There's some new error-handling logic, and I don't believe our current test suite handles that.
Can we look at adding a test for this? Methinks we should be able to mock a return to force the timeout message. This will need to be a new test function, and we can skip the table test since it's only one case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Circling back to this, I only got some time at the end of the day to take a peek.
It looks like testing this set of error handling might not be super easy to do currently. We will probably need to add some modifications to the definition of TaskRunner
to allow for us to mock the waiter interface.
I'll take a look at that tomorrow if I'm on deck.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No sweat. Happy to put our heads together on this too.
8891615
to
88cc997
Compare
88cc997
to
470f2a8
Compare
|
1 similar comment
|
}}, nil | ||
}, | ||
}, | ||
"slowpoke": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is beanz, or slowpoke?
This may impact the tests since the strings match
Purpose
This PR adds an optional member to the Config struct used for passing arguments to the plugin.
This member configures how long the waiter will wait before returning a timeout.
Context
This is part of CSRE-4986 and is part of enhancing this for long running tasks.
Additionally, there are some changes to the task runner function, abstracting inputs and breaking out part to an individual function, with interfaces defined to pass mocks+actual dependencies into them for runtime.