Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add configurable timeout #36

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

liamstevens
Copy link
Member

@liamstevens liamstevens commented Jan 30, 2025

Purpose

This PR adds an optional member to the Config struct used for passing arguments to the plugin.

This member configures how long the waiter will wait before returning a timeout.

Context

This is part of CSRE-4986 and is part of enhancing this for long running tasks.

Additionally, there are some changes to the task runner function, abstracting inputs and breaking out part to an individual function, with interfaces defined to pass mocks+actual dependencies into them for runtime.

@liamstevens liamstevens force-pushed the feat/add-configurable-timeout branch from f8e8fb0 to 4807429 Compare January 30, 2025 06:06
@liamstevens liamstevens force-pushed the feat/add-configurable-timeout branch from 4807429 to a53729e Compare January 30, 2025 06:12
@liamstevens liamstevens requested review from therealvio and a team January 30, 2025 06:13
if err != nil {
_ = buildKiteAgent.Annotate(ctx, fmt.Sprintf("Task did not complete successfully within timeout %v", result.Failures[0]), "error", "ecs-task-runner")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm discarding this error because we're already erroring here. It's a bit late in the afternoon and I don't know how to handle this in an idiomatic way with my frazzled brain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave it for tomorrow/next week in that case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this: the two messages may not be accurate together.

Some ideas/thoughts:

  • Do we know if a timeout is considered a failure/error?
    • If it's a failure, can we check the Failure contents and return a fmt.Errorf using the newly-added message, or the generic.
    • If timeouts are not a failure, would a concurrent waiter function be suitable here that returns a fmt.Error? (I don't personally like this idea at all, besides adding complexity, it just sounds like dookie design)
    • This could all be answered with: What happens when a timeout occurs in this WaitForCompletion function?
  • Change up the messaging entirely (coward's way out 😤)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this: the two messages may not be accurate together.

Some ideas/thoughts:

Do we know if a timeout is considered a failure/error?
If it's a failure, can we check the Failure contents and return a fmt.Errorf using the newly->added message, or the generic.
If timeouts are not a failure, would a concurrent waiter function be suitable here that >returns a fmt.Error? (I don't personally like this idea at all, besides adding complexity, it >just sounds like dookie design)
This could all be answered with: What happens when a timeout occurs in this WaitForCompletion >function?
Change up the messaging entirely (coward's way out 😤)

Yeah, so a few things that ultimately change how we need to approach this:

  • Timeouts do not return any actual ECS error messages. Since the type that interacts with them is actually divorced from the infra side of things we only get error details in scenario that something goes wrong upstream - a task is unable to start due to capacity provider config, for instance.
  • Instead we get an error returned: fmt.Errorf("exceeded max wait time for TasksStopped waiter") (ref: go/pkg/mod/github.com/aws/aws-sdk-go-v2/service/[email protected]/api_op_DescribeTasks.go:569).
  • In fact, you cannot expect a *DescribeTasksOutput in any circumstance where the wait method returns an error. So, we should assume that errors are actually going to be issues encountered in the method itself, rather than anything upstream. The inverse is only partially true, we are going to have cases where the task succeeds end-to-end, and others where a DescribeTasksOutput.Failure array is populated.
  • I'm thinking we can use this as a sentinel error to annotate the build in this specific case, and we should annotate with any failure messages that are returned in the case where the return is successful.

Copy link
Contributor

@therealvio therealvio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plugin.yml will need to be updated so the property is available through the plugin config.

@liamstevens
Copy link
Member Author

The plugin.yml will need to be updated so the property is available through the plugin config.

It felt like I had forgotten something.

src/plugin/task-runner.go Show resolved Hide resolved
if err != nil {
_ = buildKiteAgent.Annotate(ctx, fmt.Sprintf("Task did not complete successfully within timeout %v", result.Failures[0]), "error", "ecs-task-runner")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this: the two messages may not be accurate together.

Some ideas/thoughts:

  • Do we know if a timeout is considered a failure/error?
    • If it's a failure, can we check the Failure contents and return a fmt.Errorf using the newly-added message, or the generic.
    • If timeouts are not a failure, would a concurrent waiter function be suitable here that returns a fmt.Error? (I don't personally like this idea at all, besides adding complexity, it just sounds like dookie design)
    • This could all be answered with: What happens when a timeout occurs in this WaitForCompletion function?
  • Change up the messaging entirely (coward's way out 😤)

src/plugin/config_test.go Outdated Show resolved Hide resolved
src/aws/ecs_test.go Show resolved Hide resolved
@liamstevens liamstevens force-pushed the feat/add-configurable-timeout branch from 60f4f80 to c6dd545 Compare January 31, 2025 05:10
@liamstevens liamstevens force-pushed the feat/add-configurable-timeout branch from dfe100e to b6d57d1 Compare January 31, 2025 05:55
Copy link
Contributor

@therealvio therealvio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job! It should be a matter of adding a test to ensure behaviour of the timeout interaction and we're good to close.

Comment on lines 66 to 85
err := buildKiteAgent.Annotate(ctx, fmt.Sprintf("Task did not complete successfully within timeout (%d seconds)", config.TimeOut), "error", "ecs-task-runner")
if err != nil {
return fmt.Errorf("failed to annotate buildkite with task timeout failure: %w", err)
}
}
bkerr := buildKiteAgent.Annotate(ctx, fmt.Sprintf("failed to wait for task completion: %v\n", err), "error", "ecs-task-runner")
if bkerr != nil {
return fmt.Errorf("failed to annotate buildkite with task wait failure: %w, annotation error: %w", err, bkerr)
}
} else if len(result.Failures) > 0 {
// There is still a scenario where the task could return failures but this isn't handled by the waiter
// This is due to the waiter only returning errors in scenarios where there are issues querying the task
// or scheduling the task. For a list of the Failures that can be returned in this case, see:
// https://docs.aws.amazon.com/AmazonECS/latest/developerguide/api_failures_messages.html
// specifically, under the `DescribeTasks` API.
err := buildKiteAgent.Annotate(ctx, fmt.Sprintf("Task did not complete successfully: %v", result.Failures[0]), "error", "ecs-task-runner")
if err != nil {
return fmt.Errorf("failed to annotate buildkite with task failure: %w", err)
}
return fmt.Errorf("task did not complete successfully: %v", result.Failures[0])
Copy link
Contributor

@therealvio therealvio Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is MUCH better 🙌 . There's some new error-handling logic, and I don't believe our current test suite handles that.

Can we look at adding a test for this? Methinks we should be able to mock a return to force the timeout message. This will need to be a new test function, and we can skip the table test since it's only one case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Circling back to this, I only got some time at the end of the day to take a peek.

It looks like testing this set of error handling might not be super easy to do currently. We will probably need to add some modifications to the definition of TaskRunner to allow for us to mock the waiter interface.

I'll take a look at that tomorrow if I'm on deck.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sweat. Happy to put our heads together on this too.

@liamstevens liamstevens force-pushed the feat/add-configurable-timeout branch from 8891615 to 88cc997 Compare February 7, 2025 05:30
@liamstevens liamstevens force-pushed the feat/add-configurable-timeout branch from 88cc997 to 470f2a8 Compare February 7, 2025 05:39
Copy link

github-actions bot commented Feb 7, 2025

Package Line Rate Health
github.com/cultureamp/ecs-task-runner-buildkite-plugin/aws 83%
github.com/cultureamp/ecs-task-runner-buildkite-plugin/buildkite 0%
github.com/cultureamp/ecs-task-runner-buildkite-plugin 0%
github.com/cultureamp/ecs-task-runner-buildkite-plugin/plugin 4%
Summary 39% (95 / 246)

1 similar comment
Copy link

github-actions bot commented Feb 7, 2025

Package Line Rate Health
github.com/cultureamp/ecs-task-runner-buildkite-plugin/aws 83%
github.com/cultureamp/ecs-task-runner-buildkite-plugin/buildkite 0%
github.com/cultureamp/ecs-task-runner-buildkite-plugin 0%
github.com/cultureamp/ecs-task-runner-buildkite-plugin/plugin 4%
Summary 39% (95 / 246)

}}, nil
},
},
"slowpoke": {
Copy link
Contributor

@therealvio therealvio Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is beanz, or slowpoke?

This may impact the tests since the strings match

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants