Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add configurable timeout #36

Open
wants to merge 12 commits into
base: main
Choose a base branch
from
2 changes: 2 additions & 0 deletions plugin.yml
Original file line number Diff line number Diff line change
Expand Up @@ -9,3 +9,5 @@ configuration:
type: string
script:
type: string
timeout:
type: integer
5 changes: 2 additions & 3 deletions src/aws/ecs.go
Original file line number Diff line number Diff line change
Expand Up @@ -59,11 +59,10 @@ func SubmitTask(ctx context.Context, ecsAPI EcsClientAPI, input *TaskRunnerConfi
return *response.Tasks[0].TaskArn, nil
}

func WaitForCompletion(ctx context.Context, waiter ecsWaiterAPI, taskArn string) (*ecs.DescribeTasksOutput, error) {
func WaitForCompletion(ctx context.Context, waiter ecsWaiterAPI, taskArn string, timeOut int) (*ecs.DescribeTasksOutput, error) {
cluster := ClusterFromTaskArn(taskArn)

// TODO: This magic number will be resolved in a future piece of work, not going to refactor this right now
maxWaitDuration := 15 * time.Minute //nolint:mnd
maxWaitDuration := time.Duration(timeOut) * time.Second
result, err := waiter.WaitForOutput(ctx, &ecs.DescribeTasksInput{
Cluster: aws.String(cluster),
Tasks: []string{taskArn},
Expand Down
2 changes: 1 addition & 1 deletion src/aws/ecs_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -365,7 +365,7 @@ func TestWaitForCompletion(t *testing.T) {

for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
result, err := WaitForCompletion(context.TODO(), tc.waiter, tc.input)
result, err := WaitForCompletion(context.TODO(), tc.waiter, tc.input, 15)
therealvio marked this conversation as resolved.
Show resolved Hide resolved
t.Logf("result: '%v'", err)
t.Logf("expected: detail: %v, reason: %v", *tc.expected.Failures[0].Detail, *tc.expected.Failures[0].Reason)

Expand Down
1 change: 1 addition & 0 deletions src/plugin/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@ import (
type Config struct {
ParameterName string `required:"true" split_words:"true"`
Script string `required:"true" split_words:"true"`
TimeOut int `default:"2700" split_words:"true"`
}

type EnvironmentConfigFetcher struct {
Expand Down
10 changes: 10 additions & 0 deletions src/plugin/config_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ func TestFailOnMissingRequiredEnvironment(t *testing.T) {
disabledEnvVars: []string{
"BUILDKITE_PLUGIN_ECS_TASK_RUNNER_PARAMETER_NAME",
"BUILDKITE_PLUGIN_ECS_TASK_RUNNER_SCRIPT",
"BUILDKITE_PLUGIN_ECS_TASK_RUNNER_TIMEOUT",
},
enabledEnvVars: map[string]string{},
expectedErr: "required key BUILDKITE_PLUGIN_ECS_TASK_RUNNER_PARAMETER_NAME missing value",
Expand Down Expand Up @@ -72,18 +73,27 @@ func TestFailOnMissingRequiredEnvironment(t *testing.T) {
func TestFetchConfigFromEnvironment(t *testing.T) {
unsetEnv(t, "BUILDKITE_PLUGIN_ECS_TASK_RUNNER_PARAMETER_NAME")
unsetEnv(t, "BUILDKITE_PLUGIN_ECS_TASK_RUNNER_SCRIPT")
unsetEnv(t, "BUILDKITE_PLUGIN_ECS_TASK_RUNNER_TIME_OUT")

var config plugin.Config
fetcher := plugin.EnvironmentConfigFetcher{}

t.Setenv("BUILDKITE_PLUGIN_ECS_TASK_RUNNER_PARAMETER_NAME", "test-parameter")
t.Setenv("BUILDKITE_PLUGIN_ECS_TASK_RUNNER_SCRIPT", "hello-world")
t.Setenv("BUILDKITE_PLUGIN_ECS_TASK_RUNNER_TIME_OUT", "600")

err := fetcher.Fetch(&config)

require.NoError(t, err, "fetch should not error")
assert.Equal(t, "test-parameter", config.ParameterName, "fetched message should match environment")
assert.Equal(t, "hello-world", config.Script, "fetched message should match environment")
assert.Equal(t, 600, config.TimeOut, "fetched message should match environment")
liamstevens marked this conversation as resolved.
Show resolved Hide resolved

// test default value
unsetEnv(t, "BUILDKITE_PLUGIN_ECS_TASK_RUNNER_TIME_OUT")
err = fetcher.Fetch(&config)
require.NoError(t, err, "fetch should not error")
assert.Equal(t, 2700, config.TimeOut, "fetched message should match environment")
}

func unsetEnv(t *testing.T, key string) {
Expand Down
4 changes: 3 additions & 1 deletion src/plugin/task-runner.go
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ func (trp TaskRunnerPlugin) Run(ctx context.Context, fetcher ConfigFetcher) erro
if err != nil {
return fmt.Errorf("plugin configuration error: %w", err)
}
buildKiteAgent := buildkite.Agent{}

buildkite.Log("Executing task-runner plugin\n")

Expand Down Expand Up @@ -57,8 +58,9 @@ func (trp TaskRunnerPlugin) Run(ctx context.Context, fetcher ConfigFetcher) erro
// TODO: This is currently a magic number. If we want this to be configurable, remove the nolint directive and fix it up
o.MaxDelay = 10 * time.Second //nolint:mnd
therealvio marked this conversation as resolved.
Show resolved Hide resolved
})
result, err := awsinternal.WaitForCompletion(ctx, waiterClient, taskArn)
result, err := awsinternal.WaitForCompletion(ctx, waiterClient, taskArn, config.TimeOut)
if err != nil {
_ = buildKiteAgent.Annotate(ctx, fmt.Sprintf("Task did not complete successfully within timeout %v", result.Failures[0]), "error", "ecs-task-runner")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm discarding this error because we're already erroring here. It's a bit late in the afternoon and I don't know how to handle this in an idiomatic way with my frazzled brain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave it for tomorrow/next week in that case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this: the two messages may not be accurate together.

Some ideas/thoughts:

  • Do we know if a timeout is considered a failure/error?
    • If it's a failure, can we check the Failure contents and return a fmt.Errorf using the newly-added message, or the generic.
    • If timeouts are not a failure, would a concurrent waiter function be suitable here that returns a fmt.Error? (I don't personally like this idea at all, besides adding complexity, it just sounds like dookie design)
    • This could all be answered with: What happens when a timeout occurs in this WaitForCompletion function?
  • Change up the messaging entirely (coward's way out 😤)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this: the two messages may not be accurate together.

Some ideas/thoughts:

Do we know if a timeout is considered a failure/error?
If it's a failure, can we check the Failure contents and return a fmt.Errorf using the newly->added message, or the generic.
If timeouts are not a failure, would a concurrent waiter function be suitable here that >returns a fmt.Error? (I don't personally like this idea at all, besides adding complexity, it >just sounds like dookie design)
This could all be answered with: What happens when a timeout occurs in this WaitForCompletion >function?
Change up the messaging entirely (coward's way out 😤)

Yeah, so a few things that ultimately change how we need to approach this:

  • Timeouts do not return any actual ECS error messages. Since the type that interacts with them is actually divorced from the infra side of things we only get error details in scenario that something goes wrong upstream - a task is unable to start due to capacity provider config, for instance.
  • Instead we get an error returned: fmt.Errorf("exceeded max wait time for TasksStopped waiter") (ref: go/pkg/mod/github.com/aws/aws-sdk-go-v2/service/[email protected]/api_op_DescribeTasks.go:569).
  • In fact, you cannot expect a *DescribeTasksOutput in any circumstance where the wait method returns an error. So, we should assume that errors are actually going to be issues encountered in the method itself, rather than anything upstream. The inverse is only partially true, we are going to have cases where the task succeeds end-to-end, and others where a DescribeTasksOutput.Failure array is populated.
  • I'm thinking we can use this as a sentinel error to annotate the build in this specific case, and we should annotate with any failure messages that are returned in the case where the return is successful.

return fmt.Errorf("failed to wait for task completion: %w\nFailure information: %v", err, result.Failures[0])
}
// In a successful scenario for task completion, we would have a `tasks` slice with a single element
Expand Down