feat: add configurable timeout #36

liamstevens · 2025-01-30T06:02:36Z

Purpose

This PR adds an optional member to the Config struct used for passing arguments to the plugin.

This member configures how long the waiter will wait before returning a timeout.

Context

This is part of CSRE-4986 and is part of enhancing this for long running tasks.

Additionally, there are some changes to the task runner function, abstracting inputs and breaking out part to an individual function, with interfaces defined to pass mocks+actual dependencies into them for runtime.

liamstevens · 2025-01-30T06:28:11Z

src/plugin/task-runner.go

 	if err != nil {
+		_ = buildKiteAgent.Annotate(ctx, fmt.Sprintf("Task did not complete successfully within timeout %v", result.Failures[0]), "error", "ecs-task-runner")


I'm discarding this error because we're already erroring here. It's a bit late in the afternoon and I don't know how to handle this in an idiomatic way with my frazzled brain.

Let's leave it for tomorrow/next week in that case.

Looking at this: the two messages may not be accurate together.

Some ideas/thoughts:

Do we know if a timeout is considered a failure/error?

If it's a failure, can we check the Failure contents and return a fmt.Errorf using the newly-added message, or the generic.

If timeouts are not a failure, would a concurrent waiter function be suitable here that returns a fmt.Error? (I don't personally like this idea at all, besides adding complexity, it just sounds like dookie design)

This could all be answered with: What happens when a timeout occurs in this WaitForCompletion function?

Change up the messaging entirely (coward's way out 😤)

Looking at this: the two messages may not be accurate together.

Some ideas/thoughts:

Do we know if a timeout is considered a failure/error?
If it's a failure, can we check the Failure contents and return a fmt.Errorf using the newly->added message, or the generic.
If timeouts are not a failure, would a concurrent waiter function be suitable here that >returns a fmt.Error? (I don't personally like this idea at all, besides adding complexity, it >just sounds like dookie design)
This could all be answered with: What happens when a timeout occurs in this WaitForCompletion >function?
Change up the messaging entirely (coward's way out 😤)

Yeah, so a few things that ultimately change how we need to approach this:

Timeouts do not return any actual ECS error messages. Since the type that interacts with them is actually divorced from the infra side of things we only get error details in scenario that something goes wrong upstream - a task is unable to start due to capacity provider config, for instance.

Instead we get an error returned: fmt.Errorf("exceeded max wait time for TasksStopped waiter") (ref: go/pkg/mod/github.com/aws/aws-sdk-go-v2/service/[email protected]/api_op_DescribeTasks.go:569).

In fact, you cannot expect a *DescribeTasksOutput in any circumstance where the wait method returns an error. So, we should assume that errors are actually going to be issues encountered in the method itself, rather than anything upstream. The inverse is only partially true, we are going to have cases where the task succeeds end-to-end, and others where a DescribeTasksOutput.Failure array is populated.

I'm thinking we can use this as a sentinel error to annotate the build in this specific case, and we should annotate with any failure messages that are returned in the case where the return is successful.

therealvio

The plugin.yml will need to be updated so the property is available through the plugin config.

liamstevens · 2025-01-30T06:31:02Z

The plugin.yml will need to be updated so the property is available through the plugin config.

It felt like I had forgotten something.

src/plugin/task-runner.go

therealvio · 2025-01-30T06:48:57Z

src/plugin/task-runner.go

 	if err != nil {
+		_ = buildKiteAgent.Annotate(ctx, fmt.Sprintf("Task did not complete successfully within timeout %v", result.Failures[0]), "error", "ecs-task-runner")


Looking at this: the two messages may not be accurate together.

Some ideas/thoughts:

Do we know if a timeout is considered a failure/error?

If it's a failure, can we check the Failure contents and return a fmt.Errorf using the newly-added message, or the generic.

If timeouts are not a failure, would a concurrent waiter function be suitable here that returns a fmt.Error? (I don't personally like this idea at all, besides adding complexity, it just sounds like dookie design)

This could all be answered with: What happens when a timeout occurs in this WaitForCompletion function?

Change up the messaging entirely (coward's way out 😤)

src/plugin/config_test.go

src/aws/ecs_test.go

therealvio

Nice job! It should be a matter of adding a test to ensure behaviour of the timeout interaction and we're good to close.

therealvio · 2025-02-03T22:10:59Z

src/plugin/task-runner.go

+			err := buildKiteAgent.Annotate(ctx, fmt.Sprintf("Task did not complete successfully within timeout (%d seconds)", config.TimeOut), "error", "ecs-task-runner")
+			if err != nil {
+				return fmt.Errorf("failed to annotate buildkite with task timeout failure: %w", err)
+			}
+		}
+		bkerr := buildKiteAgent.Annotate(ctx, fmt.Sprintf("failed to wait for task completion: %v\n", err), "error", "ecs-task-runner")
+		if bkerr != nil {
+			return fmt.Errorf("failed to annotate buildkite with task wait failure: %w, annotation error: %w", err, bkerr)
+		}
+	} else if len(result.Failures) > 0 {
+		// There is still a scenario where the task could return failures but this isn't handled by the waiter
+		// This is due to the waiter only returning errors in scenarios where there are issues querying the task
+		// or scheduling the task. For a list of the Failures that can be returned in this case, see:
+		// https://docs.aws.amazon.com/AmazonECS/latest/developerguide/api_failures_messages.html
+		// specifically, under the `DescribeTasks` API.
+		err := buildKiteAgent.Annotate(ctx, fmt.Sprintf("Task did not complete successfully: %v", result.Failures[0]), "error", "ecs-task-runner")
+		if err != nil {
+			return fmt.Errorf("failed to annotate buildkite with task failure: %w", err)
+		}
+		return fmt.Errorf("task did not complete successfully: %v", result.Failures[0])


This is MUCH better 🙌 . There's some new error-handling logic, and I don't believe our current test suite handles that.

Can we look at adding a test for this? Methinks we should be able to mock a return to force the timeout message. This will need to be a new test function, and we can skip the table test since it's only one case.

Circling back to this, I only got some time at the end of the day to take a peek.

It looks like testing this set of error handling might not be super easy to do currently. We will probably need to add some modifications to the definition of TaskRunner to allow for us to mock the waiter interface.

I'll take a look at that tomorrow if I'm on deck.

No sweat. Happy to put our heads together on this too.

…this

github-actions · 2025-02-07T15:07:23Z

Package	Line Rate	Health
github.com/cultureamp/ecs-task-runner-buildkite-plugin/aws	83%	✔
github.com/cultureamp/ecs-task-runner-buildkite-plugin/buildkite	0%	❌
github.com/cultureamp/ecs-task-runner-buildkite-plugin	0%	❌
github.com/cultureamp/ecs-task-runner-buildkite-plugin/plugin	4%	❌
Summary	39% (95 / 246)	❌

github-actions · 2025-02-07T15:09:33Z

Package	Line Rate	Health
github.com/cultureamp/ecs-task-runner-buildkite-plugin/aws	83%	✔
github.com/cultureamp/ecs-task-runner-buildkite-plugin/buildkite	0%	❌
github.com/cultureamp/ecs-task-runner-buildkite-plugin	0%	❌
github.com/cultureamp/ecs-task-runner-buildkite-plugin/plugin	4%	❌
Summary	39% (95 / 246)	❌

therealvio · 2025-02-10T01:09:48Z

src/aws/ecs_test.go

+					}}, nil
+			},
+		},
+		"slowpoke": {


is beanz, or slowpoke?

This may impact the tests since the strings match

liamstevens force-pushed the feat/add-configurable-timeout branch from f8e8fb0 to 4807429 Compare January 30, 2025 06:06

liamstevens added 2 commits January 30, 2025 16:12

feat: add configurable timeout

41e8cfc

feat: add annotation of build when wait failed

a53729e

liamstevens force-pushed the feat/add-configurable-timeout branch from 4807429 to a53729e Compare January 30, 2025 06:12

liamstevens requested review from therealvio and a team January 30, 2025 06:13

liamstevens commented Jan 30, 2025

View reviewed changes

therealvio reviewed Jan 30, 2025

View reviewed changes

fix: add attribute to plugin.yml

857d5fb

liamstevens requested a review from therealvio January 30, 2025 06:32

therealvio requested changes Jan 30, 2025

View reviewed changes

fix: rework error handling of waiter

c6dd545

liamstevens force-pushed the feat/add-configurable-timeout branch from 60f4f80 to c6dd545 Compare January 31, 2025 05:10

fix: test descriptions

5385ce6

liamstevens requested a review from therealvio January 31, 2025 05:14

fix: update README

b6d57d1

liamstevens force-pushed the feat/add-configurable-timeout branch from dfe100e to b6d57d1 Compare January 31, 2025 05:55

therealvio reviewed Feb 3, 2025

View reviewed changes

liamstevens force-pushed the feat/add-configurable-timeout branch from 8891615 to 88cc997 Compare February 7, 2025 05:30

feat: add interfaces for task runner and basic test to piggyback off …

470f2a8

…this

liamstevens force-pushed the feat/add-configurable-timeout branch from 88cc997 to 470f2a8 Compare February 7, 2025 05:39

liamstevens and others added 4 commits February 7, 2025 15:46

Merge branch 'main' into feat/add-configurable-timeout

e766d7f

fix: linting

0f51a16

fix: readme

a59458a

fix: linting

4b3cfac

liamstevens requested a review from therealvio February 7, 2025 05:56

therealvio reviewed Feb 10, 2025

View reviewed changes

fix: update test reasoning

23951dd

liamstevens requested a review from therealvio February 10, 2025 01:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add configurable timeout #36

feat: add configurable timeout #36

liamstevens commented Jan 30, 2025 •

edited

Loading

liamstevens Jan 30, 2025

therealvio Jan 30, 2025

therealvio Jan 30, 2025

liamstevens Jan 31, 2025

therealvio left a comment

liamstevens commented Jan 30, 2025

therealvio Jan 30, 2025

therealvio left a comment

therealvio Feb 3, 2025 •

edited

Loading

liamstevens Feb 5, 2025

therealvio Feb 5, 2025

github-actions bot commented Feb 7, 2025

github-actions bot commented Feb 7, 2025

therealvio Feb 10, 2025 •

edited

Loading

		if err != nil {
		_ = buildKiteAgent.Annotate(ctx, fmt.Sprintf("Task did not complete successfully within timeout %v", result.Failures[0]), "error", "ecs-task-runner")

feat: add configurable timeout #36

Are you sure you want to change the base?

feat: add configurable timeout #36

Conversation

liamstevens commented Jan 30, 2025 • edited Loading

Purpose

Context

liamstevens Jan 30, 2025

Choose a reason for hiding this comment

therealvio Jan 30, 2025

Choose a reason for hiding this comment

therealvio Jan 30, 2025

Choose a reason for hiding this comment

liamstevens Jan 31, 2025

Choose a reason for hiding this comment

therealvio left a comment

Choose a reason for hiding this comment

liamstevens commented Jan 30, 2025

therealvio Jan 30, 2025

Choose a reason for hiding this comment

therealvio left a comment

Choose a reason for hiding this comment

therealvio Feb 3, 2025 • edited Loading

Choose a reason for hiding this comment

liamstevens Feb 5, 2025

Choose a reason for hiding this comment

therealvio Feb 5, 2025

Choose a reason for hiding this comment

github-actions bot commented Feb 7, 2025

github-actions bot commented Feb 7, 2025

therealvio Feb 10, 2025 • edited Loading

Choose a reason for hiding this comment

liamstevens commented Jan 30, 2025 •

edited

Loading

therealvio Feb 3, 2025 •

edited

Loading

therealvio Feb 10, 2025 •

edited

Loading