[artifacts] Improve cloud deployment error handling #137775

Merged: 6 commits, Sep 6, 2022

Member

@jbudz jbudz commented Aug 1, 2022

This updates Cloud tests to:

  • soft_fail only on deployment creation (and not functional tests, for example)
  • skip retries if there's a deployment creation error
  • re-use the previously created docker build if we do retry
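
The first bullet can be pictured with a minimal Buildkite step sketch. This is an illustration only, not the pipeline from this PR: the label and the choice of 255 as the exit status reserved for deployment creation failures are assumptions, and only the script path comes from the commit message.

```yaml
steps:
  - label: "Cloud Deployment"   # hypothetical label
    command: .buildkite/scripts/steps/artifacts/cloud.sh
    soft_fail:
      # Only this exit status is treated as a soft failure. Any other
      # non-zero status (for example a functional test failure) still
      # fails the step. 255 is an assumed value for this sketch.
      - exit_status: 255
```

With a list under `soft_fail`, Buildkite soft-fails only on the listed statuses, so the script has to exit with the reserved status when deployment creation fails and with a different status for everything else.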

I plan to follow up by updating Slack notifications to distinguish soft failures. Open to ideas on how to improve the UX further.

Failure: https://buildkite.com/elastic/kibana-artifacts-snapshot/builds/574
Success: https://buildkite.com/elastic/kibana-artifacts-snapshot/builds/575

@jbudz jbudz added the Team:Operations, release_note:skip, and skip-ci labels Aug 1, 2022
@jbudz jbudz closed this Aug 16, 2022
@jbudz jbudz reopened this Aug 31, 2022
@jbudz jbudz force-pushed the artifacts/exit-status branch from 08b6a37 to a5ee12b Compare August 31, 2022 18:18
@jbudz jbudz added the ci:skip-when-possible label and removed the skip-ci label Aug 31, 2022
@jbudz jbudz marked this pull request as ready for review August 31, 2022 19:30
@jbudz jbudz requested a review from a team as a code owner August 31, 2022 19:30
@elasticmachine
Contributor

Pinging @elastic/kibana-operations (Team:Operations)

@jbudz jbudz added the backport:prev-minor label Aug 31, 2022
Contributor

@spalger spalger left a comment

One nit, but LGTM

agents:
  queue: n2-2
timeout_in_minutes: 30
if: "build.env('RELEASE_BUILD') == null || build.env('RELEASE_BUILD') == '' || build.env('RELEASE_BUILD') == 'false'"
retry:
  automatic:
    # Matches buildkite forced agent shutdown (timeout_in_minutes) and ecctl create failures
Contributor

Note that you'll only get a 255 if the job times out AND the job gracefully stops. If the job has to be hard-killed because it's taking too long to exit after the timeout, you'll get a -1.

Member Author

@jbudz jbudz Sep 1, 2022

Hmm, -1 could also be a pre-emption hard-kill, right?

Contributor

Yep, anything where the job or agent doesn't shut down gracefully: a problem with the GCP instance, an OOM kill, etc.
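
To make the distinction between the two exit statuses concrete, here is a rough sketch of an automatic retry rule keyed on exit status. It is not the configuration pushed in this PR. The `limit` value and the decision to list only -1 are assumptions for illustration.

```yaml
retry:
  automatic:
    # -1: the job or agent did not shut down gracefully (hard kill after a
    # timeout, instance problem, OOM kill, ...).
    - exit_status: -1
      limit: 3   # assumed value
    # 255 (timeout followed by a graceful stop) is deliberately not listed
    # in this sketch, so Buildkite would not retry it automatically.
```

Because automatic retries apply only to the statuses listed, leaving a status out is how a step opts out of retrying that class of failure.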

Member Author

Oh, this isn't using spot instances.

Member Author

pushed b774d16

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@jbudz
Member Author

jbudz commented Sep 6, 2022

@jbudz jbudz merged commit 874e4cb into main Sep 6, 2022
@jbudz jbudz deleted the artifacts/exit-status branch September 6, 2022 17:43
@kibanamachine
Contributor

💔 All backports failed

Status Branch Result
8.4 Backport failed because of merge conflicts

Manual backport

To create the backport manually run:

node scripts/backport --pr 137775

Questions ?

Please refer to the Backport tool documentation

jbudz added a commit that referenced this pull request Sep 6, 2022
* [artifacts] Improve cloud deployment error handling

* Update .buildkite/scripts/steps/artifacts/cloud.sh

Co-authored-by: Spencer <[email protected]>

* update retry codes

Co-authored-by: Spencer <[email protected]>
@jbudz jbudz added the v8.4.0 label Sep 6, 2022
jbudz added a commit that referenced this pull request Sep 6, 2022
* [artifacts] Improve cloud deployment error handling

* Update .buildkite/scripts/steps/artifacts/cloud.sh

Co-authored-by: Spencer <[email protected]>

* update retry codes

Co-authored-by: Spencer <[email protected]>

Co-authored-by: Spencer <[email protected]>
Labels
backport:prev-minor - Backport to (9.0) the previous minor version (i.e. one version back from main)
ci:skip-when-possible - (Deprecated) no-op, managed automatically
release_note:skip - Skip the PR/issue when compiling release notes
Team:Operations - Team label for Operations Team
v8.4.0
v8.5.0
6 participants