upgrade testing: make script error handling more robust #25152

tgross · 2025-02-18T21:58:58Z

We're using `set -eo pipefail` everywhere in the Enos scripts, but several of the scripts used for checking assertions didn't take advantage of pipefail in such a way that we could avoid early exits from transient errors. This meant that if a server was slightly late to come back up, we'd hit an error and exit the whole script instead of polling as expected.

While fixing this, I've made a number of other improvements to the shell scripts:

* I've changed the design of the polling loops so that we're calling a function that returns an exit code and sets a `last_error` value, along with any global variables required by downstream functions. This makes the loops more readable by reducing the number of global variables, and helped identify some places where we were exiting instead of returning into the loop.
* Using `shellcheck -s bash`, I fixed some unused and undefined variables that we were missing because they were only used on the error paths.
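For readers who haven't seen the scripts, the polling-loop shape described above might look roughly like the sketch below. This is an illustration only, not code from the Enos scripts: `fake_query`, `check_servers`, `servers_ready`, `MAX_ATTEMPTS`, and `POLL_INTERVAL` are hypothetical names, and the stand-in query simply fails twice to simulate a server that is slow to come back up. Only `last_error` and `set -eo pipefail` come from the description.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical settings for illustration; not taken from the real scripts.
MAX_ATTEMPTS=5
POLL_INTERVAL=0   # seconds between attempts (0 keeps the demo fast)

last_error=       # set by the check function on failure
servers_ready=    # global consumed by downstream functions
attempt=0

# Stand-in for a real query. It fails on the first two attempts to
# simulate a transient error, then reports 3 servers.
fake_query() {
    if [ "$attempt" -lt 3 ]; then
        echo "connection refused" >&2
        return 1
    fi
    echo "3"
}

# Returns 0 on success and sets servers_ready; returns 1 and sets
# last_error on failure. It never calls exit, so the caller decides
# whether to retry.
check_servers() {
    local out
    # The `||` branch keeps `set -e` from exiting the script on failure.
    out=$(fake_query 2>&1) || {
        last_error="query failed: $out"
        return 1
    }
    if [ "$out" -ne 3 ]; then
        last_error="expected 3 servers, found $out"
        return 1
    fi
    servers_ready=$out
}

while true; do
    attempt=$((attempt + 1))
    if check_servers; then
        break
    fi
    if [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then
        echo "timed out: $last_error" >&2
        exit 1
    fi
    sleep "$POLL_INTERVAL"
done

echo "servers ready: $servers_ready after $attempt attempts"
```

The point of the shape is that only the loop exits the script, and only after the retry budget is spent; the check function just reports failure and records why in `last_error`.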

Ref: https://hashicorp.atlassian.net/browse/NET-11546

Juanadelacuesta

LGTM :)

While testing #25172, I found a few spots where #25152 wasn't capturing the errors from transient failures correctly, or was exiting early instead of retrying. Ref: https://hashicorp.atlassian.net/browse/NET-11546

tgross added the theme/testing Test related issues label Feb 18, 2025

tgross added this to the 1.10.0 milestone Feb 18, 2025

tgross force-pushed the enos-script-error-handling branch from 32f1580 to d0a749f Compare February 19, 2025 19:16

vercel bot deployed to Preview – nomad-ui February 19, 2025 19:17

tgross force-pushed the enos-script-error-handling branch from d0a749f to 1891229 Compare February 19, 2025 19:47

vercel bot deployed to Preview – nomad-ui February 19, 2025 19:49

tgross marked this pull request as ready for review February 19, 2025 19:49

tgross requested review from a team as code owners February 19, 2025 19:49

tgross requested a review from Juanadelacuesta February 19, 2025 19:49

Juanadelacuesta approved these changes Feb 20, 2025

tgross merged commit 73cd934 into main Feb 20, 2025
31 checks passed

tgross deleted the enos-script-error-handling branch February 20, 2025 13:44

tgross mentioned this pull request Feb 21, 2025

upgrade testing: make sure we capture last error if not exiting #25186

Open
