HerokuRun improvements: thread safety, timeout support, status return #158

dzuelke · 2020-12-03T20:00:50Z

why

As explained in #157, backticks followed by $? access aren't threadsafe, which may cause trouble when using run_multi. There also isn't a way for tests to get the exit status of a heroku run without using $?, which is an implementational detail internal to Hatchet due to the usage of backticks for shelling out.

We also want support for timeouts - sometimes, heroku runs can hang for a long time for various reasons.

thread safety fix

This new implementation uses Open3 to shell out to heroku run, and because it captures stderr separately, it paves a possible way for fixing #156 as well.

timeouts

A new :timeout argument, defaulting to 60 seconds, will terminate a hanging or slow heroku run, because sometimes dynos do not connect (and the Heroku CLI timeout is hardcoded to one hour), sometimes tests hang, and sometimes tests that are supposed to not boot a web server will suddenly boot a web server after all and then sit there forever.

The timeout functionality now successfully prevents flappy failures for the PHP buildpack:

[TEST GROUP 8] Running ./waitforit.sh 15 'ready for connections' heroku-php-apache2 --verbose on htcht-453589629-d79bb6e18f... starting, run.9601 (Standard-1X)
…
[TEST GROUP 8] Run run.9601 with command heroku run --app=htcht-453589629-d79bb6e18f --exit-code --env=WEB_CONCURRENCY\=22 -- ./waitforit.sh\ 15\ \'ready\ for\ connections\'\ heroku-php-apache2\ --verbose timed out after 60 seconds, stopping dyno and retrying...
[TEST GROUP 8] Running ./waitforit.sh 15 'ready for connections' heroku-php-apache2 --verbose on htcht-453589629-d79bb6e18f... starting, run.1440 (Standard-1X)
[TEST GROUP 8] Running ./waitforit.sh 15 'ready for connections' heroku-php-apache2 --verbose on htcht-453589629-d79bb6e18f... connecting, run.1440 (Standard-1X)
[TEST GROUP 8] Running ./waitforit.sh 15 'ready for connections' heroku-php-apache2 --verbose on htcht-453589629-d79bb6e18f... up, run.1440 (Standard-1X)
[TEST GROUP 8] .

retries

Both timeout and empty output retries are now implemented using exceptions rather than a loop.

Both will sleep for a default of one second if run_multi is off, preventing possible "can only run one free dyno" errors on retry.

dyno ID capture

From the now separately captured stderr, the dyno ID ("run.1234") is extracted and printed on retry, possibly aiding debugging.

We could also consider not printing stderr at all until an error has occured, resulting in much cleaner test output because all the heroku run etc output is gone.

backwards compatibility considerations

For what may be considered BC, we still set $? by running a quick kill or exit with the appropriate signal or code in backticks (depending on whether the program was signaled or not; this allows distinguishing between the two cases, as e.g. termsig=15 means the heroku run got killed, and exitstatus=143 means the dyno run was SIGTERMinated).

This is not threadsafe, but allows any tests that have relied on the previous internal implementation (implemented using backticks) to continue to work until they are migrated to an alternative way of getting the exit status of a heroku run (see next section).

Generally, breaking the use of $? in tests shouldn't be considered a critical BC break, as tests should not rely on internal implementation details of Hatchet, so this should be removed soon.

exit status

Right now, this alternative way of getting the exit status of a heroku run is an argument to run that changes its return value: if :return_obj => true is given as an option, the return value is the whole HerokuRun object rather than just the output string, allowing access to .output and .status.

I tried returning the object and giving the class to_s and to_str methods, but match() etc do not trigger those, so it would break basically every test out there.

alternatives

Alternatively, we could have new method that returns the whole object, and the old run just wraps that.

For reference, run_multi yields a block with output and status arguments. Maybe something similar to that?

tests

Tests will be added to this PR, I just wanted to get a general +1 first before I weave them in.

review

Please review this by also following the commits in chronological order, I've taken extra care to make the "journey" clear.

dzuelke · 2020-12-03T23:01:50Z

(current test failures are from an incorrectly mocked Process::Status object, will fix)

schneems

I'm in favor of adding timeouts and the associated code that goes with them. I'm not hugely in favor of switching to Open3. It's a pretty unewildy change to also keep backwards compat with external usage of $?

schneems · 2021-01-13T21:40:26Z

lib/hatchet/heroku_run.rb

@@ -44,22 +46,19 @@ def status
    end

    def call
-      loop do
+      begin


Retry on error behavior is also provided by the gem rrrretry.

3.times.retry(HerokuRunEmptyOutputError) do |i, ex| execute! break unless ex break unless @retry_on_empty @empty_fail_count = i message = String.new("Empty output from command #{@command}, retrying the command.") message << "\nTo disable pass in `retry_on_empty: false` or set HATCHET_DISABLE_EMPTY_RUN_RETRY=1 globally" message << "\nfailed_count: #{@empty_fail_count}" message << "\nreleases: #{@app.releases}" message << "\n#{caller.join("\n")}" end

I'm not sure honestly. All the code is kinda gnarly for handling this. I like raising an exception for empty output though. That seems good, and I'm assuming you're using it elsewhere.

schneems · 2021-01-13T21:41:30Z

lib/hatchet/heroku_run.rb

+      # FIXME: usage of $? in tests is very likely not threadsafe, and does not allow a test to distinguish between a TERM of the 'heroku run' itself, or inside the dyno
+      # this should be part of a proper interface to the run result instead but that's a breaking change
+      # change app.run to return whole run object which has to_s and to_str for output?
+      `exit #{status}` # re-set $? for tests that rely on us previously having used backticks


This commit isn't needed, see my comment in #157. The $? is already thread local.

It's thread-local, but in this example:

Hatchet::Runner.new("default_ruby", run_multi: true) app.run_multi("ls") do |out, status| expect(status.success?).to be_truthy expect(out).to include("Gemfile") end app.run_multi("ruby -v") do |out, status| expect(status.success?).to be_truthy expect(out).to include("ruby") end puts $? end

Which of the two runs' $? will puts print?

lib/hatchet/heroku_run.rb

schneems · 2021-01-13T21:48:06Z

lib/hatchet/heroku_run.rb

+      else
+        # the dyno exited and the CLI is reporting that back as an exit status
+        `exit #{status.exitstatus}`
+      end


I would prefer we don't move to Open3 here. I think timeout code is good, but the open command combinations are pretty unwieldy here, especially having to emulate the support for the prior $? is a lot.

I know later you talked about separating stdout from stderr as a goal. I think you also mentioned as an alternative to introduce a new interface. I think that would be my preference to have a top level App method return an object and let that object wrap all the logic to the call to run.

But backticks can't capture stderr, so what do we do about that then?

We don't really have to emulate support for $?, that's always been an implementational detail that should never leak to calling code, but the result info must then be exposed in some other way, which is what :return_obj => true does, but I'm not sure I like that (changing return signature using an arg is very wrong IMO).

lib/hatchet/heroku_run.rb

schneems · 2021-01-13T21:52:07Z

lib/hatchet/heroku_run.rb

          message << "\nOutput until moment of termination was: #{@output}"
          message << "\nTo disable pass in `timeout: 0` or set HATCHET_DEFAULT_RUN_TIMEOUT=0 globally"
          message << "\nfailed_count: #{@timeout_fail_count}"
          message << "\nreleases: #{@app.releases}"
          message << "\n#{caller.join("\n")}"
          @stderr.puts message
+          @app.platform_api.dyno.stop(@app.name, @dyno_id) if @dyno_id


This is interesting. Did you hit something where this was necessary or is it more of a "best practice" kind of thing? I'm guessing that the underlying command will continue to run and we don't want to have a ton of extra dynos hanging about.

We don't know what exactly timed out. And if e.g. the command you're running by accident starts a web server that boots successfully, that one-off run would then remain running after we terminate it due to the timeout. So we want to shut it down.

The library (Hatchet) that buildpacks use to integration test against Heroku shells out to `heroku run` where we're wanting to be able to configure a timeout that's less than 60 minutes. Instead of working around this on the library side, I think it would make more sense to add a configurable timeout to the cli. This PR against Hatchet explains the issue heroku/hatchet#158. This is a WIP as I'm not very familiar with the CLI or it's development workflow at all. I got as far as getting it to trigger an exception: ``` $ DEBUG=* ./bin/run run echo lol -a issuetriage -timeout 5 # ... Running echo lol -timeout 5 on ⬢ issuetriage... done heroku:run RangeError [ERR_OUT_OF_RANGE] [ERR_OUT_OF_RANGE]: The value of "msecs" is out of range. It must be a non-negative finite number. Received NaN heroku:run at new NodeError (node:internal/errors:329:5) heroku:run at getTimerDuration (node:internal/timers:386:11) heroku:run at TLSSocket.setStreamTimeout [as setTimeout] (node:internal/stream_base_commons:252:11) heroku:run at /Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:126:15 heroku:run at new Promise (<anonymous>) heroku:run at Dyno._rendezvous (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:117:16) heroku:run at Dyno.attach (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:110:27) heroku:run at Dyno._doStart (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:77:28) heroku:run at processTicksAndRejections (node:internal/process/task_queues:94:5) heroku:run at Dyno.start (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:54:9) heroku:run at Run.run (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/commands/run/index.ts:36:13) heroku:run at Run._run (/Users/rschneeman/Documents/projects/work/cli/node_modules/@oclif/command/lib/command.js:44:20) heroku:run at Config.runCommand (/Users/rschneeman/Documents/projects/work/cli/node_modules/@oclif/config/lib/config.js:151:9) heroku:run at Main.run (/Users/rschneeman/Documents/projects/work/cli/node_modules/@oclif/command/lib/main.js:21:9) heroku:run at Main._run (/Users/rschneeman/Documents/projects/work/cli/node_modules/@oclif/command/lib/command.js:44:20) +0ms RangeError [ERR_OUT_OF_RANGE] [ERR_OUT_OF_RANGE]: The value of "msecs" is out of range. It must be a non-negative finite number. Received NaN at new NodeError (node:internal/errors:329:5) at getTimerDuration (node:internal/timers:386:11) at TLSSocket.setStreamTimeout [as setTimeout] (node:internal/stream_base_commons:252:11) at ~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:126:15 at new Promise (<anonymous>) at Dyno._rendezvous (~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:117:16) at Dyno.attach (~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:110:27) at Dyno._doStart (~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:77:28) at processTicksAndRejections (node:internal/process/task_queues:94:5) at Dyno.start (~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:54:9) at Run.run (~/Documents/projects/work/cli/packages/run/src/commands/run/index.ts:36:13) at Run._run (~/Documents/projects/work/cli/node_modules/@oclif/command/lib/command.js:44:20) at Config.runCommand (~/Documents/projects/work/cli/node_modules/@oclif/config/lib/config.js:151:9) at Main.run (~/Documents/projects/work/cli/node_modules/@oclif/command/lib/main.js:21:9) at Main._run (~/Documents/projects/work/cli/node_modules/@oclif/command/lib/command.js:44:20) ⛄ 3.0.1 🚀 ~/Documents/projects/work/cli (schneems/timeout-lime-out) $ gitx . ⛄ 3.0.1 🚀 ~/Documents/projects/work/cli (schneems/timeout-lime-out) ```

via option :return_obj => true Allows accessing e.g. process status ($? is not threadsafe, and a leaky detail of the previous underlying implementation via backticks)

…tain info as well

schneems · 2023-03-13T15:19:09Z

A lot has changed, if we still need/want something like this I think we need to start fresh.

edmorley · 2023-03-13T16:41:05Z

I think this PR is also a good example of why splitting changes up into smaller pieces (even if it creates additional churn/intermediate steps) is worth it. Had some of these changes been separate PRs (and not just separate commits), then they could have merged 18 months ago. Yes, some of the changes are dependant on each other, but not all of them. (And even for the changes that are currently dependent on each other, I suspect there are ways to refactor to avoid that.)

dzuelke · 2023-03-16T10:37:42Z

Yup, can definitely be split into two, maybe three PRs. Since the PHP buildpack relies on this branch, I'll get to cleaning it up and opening those PRs sometime.

dzuelke force-pushed the timeouts-etc branch 2 times, most recently from 2bed4ba to 2b1ef66 Compare December 3, 2020 23:01

dzuelke force-pushed the timeouts-etc branch 2 times, most recently from d079c91 to 3f6c8cb Compare December 4, 2020 08:24

schneems requested changes Jan 13, 2021

View reviewed changes

dzuelke mentioned this pull request Feb 12, 2021

HerokuRun#execute! throttling does not throttle #156

Closed

dzuelke force-pushed the timeouts-etc branch from 0055c17 to 76eebfb Compare February 17, 2021 14:33

dzuelke force-pushed the timeouts-etc branch 2 times, most recently from 7887763 to 53655a3 Compare February 24, 2021 16:51

schneems mentioned this pull request May 21, 2021

WIP Add timeout to heroku run heroku/cli#1814

Closed

dzuelke mentioned this pull request May 28, 2021

[close #156] Capture stderr from heroku run #187

Merged

dzuelke added 2 commits January 24, 2023 19:59

Use exceptions for empty output retries in preparation of related work

6dcf6a0

move to Open3.popen3 in preparation for timeout handling

eaa3eff

dzuelke force-pushed the timeouts-etc branch 2 times, most recently from 3633ae4 to 8fe75e6 Compare January 24, 2023 21:37

dzuelke added 8 commits January 25, 2023 00:50

Timeout support for 'heroku run'

8fa0e45

Delay for empty/timeout retries

7b72315

NEEDS REORDER, EMPTY: Record and print dyno id on retry

83d6744

terminate dyno via API on timeout

81c6fde

Allow returning entire run object from App#run, refs #157

04ae307

via option :return_obj => true Allows accessing e.g. process status ($? is not threadsafe, and a leaky detail of the previous underlying implementation via backticks)

(needs test): do not forget to raise if we are out of retries

207cf12

refactor empty/timeout retry raising so that re-raised exceptions con…

50d0168

…tain info as well

adjust tests to ensure empty output raises eventually

661bcd6

dzuelke force-pushed the timeouts-etc branch from f26a707 to 661bcd6 Compare January 24, 2023 23:54

schneems closed this Mar 13, 2023

schneems mentioned this pull request Jun 21, 2024

Find a way to get back on main for heroku_hatchet heroku/heroku-buildpack-php#734

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HerokuRun improvements: thread safety, timeout support, status return #158

HerokuRun improvements: thread safety, timeout support, status return #158

dzuelke commented Dec 3, 2020 •

edited

Loading

dzuelke commented Dec 3, 2020

schneems left a comment •

edited

Loading

schneems Jan 13, 2021

schneems Jan 13, 2021

dzuelke Jan 14, 2021

schneems Jan 13, 2021

dzuelke Jan 14, 2021

schneems Jan 13, 2021

dzuelke Jan 14, 2021

schneems commented Mar 13, 2023

edmorley commented Mar 13, 2023 •

edited

Loading

dzuelke commented Mar 16, 2023

HerokuRun improvements: thread safety, timeout support, status return #158

HerokuRun improvements: thread safety, timeout support, status return #158

Conversation

dzuelke commented Dec 3, 2020 • edited Loading

why

thread safety fix

timeouts

retries

dyno ID capture

backwards compatibility considerations

exit status

alternatives

tests

review

dzuelke commented Dec 3, 2020

schneems left a comment • edited Loading

Choose a reason for hiding this comment

schneems Jan 13, 2021

Choose a reason for hiding this comment

schneems Jan 13, 2021

Choose a reason for hiding this comment

dzuelke Jan 14, 2021

Choose a reason for hiding this comment

schneems Jan 13, 2021

Choose a reason for hiding this comment

dzuelke Jan 14, 2021

Choose a reason for hiding this comment

schneems Jan 13, 2021

Choose a reason for hiding this comment

dzuelke Jan 14, 2021

Choose a reason for hiding this comment

schneems commented Mar 13, 2023

edmorley commented Mar 13, 2023 • edited Loading

dzuelke commented Mar 16, 2023

dzuelke commented Dec 3, 2020 •

edited

Loading

schneems left a comment •

edited

Loading

edmorley commented Mar 13, 2023 •

edited

Loading