Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HerokuRun improvements: thread safety, timeout support, status return #158

Closed
wants to merge 10 commits into from

Conversation

dzuelke
Copy link
Contributor

@dzuelke dzuelke commented Dec 3, 2020

why

As explained in #157, backticks followed by $? access aren't threadsafe, which may cause trouble when using run_multi. There also isn't a way for tests to get the exit status of a heroku run without using $?, which is an implementational detail internal to Hatchet due to the usage of backticks for shelling out.

We also want support for timeouts - sometimes, heroku runs can hang for a long time for various reasons.

thread safety fix

This new implementation uses Open3 to shell out to heroku run, and because it captures stderr separately, it paves a possible way for fixing #156 as well.

timeouts

A new :timeout argument, defaulting to 60 seconds, will terminate a hanging or slow heroku run, because sometimes dynos do not connect (and the Heroku CLI timeout is hardcoded to one hour), sometimes tests hang, and sometimes tests that are supposed to not boot a web server will suddenly boot a web server after all and then sit there forever.

The timeout functionality now successfully prevents flappy failures for the PHP buildpack:

[TEST GROUP 8] Running ./waitforit.sh 15 'ready for connections' heroku-php-apache2 --verbose on htcht-453589629-d79bb6e18f... starting, run.9601 (Standard-1X)
…
[TEST GROUP 8] Run run.9601 with command heroku run --app=htcht-453589629-d79bb6e18f --exit-code --env=WEB_CONCURRENCY\=22 -- ./waitforit.sh\ 15\ \'ready\ for\ connections\'\ heroku-php-apache2\ --verbose timed out after 60 seconds, stopping dyno and retrying...
[TEST GROUP 8] Running ./waitforit.sh 15 'ready for connections' heroku-php-apache2 --verbose on htcht-453589629-d79bb6e18f... starting, run.1440 (Standard-1X)
[TEST GROUP 8] Running ./waitforit.sh 15 'ready for connections' heroku-php-apache2 --verbose on htcht-453589629-d79bb6e18f... connecting, run.1440 (Standard-1X)
[TEST GROUP 8] Running ./waitforit.sh 15 'ready for connections' heroku-php-apache2 --verbose on htcht-453589629-d79bb6e18f... up, run.1440 (Standard-1X)
[TEST GROUP 8] .

retries

Both timeout and empty output retries are now implemented using exceptions rather than a loop.

Both will sleep for a default of one second if run_multi is off, preventing possible "can only run one free dyno" errors on retry.

dyno ID capture

From the now separately captured stderr, the dyno ID ("run.1234") is extracted and printed on retry, possibly aiding debugging.

We could also consider not printing stderr at all until an error has occured, resulting in much cleaner test output because all the heroku run etc output is gone.

backwards compatibility considerations

For what may be considered BC, we still set $? by running a quick kill or exit with the appropriate signal or code in backticks (depending on whether the program was signaled or not; this allows distinguishing between the two cases, as e.g. termsig=15 means the heroku run got killed, and exitstatus=143 means the dyno run was SIGTERMinated).

This is not threadsafe, but allows any tests that have relied on the previous internal implementation (implemented using backticks) to continue to work until they are migrated to an alternative way of getting the exit status of a heroku run (see next section).

Generally, breaking the use of $? in tests shouldn't be considered a critical BC break, as tests should not rely on internal implementation details of Hatchet, so this should be removed soon.

exit status

Right now, this alternative way of getting the exit status of a heroku run is an argument to run that changes its return value: if :return_obj => true is given as an option, the return value is the whole HerokuRun object rather than just the output string, allowing access to .output and .status.

I tried returning the object and giving the class to_s and to_str methods, but match() etc do not trigger those, so it would break basically every test out there.

alternatives

Alternatively, we could have new method that returns the whole object, and the old run just wraps that.

For reference, run_multi yields a block with output and status arguments. Maybe something similar to that?

tests

Tests will be added to this PR, I just wanted to get a general +1 first before I weave them in.

review

Please review this by also following the commits in chronological order, I've taken extra care to make the "journey" clear.

@dzuelke dzuelke force-pushed the timeouts-etc branch 2 times, most recently from 2bed4ba to 2b1ef66 Compare December 3, 2020 23:01
@dzuelke
Copy link
Contributor Author

dzuelke commented Dec 3, 2020

(current test failures are from an incorrectly mocked Process::Status object, will fix)

@dzuelke dzuelke force-pushed the timeouts-etc branch 2 times, most recently from d079c91 to 3f6c8cb Compare December 4, 2020 08:24
Copy link
Contributor

@schneems schneems left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of adding timeouts and the associated code that goes with them. I'm not hugely in favor of switching to Open3. It's a pretty unewildy change to also keep backwards compat with external usage of $?

@@ -44,22 +46,19 @@ def status
end

def call
loop do
begin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retry on error behavior is also provided by the gem rrrretry.

3.times.retry(HerokuRunEmptyOutputError) do  |i, ex|
  execute! 

  break unless ex
  break unless @retry_on_empty
  
  @empty_fail_count = i
  message = String.new("Empty output from command #{@command}, retrying the command.")
  message << "\nTo disable pass in `retry_on_empty: false` or set HATCHET_DISABLE_EMPTY_RUN_RETRY=1 globally"
  message << "\nfailed_count: #{@empty_fail_count}"
  message << "\nreleases: #{@app.releases}"
  message << "\n#{caller.join("\n")}"
end

I'm not sure honestly. All the code is kinda gnarly for handling this. I like raising an exception for empty output though. That seems good, and I'm assuming you're using it elsewhere.

# FIXME: usage of $? in tests is very likely not threadsafe, and does not allow a test to distinguish between a TERM of the 'heroku run' itself, or inside the dyno
# this should be part of a proper interface to the run result instead but that's a breaking change
# change app.run to return whole run object which has to_s and to_str for output?
`exit #{status}` # re-set $? for tests that rely on us previously having used backticks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit isn't needed, see my comment in #157. The $? is already thread local.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's thread-local, but in this example:

Hatchet::Runner.new("default_ruby", run_multi: true)
  app.run_multi("ls") do |out, status|
    expect(status.success?).to be_truthy
    expect(out).to include("Gemfile")
  end
  app.run_multi("ruby -v") do |out, status|
    expect(status.success?).to be_truthy
    expect(out).to include("ruby")
  end
  puts $?
end

Which of the two runs' $? will puts print?

else
# the dyno exited and the CLI is reporting that back as an exit status
`exit #{status.exitstatus}`
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer we don't move to Open3 here. I think timeout code is good, but the open command combinations are pretty unwieldy here, especially having to emulate the support for the prior $? is a lot.

I know later you talked about separating stdout from stderr as a goal. I think you also mentioned as an alternative to introduce a new interface. I think that would be my preference to have a top level App method return an object and let that object wrap all the logic to the call to run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But backticks can't capture stderr, so what do we do about that then?

We don't really have to emulate support for $?, that's always been an implementational detail that should never leak to calling code, but the result info must then be exposed in some other way, which is what :return_obj => true does, but I'm not sure I like that (changing return signature using an arg is very wrong IMO).

message << "\nOutput until moment of termination was: #{@output}"
message << "\nTo disable pass in `timeout: 0` or set HATCHET_DEFAULT_RUN_TIMEOUT=0 globally"
message << "\nfailed_count: #{@timeout_fail_count}"
message << "\nreleases: #{@app.releases}"
message << "\n#{caller.join("\n")}"
@stderr.puts message
@app.platform_api.dyno.stop(@app.name, @dyno_id) if @dyno_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting. Did you hit something where this was necessary or is it more of a "best practice" kind of thing? I'm guessing that the underlying command will continue to run and we don't want to have a ton of extra dynos hanging about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't know what exactly timed out. And if e.g. the command you're running by accident starts a web server that boots successfully, that one-off run would then remain running after we terminate it due to the timeout. So we want to shut it down.

@dzuelke dzuelke force-pushed the timeouts-etc branch 2 times, most recently from 7887763 to 53655a3 Compare February 24, 2021 16:51
schneems added a commit to heroku/cli that referenced this pull request May 21, 2021
The library (Hatchet) that buildpacks use to integration test against Heroku shells out to `heroku run` where we're wanting to be able to configure a timeout that's less than 60 minutes. Instead of working around this on the library side, I think it would make more sense to add a configurable timeout to the cli.

This PR against Hatchet explains the issue heroku/hatchet#158. 


This is a WIP as I'm not very familiar with the CLI or it's development workflow at all. I got as far as getting it to trigger an exception:

```
$ DEBUG=* ./bin/run run echo lol -a issuetriage -timeout 5
# ...
Running echo lol -timeout 5 on ⬢ issuetriage... done
  heroku:run RangeError [ERR_OUT_OF_RANGE] [ERR_OUT_OF_RANGE]: The value of "msecs" is out of range. It must be a non-negative finite number. Received NaN
  heroku:run     at new NodeError (node:internal/errors:329:5)
  heroku:run     at getTimerDuration (node:internal/timers:386:11)
  heroku:run     at TLSSocket.setStreamTimeout [as setTimeout] (node:internal/stream_base_commons:252:11)
  heroku:run     at /Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:126:15
  heroku:run     at new Promise (<anonymous>)
  heroku:run     at Dyno._rendezvous (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:117:16)
  heroku:run     at Dyno.attach (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:110:27)
  heroku:run     at Dyno._doStart (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:77:28)
  heroku:run     at processTicksAndRejections (node:internal/process/task_queues:94:5)
  heroku:run     at Dyno.start (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:54:9)
  heroku:run     at Run.run (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/commands/run/index.ts:36:13)
  heroku:run     at Run._run (/Users/rschneeman/Documents/projects/work/cli/node_modules/@oclif/command/lib/command.js:44:20)
  heroku:run     at Config.runCommand (/Users/rschneeman/Documents/projects/work/cli/node_modules/@oclif/config/lib/config.js:151:9)
  heroku:run     at Main.run (/Users/rschneeman/Documents/projects/work/cli/node_modules/@oclif/command/lib/main.js:21:9)
  heroku:run     at Main._run (/Users/rschneeman/Documents/projects/work/cli/node_modules/@oclif/command/lib/command.js:44:20) +0ms
RangeError [ERR_OUT_OF_RANGE] [ERR_OUT_OF_RANGE]: The value of "msecs" is out of range. It must be a non-negative finite number. Received NaN
    at new NodeError (node:internal/errors:329:5)
    at getTimerDuration (node:internal/timers:386:11)
    at TLSSocket.setStreamTimeout [as setTimeout] (node:internal/stream_base_commons:252:11)
    at ~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:126:15
    at new Promise (<anonymous>)
    at Dyno._rendezvous (~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:117:16)
    at Dyno.attach (~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:110:27)
    at Dyno._doStart (~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:77:28)
    at processTicksAndRejections (node:internal/process/task_queues:94:5)
    at Dyno.start (~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:54:9)
    at Run.run (~/Documents/projects/work/cli/packages/run/src/commands/run/index.ts:36:13)
    at Run._run (~/Documents/projects/work/cli/node_modules/@oclif/command/lib/command.js:44:20)
    at Config.runCommand (~/Documents/projects/work/cli/node_modules/@oclif/config/lib/config.js:151:9)
    at Main.run (~/Documents/projects/work/cli/node_modules/@oclif/command/lib/main.js:21:9)
    at Main._run (~/Documents/projects/work/cli/node_modules/@oclif/command/lib/command.js:44:20)
⛄ 3.0.1 🚀  ~/Documents/projects/work/cli (schneems/timeout-lime-out)
$ gitx .
⛄ 3.0.1 🚀  ~/Documents/projects/work/cli (schneems/timeout-lime-out)
```
@dzuelke dzuelke force-pushed the timeouts-etc branch 2 times, most recently from 3633ae4 to 8fe75e6 Compare January 24, 2023 21:37
@schneems
Copy link
Contributor

A lot has changed, if we still need/want something like this I think we need to start fresh.

@schneems schneems closed this Mar 13, 2023
@edmorley
Copy link
Member

edmorley commented Mar 13, 2023

I think this PR is also a good example of why splitting changes up into smaller pieces (even if it creates additional churn/intermediate steps) is worth it. Had some of these changes been separate PRs (and not just separate commits), then they could have merged 18 months ago. Yes, some of the changes are dependant on each other, but not all of them. (And even for the changes that are currently dependent on each other, I suspect there are ways to refactor to avoid that.)

@dzuelke
Copy link
Contributor Author

dzuelke commented Mar 16, 2023

Yup, can definitely be split into two, maybe three PRs. Since the PHP buildpack relies on this branch, I'll get to cleaning it up and opening those PRs sometime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants