-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HerokuRun improvements: thread safety, timeout support, status return #158
Conversation
2bed4ba
to
2b1ef66
Compare
(current test failures are from an incorrectly mocked |
d079c91
to
3f6c8cb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in favor of adding timeouts and the associated code that goes with them. I'm not hugely in favor of switching to Open3. It's a pretty unewildy change to also keep backwards compat with external usage of $?
@@ -44,22 +46,19 @@ def status | |||
end | |||
|
|||
def call | |||
loop do | |||
begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retry on error behavior is also provided by the gem rrrretry
.
3.times.retry(HerokuRunEmptyOutputError) do |i, ex|
execute!
break unless ex
break unless @retry_on_empty
@empty_fail_count = i
message = String.new("Empty output from command #{@command}, retrying the command.")
message << "\nTo disable pass in `retry_on_empty: false` or set HATCHET_DISABLE_EMPTY_RUN_RETRY=1 globally"
message << "\nfailed_count: #{@empty_fail_count}"
message << "\nreleases: #{@app.releases}"
message << "\n#{caller.join("\n")}"
end
I'm not sure honestly. All the code is kinda gnarly for handling this. I like raising an exception for empty output though. That seems good, and I'm assuming you're using it elsewhere.
lib/hatchet/heroku_run.rb
Outdated
# FIXME: usage of $? in tests is very likely not threadsafe, and does not allow a test to distinguish between a TERM of the 'heroku run' itself, or inside the dyno | ||
# this should be part of a proper interface to the run result instead but that's a breaking change | ||
# change app.run to return whole run object which has to_s and to_str for output? | ||
`exit #{status}` # re-set $? for tests that rely on us previously having used backticks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commit isn't needed, see my comment in #157. The $?
is already thread local.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's thread-local, but in this example:
Hatchet::Runner.new("default_ruby", run_multi: true)
app.run_multi("ls") do |out, status|
expect(status.success?).to be_truthy
expect(out).to include("Gemfile")
end
app.run_multi("ruby -v") do |out, status|
expect(status.success?).to be_truthy
expect(out).to include("ruby")
end
puts $?
end
Which of the two runs' $?
will puts
print?
lib/hatchet/heroku_run.rb
Outdated
else | ||
# the dyno exited and the CLI is reporting that back as an exit status | ||
`exit #{status.exitstatus}` | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer we don't move to Open3 here. I think timeout code is good, but the open command combinations are pretty unwieldy here, especially having to emulate the support for the prior $?
is a lot.
I know later you talked about separating stdout from stderr as a goal. I think you also mentioned as an alternative to introduce a new interface. I think that would be my preference to have a top level App
method return an object and let that object wrap all the logic to the call to run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But backticks can't capture stderr, so what do we do about that then?
We don't really have to emulate support for $?
, that's always been an implementational detail that should never leak to calling code, but the result info must then be exposed in some other way, which is what :return_obj => true
does, but I'm not sure I like that (changing return signature using an arg is very wrong IMO).
lib/hatchet/heroku_run.rb
Outdated
message << "\nOutput until moment of termination was: #{@output}" | ||
message << "\nTo disable pass in `timeout: 0` or set HATCHET_DEFAULT_RUN_TIMEOUT=0 globally" | ||
message << "\nfailed_count: #{@timeout_fail_count}" | ||
message << "\nreleases: #{@app.releases}" | ||
message << "\n#{caller.join("\n")}" | ||
@stderr.puts message | ||
@app.platform_api.dyno.stop(@app.name, @dyno_id) if @dyno_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting. Did you hit something where this was necessary or is it more of a "best practice" kind of thing? I'm guessing that the underlying command will continue to run and we don't want to have a ton of extra dynos hanging about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't know what exactly timed out. And if e.g. the command you're running by accident starts a web server that boots successfully, that one-off run would then remain running after we terminate it due to the timeout. So we want to shut it down.
0055c17
to
76eebfb
Compare
7887763
to
53655a3
Compare
The library (Hatchet) that buildpacks use to integration test against Heroku shells out to `heroku run` where we're wanting to be able to configure a timeout that's less than 60 minutes. Instead of working around this on the library side, I think it would make more sense to add a configurable timeout to the cli. This PR against Hatchet explains the issue heroku/hatchet#158. This is a WIP as I'm not very familiar with the CLI or it's development workflow at all. I got as far as getting it to trigger an exception: ``` $ DEBUG=* ./bin/run run echo lol -a issuetriage -timeout 5 # ... Running echo lol -timeout 5 on ⬢ issuetriage... done heroku:run RangeError [ERR_OUT_OF_RANGE] [ERR_OUT_OF_RANGE]: The value of "msecs" is out of range. It must be a non-negative finite number. Received NaN heroku:run at new NodeError (node:internal/errors:329:5) heroku:run at getTimerDuration (node:internal/timers:386:11) heroku:run at TLSSocket.setStreamTimeout [as setTimeout] (node:internal/stream_base_commons:252:11) heroku:run at /Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:126:15 heroku:run at new Promise (<anonymous>) heroku:run at Dyno._rendezvous (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:117:16) heroku:run at Dyno.attach (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:110:27) heroku:run at Dyno._doStart (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:77:28) heroku:run at processTicksAndRejections (node:internal/process/task_queues:94:5) heroku:run at Dyno.start (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:54:9) heroku:run at Run.run (/Users/rschneeman/Documents/projects/work/cli/packages/run/src/commands/run/index.ts:36:13) heroku:run at Run._run (/Users/rschneeman/Documents/projects/work/cli/node_modules/@oclif/command/lib/command.js:44:20) heroku:run at Config.runCommand (/Users/rschneeman/Documents/projects/work/cli/node_modules/@oclif/config/lib/config.js:151:9) heroku:run at Main.run (/Users/rschneeman/Documents/projects/work/cli/node_modules/@oclif/command/lib/main.js:21:9) heroku:run at Main._run (/Users/rschneeman/Documents/projects/work/cli/node_modules/@oclif/command/lib/command.js:44:20) +0ms RangeError [ERR_OUT_OF_RANGE] [ERR_OUT_OF_RANGE]: The value of "msecs" is out of range. It must be a non-negative finite number. Received NaN at new NodeError (node:internal/errors:329:5) at getTimerDuration (node:internal/timers:386:11) at TLSSocket.setStreamTimeout [as setTimeout] (node:internal/stream_base_commons:252:11) at ~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:126:15 at new Promise (<anonymous>) at Dyno._rendezvous (~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:117:16) at Dyno.attach (~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:110:27) at Dyno._doStart (~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:77:28) at processTicksAndRejections (node:internal/process/task_queues:94:5) at Dyno.start (~/Documents/projects/work/cli/packages/run/src/lib/dyno.ts:54:9) at Run.run (~/Documents/projects/work/cli/packages/run/src/commands/run/index.ts:36:13) at Run._run (~/Documents/projects/work/cli/node_modules/@oclif/command/lib/command.js:44:20) at Config.runCommand (~/Documents/projects/work/cli/node_modules/@oclif/config/lib/config.js:151:9) at Main.run (~/Documents/projects/work/cli/node_modules/@oclif/command/lib/main.js:21:9) at Main._run (~/Documents/projects/work/cli/node_modules/@oclif/command/lib/command.js:44:20) ⛄ 3.0.1 🚀 ~/Documents/projects/work/cli (schneems/timeout-lime-out) $ gitx . ⛄ 3.0.1 🚀 ~/Documents/projects/work/cli (schneems/timeout-lime-out) ```
3633ae4
to
8fe75e6
Compare
via option :return_obj => true Allows accessing e.g. process status ($? is not threadsafe, and a leaky detail of the previous underlying implementation via backticks)
…tain info as well
A lot has changed, if we still need/want something like this I think we need to start fresh. |
I think this PR is also a good example of why splitting changes up into smaller pieces (even if it creates additional churn/intermediate steps) is worth it. Had some of these changes been separate PRs (and not just separate commits), then they could have merged 18 months ago. Yes, some of the changes are dependant on each other, but not all of them. (And even for the changes that are currently dependent on each other, I suspect there are ways to refactor to avoid that.) |
Yup, can definitely be split into two, maybe three PRs. Since the PHP buildpack relies on this branch, I'll get to cleaning it up and opening those PRs sometime. |
why
As explained in #157, backticks followed by
$?
access aren't threadsafe, which may cause trouble when usingrun_multi
. There also isn't a way for tests to get the exit status of aheroku run
without using$?
, which is an implementational detail internal to Hatchet due to the usage of backticks for shelling out.We also want support for timeouts - sometimes,
heroku run
s can hang for a long time for various reasons.thread safety fix
This new implementation uses Open3 to shell out to
heroku run
, and because it captures stderr separately, it paves a possible way for fixing #156 as well.timeouts
A new
:timeout
argument, defaulting to 60 seconds, will terminate a hanging or slowheroku run
, because sometimes dynos do not connect (and the Heroku CLI timeout is hardcoded to one hour), sometimes tests hang, and sometimes tests that are supposed to not boot a web server will suddenly boot a web server after all and then sit there forever.The timeout functionality now successfully prevents flappy failures for the PHP buildpack:
retries
Both timeout and empty output retries are now implemented using exceptions rather than a loop.
Both will sleep for a default of one second if
run_multi
is off, preventing possible "can only run one free dyno" errors on retry.dyno ID capture
From the now separately captured stderr, the dyno ID ("
run.1234
") is extracted and printed on retry, possibly aiding debugging.We could also consider not printing stderr at all until an error has occured, resulting in much cleaner test output because all the
heroku run
etc output is gone.backwards compatibility considerations
For what may be considered BC, we still set
$?
by running a quickkill
orexit
with the appropriate signal or code in backticks (depending on whether the program was signaled or not; this allows distinguishing between the two cases, as e.g.termsig
=15 means theheroku run
got killed, andexitstatus=143
means the dyno run wasSIGTERM
inated).This is not threadsafe, but allows any tests that have relied on the previous internal implementation (implemented using backticks) to continue to work until they are migrated to an alternative way of getting the exit status of a
heroku run
(see next section).Generally, breaking the use of
$?
in tests shouldn't be considered a critical BC break, as tests should not rely on internal implementation details of Hatchet, so this should be removed soon.exit status
Right now, this alternative way of getting the exit status of a
heroku run
is an argument torun
that changes its return value: if:return_obj => true
is given as an option, the return value is the wholeHerokuRun
object rather than just the output string, allowing access to.output
and.status
.I tried returning the object and giving the class
to_s
andto_str
methods, butmatch()
etc do not trigger those, so it would break basically every test out there.alternatives
Alternatively, we could have new method that returns the whole object, and the old
run
just wraps that.For reference,
run_multi
yields a block withoutput
andstatus
arguments. Maybe something similar to that?tests
Tests will be added to this PR, I just wanted to get a general +1 first before I weave them in.
review
Please review this by also following the commits in chronological order, I've taken extra care to make the "journey" clear.