-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Terminate the timeout thread at exit #21
Conversation
I am not sure how to test this, given that it only affects runtimes that can spin up multiple instances in a single process. CRuby does not have this capability so there's no opportunity to test this. |
In some runtimes, the process may not go away but the Ruby runtime should be shut down. This thread does not clean itself up, which leads to such runtimes leaning on a best-effort kill+join of the dangling threads late in the shutdown process. As a result these threads may linger slightly after shutdown and fail expectations that they will have completed and finalized. This patch adds an `at_exit` for the timeout library to terminate its own thread, which would happen earlier in the shutdown process (`at_exit` execution is considered part of the "user" shutdown phase in JRuby). This appears to fix issues running JRuby's embedding API tests that check for timeout threads to have been cleanly terminated and finalized.
4def132
to
ab6746c
Compare
I updated this based on review comments. It now triggers shutdown by closing the timeout queue and then waits for the thread to complete. The thread itself sees the queue shutdown and terminates itself. There may be some small delay in shutting down the runtime as the timeout thread will need to schedule and run to completion, but it is not doing a hard kill from outside the timeout thread anymore. |
We would like to get this merged and released since the dangling thread causes a number of our JRuby master test suites to fail. |
Actually this approach seems problematic, because there might be code running inside another That's also the case for unit tests of this gem. diff --git a/test/test_timeout.rb b/test/test_timeout.rb
index 76de389..62e3861 100644
--- a/test/test_timeout.rb
+++ b/test/test_timeout.rb
@@ -1,6 +1,7 @@
# frozen_string_literal: false
require 'test/unit'
require 'timeout'
+Timeout.timeout(1){}
class TestTimeout < Test::Unit::TestCase
Then all/most tests fail. I think the real solution here is fixing JRuby so it can reliably kill Ruby-created threads when the instance/context is being closed. This is already what TruffleRuby does. |
I consider this an upstream/JRuby issue so I close this. |
There might be changes needed to mark this thread as kind of a daemon thread, which is something Ruby doesn't have currently, so I'm open to other changes in general to improve dealing with that daemon thread. For this specific case here though I don't think there is anything needed to change here. |
I disagree with your assessment and would like to debate this more. Does TruffleRuby have tests for repeatedly spinning up a runtime, running a timeout, and then confirming all threads are terminated immediately after that runtime is terminated? Is it even possible to terminate a TruffleRuby runtime instance and start another in the same process? Your argument is questionable anyway, since threads shutting down may need to Another aspect of this: your version already changed behavior, by adding a daemon thread that is now visible to the rest of the runtime and still executing when the runtime proceeds to shut down. Ergo, you introduced a behavioral change that affects any implementation that may run multiple runtimes in a given process, since there's now an expectation that implicitly-started threads from this library must also be cleaned up. I could argue that you are the one that introduce the bug here, by not cleaning up resources on shutdown. JRuby solved this problem when we used the JDK version of this logic by adding an explicit shutdown of the executor thread before we proceed with the rest of runtime teardown. The test we added to confirm that shutdown happened is the test that's failing now with the new version of the timeout library. That shutdown is what I'm asking for here. The native JRuby version did not interfere with Ruby behavior because the executor thread was considered separate from the JRuby runtime and threads spawned by Ruby code. All shutdown logic for userspace code and resources ran first, and then the timeout executor was terminated as part of JRuby's internal teardown. Unfortunately we don't have a way to mark library resources as being internal or system-level, so there will always be potential conflict when trying to release those resources. And in the end you not are the sole maintainer or consumer of this library, and I do not think it is appropriate for you to unilaterally close this without discussion. Remember you already had several standard libraries incorporate patches to ignore the gem contents and use the versions TruffleRuby ships. That is a TruffleRuby issue, so I propose we also remove those changes if we are not in the business of making these libraries accommodate non-CRuby runtimes. |
If the shutdown at_exit runs before other at_exit or termination logic that still depends on the timeout thread, those subsequent uses of timeout may fail. This additional patch allows the timeout thread and queue to be restarted after the at_exit has fired, with the side effect that it will not try to shutdown a second time. This addresses review comments from ruby#21.
I have reopened this PR and added a commit that makes it possible for timeout to be restarted, for the rare cases where the Your modification to the test suite passes as before with the additional patch in place.
|
I'm working on refining the shutdown. The locking gets tricky when we have parallel-executing implementations, which is causing some JRuby and TruffleRuby runs to hang. |
This allows more atomic shutdown; either the queue close or the thread terminating indicates the executor has been shut down, and a new one can be created and put into action without interfering with the shutdown of a previous one.
I've committed a different version that only requires a single lock to manage a new Executor instance; when it is shut down, either via the queue being closed or the thread terminating, it will be seen as not alive and a new one instantiated. This allows CRuby to pass all tests (even with the |
As noted in #17, an alternative to a "restartable" executor would be to just fall back on the old thread-per-timeout logic when the executor has been shutdown (or when we are running in a context that disallows mutex usage). There are many different ways to make this shutdown logic work cleanly. There's no valid reason to close this PR without discussing those options. |
See #7349 and the fix in ruby/timeout#21 which will address it.
Yes. An instance is called a And since TruffleRuby does that, I'm pretty sure JRuby can too, at very least for a Ruby-created thread only running Ruby code like the timeout thread.
Thread#kill should not execute
Good luck with that given the behavior of killing other threads on context close (or process shutdown) is standard for Ruby since many years.
The solution is clear, JRuby needs to kill other threads on context close, and join them. This is compatible with what CRuby does and it is already what TruffleRuby does. |
I closed this PR because I do not want alternatives. Literally, JRuby instance shutdown must do: Thread.list.each do |t|
next if t == Thread.main
t.kill
t.join
end And if it doesn't JRuby leaks threads, and that's then JRuby's problem. @headius What is the issue with doing that in JRuby? |
FWIW, this test passes on TruffleRuby: @Test
public void testKillThread() {
for (int i = 0; i < 100; i++) {
System.err.println(i);
try (Context context = Context.newBuilder("ruby")
.allowCreateThread(true)
.allowExperimentalOptions(true)
.option("ruby.single-threaded", "false")
.build()) {
final Value value = context.eval(
"ruby",
"Thread.new { sleep }; sleep 0.1; 42");
assertEquals(42, value.asInt());
}
for (Thread t : Thread.getAllStackTraces().keySet()) {
Assert.assertFalse(t.getName(), t.getName().startsWith("Ruby"));
}
}
} and is basically the same as the JRuby test in jruby/jruby@ebc0557 |
You are incorrect about the behavior of Thread#kill. It always runs ensure blocks:
I suspect you were confusing it with However, my primary point with this PR is that libraries that start up threads and hold resources should always make a best effort to tidy up those threads and resources at shutdown, given the impact to multi-runtime embedding use cases. Your changes mean that there's now one more dangling thread at termination time that must be forcibly killed rather than gracefully shutting itself down. It's untidy and unnecessary. The implementation changes I provided are also no more complicated than what you had.
These are good improvements to the library.
That's not your decision to make. I could have committed the changes directly, but I wanted to have a dialog with other maintainers about the best way to implement the shutdown logic. This is a shared project, used by all implementations and pretty much every Rubyist out there. We need to come to some agreement before accepting or rejecting reasonable changes. I still believe that a graceful shutdown is appropriate here, along with other refactoring. It may also be useful for users to be able to trigger the shutdown themselves, as part of reloaders and continuous deployment setups. |
Right, I was confusing with killing Fibers, which does not run
|
I guess I lost track of this because I'm trying to determine if CRuby uses something similar to shut down threads, ensures do not seem to fire for a sleeping thread at shutdown. Either they are not actively killing the thread, or they are doing it in a way that skips ensures. Do you recall where this happens in CRuby code? I thought I found it years ago, but can't locate the logic anymore. |
It does fire on CRuby, see my command and output above. |
You are correct, my local test was not giving the thread enough time to start and sleep. This means that threads may encounter a case where the timeout thread has been killed, but an ensure wants to run a timeout.
Depending on the shutdown sequence, a thread containing an ensure that does a timeout (for example, attempt to clean up some system resource but don't wait forever) may output an error and fail to run the timeout block, as I originally suggested. Unfortunately there's no way to fix this case, since CRuby disallows thread creation during this phase of shutdown. |
I think jruby/jruby#7351 solves this and makes any change here unnecessary, so I think we should close this. |
Regarding the potential issue during killing other threads in the shutdown sequence, |
Let's close this, it seems unnecessary and would add a non-trivial amount of complexity to solve problems that don't seem to happen in practice. |
The main thread created by the timeout library, after 0.3.0, does not provide any mechanism to cleanly shut down. This leads to failures in a JRuby test that launches a JRuby instance in an existing JVM process and confirms no additional threads remain alive after the instance is terminated. This behavior also differs from the old timeout library, which always terminated every thread it created (i.e. there were no background threads since a new thread was created and subsequently terminated for each timeout operation).
JRuby, like CRuby, does attempt to shut down all Ruby threads in a given instance before "exit", but it does this as a best-effort very late in the shutdown process, possibly not giving the timeout thread enough time to finalize at a native level. JRuby's previous timeout implementation was built as an extension using JDK concurrency APIs like Executor, and included a mechanism for shutting down the timeout Executor and its threads (using a mechanism similar to
at_exit
).The following patch adds an
at_exit
shutdown process for the timeout thread, to explicitly kill and wait for it to die. With this change, the test in question appears to pass more reliably: every test run out of 50 passed locally versus failing most of the time before.