-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crash detector looks broken #252
Comments
Interesting fact regarding How to reproduce: $ ./test/test-run.py --test-timeout 5 -j16 $(yes app-tap/test-timeout.test.lua | head -n 2) How to fix: diff --git a/dispatcher.py b/dispatcher.py
index cedd355..ec58212 100644
--- a/dispatcher.py
+++ b/dispatcher.py
@@ -1,4 +1,5 @@
import os
+import errno
import signal
import time
import select
@@ -275,6 +276,11 @@ class Dispatcher:
except KeyboardInterrupt:
self.flush_ready(inputs)
raise
+ except select.error as e:
+ if e[0] == errno.EINTR:
+ continue
+ self.flush_ready(inputs)
+ raise
objs = self.invoke_listeners(inputs, ready_inputs)
for obj in objs: It seems, it is due to SIGCHLD from a worker process. |
NB: |
See a test case in #252. The idea is the following: the default server (it is the app server) executes a script, which starts a non-default server and hangs. The non-default instance fails after some time (before --test-timeout), test-run's crash detector observed this and stops the test (kills the greenlet). We kill all other non-default servers in the case, but prior to this commit, we don't kill the app server itself. Regarding the implementation. I updated AppServer.stop() to follow the way how TarantoolServer.stop() works: verbose logging, use SIGKILL after a timeout, wait for a process termination. And reused this code in the finalization of an app test. Introduced the AppTest.teardown() method. The idea is to have a method, where all finalization after AppTest.execute() occurs. Maybe we'll refactor process management in future and presence an explicit finalization method will make things more clear. I think, it is good point to close #65 and #157: most of cases, where tarantool instances are not terminated or killed, are fixed. There is kill_old_server(), which only sends SIGTERM, but we should not find alive instances here after the 'Wait until residual servers will be stopped' commit. There is #256, but it unlikely will hit us on real tests. If there will be other cases of this kind, we should handle them separately: file issue with a reproducer, investigate and fix them. I see no reason to track 'kill hung server' as the general task anymore. Fixes #65 Fixes #157
See a test case in #252. The idea is the following: the default server (it is the app server) executes a script, which starts a non-default server and hangs. The non-default instance fails after some time (before --test-timeout), test-run's crash detector observed this and stops the test (kills the greenlet). We kill all other non-default servers in the case, but prior to this commit, we don't kill the app server itself. Regarding the implementation. I updated AppServer.stop() to follow the way how TarantoolServer.stop() works: verbose logging, use SIGKILL after a timeout, wait for a process termination. And reused this code in the finalization of an app test. Introduced the AppTest.teardown() method. The idea is to have a method, where all finalization after AppTest.execute() occurs. Maybe we'll refactor process management in future and presence an explicit finalization method will make things more clear. I think, it is good point to close #65 and #157: most of cases, where tarantool instances are not terminated or killed, are fixed. There is kill_old_server(), which only sends SIGTERM, but we should not find alive instances here after the 'Wait until residual servers will be stopped' commit. There is #256, but it unlikely will hit us on real tests. If there will be other cases of this kind, we should handle them separately: file issue with a reproducer, investigate and fix them. I see no reason to track 'kill hung server' as the general task anymore. Fixes #65 Fixes #157
See a test case in #252. The idea is the following: the default server (it is the app server) executes a script, which starts a non-default server and hangs. The non-default instance fails after some time (before --test-timeout), test-run's crash detector observed this and stops the test (kills the greenlet). We kill all other non-default servers in the case, but prior to this commit, we don't kill the app server itself. Regarding the implementation. I updated AppServer.stop() to follow the way how TarantoolServer.stop() works: verbose logging, use SIGKILL after a timeout, wait for a process termination. And reused this code in the finalization of an app test. Introduced the AppTest.teardown() method. The idea is to have a method, where all finalization after AppTest.execute() occurs. Maybe we'll refactor process management in future and presence an explicit finalization method will make things more clear. I think, it is good point to close #65 and #157: most of cases, where tarantool instances are not terminated or killed, are fixed. There is kill_old_server(), which only sends SIGTERM, but we should not find alive instances here after the 'Wait until residual servers will be stopped' commit. There is #256, but it unlikely will hit us on real tests. If there will be other cases of this kind, we should handle them separately: file issue with a reproducer, investigate and fix them. I see no reason to track 'kill hung server' as the general task anymore. Fixes #65 Fixes #157
See a test case in #252. The idea is the following: the default server (it is the app server) executes a script, which starts a non-default server and hangs. The non-default instance fails after some time (before --test-timeout), test-run's crash detector observed this and stops the test (kills the greenlet). We kill all other non-default servers in the case, but prior to this commit, we don't kill the app server itself. Regarding the implementation. I updated AppServer.stop() to follow the way how TarantoolServer.stop() works: verbose logging, use SIGKILL after a timeout, wait for a process termination. And reused this code in the finalization of an app test. Introduced the AppTest.teardown() method. The idea is to have a method, where all finalization after AppTest.execute() occurs. Maybe we'll refactor process management in future and presence an explicit finalization method will make things more clear. I think, it is good point to close #65 and #157: most of cases, where tarantool instances are not terminated or killed, are fixed. There is kill_old_server(), which only sends SIGTERM, but we should not find alive instances here after the 'Wait until residual servers will be stopped' commit. There is #256, but it unlikely will hit us on real tests. If there will be other cases of this kind, we should handle them separately: file issue with a reproducer, investigate and fix them. I see no reason to track 'kill hung server' as the general task anymore. Fixes #65 Fixes #157
Apart from the crash detector code, there is another place, where we spin with acquiring status of a process with time.sleep() / gevent.sleep(): the |
System information
Linux.
test-run: e843552.
tarantool: 2.7.0-111-g28f3b2f1e.
python2: 2.7.17.
gevent: 1.5_alpha2 and 20.6.2.
greenlet: 0.4.15-r1 and 0.4.16.
How to observe the problem
Pull and build recent tarantool (2.7.0-111-g28f3b2f1e in my case).
Copy
test/app-tap/test-timeout.test.lua
from Add test_timeout to limit test run time #244 (comment) (don't forget to set the executable bit:chmod a+x test/app-tap/test-timeout.test.lua
).Mangle
test/app-tap/debug/server.lua
to fail after 1 second. Place this code at the end of the file:Run the test:
./test/test-run.py -j1 app-tap/test-timeout.test.lua
.Expected: the fail of the non-default server detected and the testing fails with appropriate message after ~1 second.
Got: fail after 120 seconds (default --no-output-timeout value), no report about the fail of the non-default server.
Investigation
I observed that
self.process.returncode
inTarantoolServer.crash_detect()
is 0, while the process returns 1.After any of the following two patches the exit code becomes correct.
Variant 1:
Variant 2:
test-run fails after this in app_server.py on
if retval['returncode'] != 0
, but nevermind, it'll be fixed soon. The tarantool instance that executesapp-tap/test-timeout.test.lua
hangs in thewhile true do end
, but it will be fixed within the scope of #65 and #157.It seems there is some problem around python's subprocess module and gevent module, but I failed to create a small reproducer.
The text was updated successfully, but these errors were encountered: