
Crash detector looks broken #252

Open
Totktonada opened this issue Dec 15, 2020 · 3 comments
Labels
bug Something isn't working

@Totktonada
Member

System information

Linux.
test-run: e843552.
tarantool: 2.7.0-111-g28f3b2f1e.
python2: 2.7.17.
gevent: 1.5_alpha2 and 20.6.2.
greenlet: 0.4.15-r1 and 0.4.16.

How to observe the problem

  1. Pull and build recent tarantool (2.7.0-111-g28f3b2f1e in my case).

  2. Copy test/app-tap/test-timeout.test.lua from #244 (comment) ('Add test_timeout to limit test run time'); don't forget to set the executable bit: chmod a+x test/app-tap/test-timeout.test.lua.

  3. Mangle test/app-tap/debug/server.lua to fail after 1 second. Place this code at the end of the file:

    local fiber = require('fiber')
    fiber.create(function()
        fiber.sleep(1)
        os.exit(1)
    end)
  4. Run the test: ./test/test-run.py -j1 app-tap/test-timeout.test.lua.

Expected: the failure of the non-default server is detected and the testing fails with an appropriate message after ~1 second.

Got: a failure after 120 seconds (the default --no-output-timeout value), with no report about the failure of the non-default server.

Investigation

I observed that self.process.returncode in TarantoolServer.crash_detect() is 0, while the process actually exits with code 1.

After either of the following two patches the exit code becomes correct.

Variant 1:

diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index 481b08f..6624ccf 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -8,7 +8,7 @@ import re
 import shlex
 import shutil
 import signal
-import subprocess
+from gevent import subprocess
 import sys
 import time
 import yaml

Variant 2:

diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index 481b08f..a26e7c6 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -928,7 +928,7 @@ class TarantoolServer(Server):
         while self.process.returncode is None:
             self.process.poll()
             if self.process.returncode is None:
-                gevent.sleep(0.1)
+                time.sleep(0.1)
 
         if self.process.returncode in [0, -signal.SIGKILL, -signal.SIGTERM]:
             return

test-run then fails in app_server.py on the `if retval['returncode'] != 0` check, but never mind, it'll be fixed soon. The tarantool instance that executes app-tap/test-timeout.test.lua hangs in the `while true do end` loop, but this will be fixed within the scope of #65 and #157.

It seems there is some problem in the interaction between Python's subprocess module and the gevent module, but I failed to create a small reproducer.
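
For the record, a minimal sketch of the kind of interaction I suspect (an illustration only, not a working reproducer; the child command and timings are made up):

    # Plain stdlib subprocess in a process that also drives a gevent hub.
    import subprocess  # intentionally not gevent.subprocess
    import gevent

    proc = subprocess.Popen(['sh', '-c', 'sleep 0.5; exit 1'])

    gevent.sleep(1)  # let the gevent hub run while the child exits

    proc.poll()
    print(proc.returncode)  # expected: 1; the crash detector sees 0 in
                            # a similar situation inside test-run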

Totktonada added the bug label Dec 15, 2020
Totktonada changed the title from 'Crash detector seems being broken' to 'Crash detector looks broken' Dec 15, 2020
@Totktonada
Member Author

An interesting fact regarding `from gevent import subprocess`: after I applied it, I started to receive EINTR from select() in the main thread.

How to reproduce:

$ ./test/test-run.py --test-timeout 5 -j16 $(yes app-tap/test-timeout.test.lua | head -n 2)

How to fix:

diff --git a/dispatcher.py b/dispatcher.py
index cedd355..ec58212 100644
--- a/dispatcher.py
+++ b/dispatcher.py
@@ -1,4 +1,5 @@
 import os
+import errno
 import signal
 import time
 import select
@@ -275,6 +276,11 @@ class Dispatcher:
             except KeyboardInterrupt:
                 self.flush_ready(inputs)
                 raise
+            except select.error as e:
+                if e[0] == errno.EINTR:
+                    continue
+                self.flush_ready(inputs)
+                raise
 
             objs = self.invoke_listeners(inputs, ready_inputs)
             for obj in objs:

It seems it is due to a SIGCHLD from a worker process.
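
The same retry idiom as a standalone helper (a sketch, not test-run code; written for Python 2, where select.error carries errno in e[0] — Python 3.5+ retries interrupted select() calls automatically per PEP 475):

    import errno
    import select

    def select_eintr_safe(rlist, wlist, xlist, timeout=None):
        """select.select() that retries when interrupted by a signal."""
        while True:
            try:
                return select.select(rlist, wlist, xlist, timeout)
            except select.error as e:
                if e[0] == errno.EINTR:
                    continue  # interrupted by a signal (e.g. SIGCHLD): retry
                raise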

@Totktonada
Member Author

NB: time.sleep(0.1) breaks --test-timeout for 'core = tarantool' tests, because this way we don't pass execution to the gevent/greenlet scheduler.
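
A minimal illustration of the difference (standalone sketch, not test-run code):

    import time
    import gevent

    start = time.time()

    def watchdog():
        gevent.sleep(0.5)
        print('watchdog fired after %.1f s' % (time.time() - start))

    g = gevent.spawn(watchdog)

    # A cooperative gevent.sleep(1) here would let the watchdog fire after
    # ~0.5 s. The blocking time.sleep(1) keeps the whole thread busy, so
    # the watchdog cannot even start its timer until we yield to the hub
    # below, and it fires only after ~1.5 s.
    time.sleep(1)
    g.join()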

Totktonada added a commit that referenced this issue Dec 15, 2020
It is possible that the crash detector of a non-default instance calls
`kill_current_test()` and then the greenlet that executes the current
test is killed, so it will not set the returncode.

See #252 for a reproducer.

Part of #65
Part of #157
Totktonada added a commit that referenced this issue Dec 15, 2020
See the test case in #252. The idea is the following: the default server
(it is the app server) executes a script, which starts a non-default
server and hangs. The non-default instance fails after some time (before
--test-timeout), test-run's crash detector observes this and stops the
test (kills the greenlet). In this case we kill all other non-default
servers, but prior to this commit we didn't kill the app server itself.

Regarding the implementation: I updated AppServer.stop() to follow the
way TarantoolServer.stop() works: verbose logging, SIGKILL after a
timeout, waiting for process termination. And I reused this code in the
finalization of an app test.

Introduced the AppTest.teardown() method. The idea is to have a method
where all finalization after AppTest.execute() occurs. Maybe we'll
refactor process management in the future, and the presence of an
explicit finalization method will make things clearer.

I think it is a good point to close #65 and #157: most cases where
tarantool instances are not terminated or killed are fixed. There is
kill_old_server(), which only sends SIGTERM, but we should not find
alive instances there after the 'Wait until residual servers will be
stopped' commit. There is #256, but it is unlikely to hit us on real
tests. If there are other cases of this kind, we should handle them
separately: file an issue with a reproducer, investigate, and fix them.
I see no reason to track 'kill hung server' as a general task anymore.

Fixes #65
Fixes #157
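
A rough sketch of that stop sequence (hypothetical naming, not the actual AppServer.stop() code; inside test-run the waiting would use the cooperative gevent.sleep() rather than the blocking time.sleep()):

    import signal
    import time

    def stop_process(process, timeout=5):
        """Stop a subprocess.Popen-like process, escalating to SIGKILL."""
        if process.poll() is not None:
            return process.returncode          # already exited
        process.send_signal(signal.SIGTERM)    # ask politely first
        deadline = time.time() + timeout
        while process.poll() is None and time.time() < deadline:
            time.sleep(0.1)
        if process.poll() is None:
            process.send_signal(signal.SIGKILL)  # no reaction: force it
            process.wait()                       # reap the killed process
        return process.returncode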
@Totktonada
Member Author

Apart from the crash detector code, there is another place where we spin while acquiring the status of a process with time.sleep() / gevent.sleep(): the seek_wait() method. Noted here. I don't know whether it is broken in the same way as the crash detector, so let's check within the scope of this issue.
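
For reference, the general shape of such a spin wait (a hypothetical helper, not the actual seek_wait() code):

    import time
    import gevent

    def spin_wait(condition, delay=0.1, timeout=10):
        """Poll condition() until it returns True or the timeout expires."""
        deadline = time.time() + timeout
        while not condition():
            if time.time() > deadline:
                return False
            gevent.sleep(delay)  # cooperative yield between polls; a plain
                                 # time.sleep() here would block the hub
        return True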
