
Crash detector looks broken #252

Open
Totktonada opened this issue Dec 15, 2020 · 3 comments
Labels
bug Something isn't working

@Totktonada
Member

System information

Linux.
test-run: e843552.
tarantool: 2.7.0-111-g28f3b2f1e.
python2: 2.7.17.
gevent: 1.5_alpha2 and 20.6.2.
greenlet: 0.4.15-r1 and 0.4.16.

How to observe the problem

  1. Pull and build recent tarantool (2.7.0-111-g28f3b2f1e in my case).

  2. Copy test/app-tap/test-timeout.test.lua from #244 (comment) ('Add test_timeout to limit test run time'); don't forget to set the executable bit: chmod a+x test/app-tap/test-timeout.test.lua.

  3. Mangle test/app-tap/debug/server.lua to fail after 1 second. Place this code at the end of the file:

    local fiber = require('fiber')
    fiber.create(function()
        fiber.sleep(1)
        os.exit(1)
    end)
  4. Run the test: ./test/test-run.py -j1 app-tap/test-timeout.test.lua.

Expected: the failure of the non-default server is detected and the testing fails with an appropriate message after ~1 second.

Got: a failure after 120 seconds (the default --no-output-timeout value), with no report about the failure of the non-default server.

Investigation

I observed that self.process.returncode in TarantoolServer.crash_detect() is 0, while the process actually exits with code 1.

After either of the following two patches the exit code becomes correct.

Variant 1:

diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index 481b08f..6624ccf 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -8,7 +8,7 @@ import re
 import shlex
 import shutil
 import signal
-import subprocess
+from gevent import subprocess
 import sys
 import time
 import yaml

Variant 2:

diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index 481b08f..a26e7c6 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -928,7 +928,7 @@ class TarantoolServer(Server):
         while self.process.returncode is None:
             self.process.poll()
             if self.process.returncode is None:
-                gevent.sleep(0.1)
+                time.sleep(0.1)
 
         if self.process.returncode in [0, -signal.SIGKILL, -signal.SIGTERM]:
             return

test-run then fails in app_server.py on the `if retval['returncode'] != 0` check, but never mind, it'll be fixed soon. The tarantool instance that executes app-tap/test-timeout.test.lua hangs in the `while true do end` loop, but this will be fixed within the scope of #65 and #157.

It seems there is some problem in the interaction between Python's subprocess module and the gevent module, but I failed to create a small reproducer.
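
For the record, a minimal sketch of the kind of interaction I suspect (an illustration only, not a working reproducer; the child command and timings are made up):

    # Plain stdlib subprocess in a process that also drives a gevent hub.
    import subprocess  # intentionally not gevent.subprocess
    import gevent

    proc = subprocess.Popen(['sh', '-c', 'sleep 0.5; exit 1'])

    gevent.sleep(1)  # let the gevent hub run while the child exits

    proc.poll()
    print(proc.returncode)  # expected: 1; the crash detector sees 0 in
                            # a similar situation inside test-run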

Totktonada added the bug label Dec 15, 2020
Totktonada changed the title from 'Crash detector seems being broken' to 'Crash detector looks broken' Dec 15, 2020
@Totktonada
Member Author

An interesting fact regarding `from gevent import subprocess`: after I applied it, I started to receive EINTR from select() in the main thread.

How to reproduce:

$ ./test/test-run.py --test-timeout 5 -j16 $(yes app-tap/test-timeout.test.lua | head -n 2)

How to fix:

diff --git a/dispatcher.py b/dispatcher.py
index cedd355..ec58212 100644
--- a/dispatcher.py
+++ b/dispatcher.py
@@ -1,4 +1,5 @@
 import os
+import errno
 import signal
 import time
 import select
@@ -275,6 +276,11 @@ class Dispatcher:
             except KeyboardInterrupt:
                 self.flush_ready(inputs)
                 raise
+            except select.error as e:
+                if e[0] == errno.EINTR:
+                    continue
+                self.flush_ready(inputs)
+                raise
 
             objs = self.invoke_listeners(inputs, ready_inputs)
             for obj in objs:

It seems it is due to a SIGCHLD from a worker process.
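
The same retry idiom as a standalone helper (a sketch, not test-run code; written for Python 2, where select.error carries errno in e[0] — Python 3.5+ retries interrupted select() calls automatically per PEP 475):

    import errno
    import select

    def select_eintr_safe(rlist, wlist, xlist, timeout=None):
        """select.select() that retries when interrupted by a signal."""
        while True:
            try:
                return select.select(rlist, wlist, xlist, timeout)
            except select.error as e:
                if e[0] == errno.EINTR:
                    continue  # interrupted by a signal (e.g. SIGCHLD): retry
                raise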

@Totktonada
Member Author

NB: time.sleep(0.1) breaks --test-timeout for 'core = tarantool' tests, because this way we don't pass execution to the gevent/greenlet scheduler.
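
A minimal illustration of the difference (standalone sketch, not test-run code):

    import time
    import gevent

    start = time.time()

    def watchdog():
        gevent.sleep(0.5)
        print('watchdog fired after %.1f s' % (time.time() - start))

    g = gevent.spawn(watchdog)

    # A cooperative gevent.sleep(1) here would let the watchdog fire after
    # ~0.5 s. The blocking time.sleep(1) keeps the whole thread busy, so
    # the watchdog cannot even start its timer until we yield to the hub
    # below, and it fires only after ~1.5 s.
    time.sleep(1)
    g.join()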

Totktonada added a commit that referenced this issue Dec 15, 2020
It is possible that the crash detector of a non-default instance calls
`kill_current_test()` and then the greenlet that executes the current
test is killed, so it will not set the returncode.

See #252 for a reproducer.

Part of #65
Part of #157
Totktonada added a commit that referenced this issue Dec 15, 2020
See the test case in #252. The idea is the following: the default server
(it is the app server) executes a script, which starts a non-default
server and hangs. The non-default instance fails after some time (before
--test-timeout), test-run's crash detector observes this and stops the
test (kills the greenlet). In this case we kill all other non-default
servers, but prior to this commit we didn't kill the app server itself.

Regarding the implementation: I updated AppServer.stop() to follow the
way TarantoolServer.stop() works: verbose logging, SIGKILL after a
timeout, waiting for process termination. And I reused this code in the
finalization of an app test.

Introduced the AppTest.teardown() method. The idea is to have a method
where all finalization after AppTest.execute() occurs. Maybe we'll
refactor process management in the future, and the presence of an
explicit finalization method will make things clearer.

I think it is a good point to close #65 and #157: most cases where
tarantool instances are not terminated or killed are fixed. There is
kill_old_server(), which only sends SIGTERM, but we should not find
alive instances there after the 'Wait until residual servers will be
stopped' commit. There is #256, but it is unlikely to hit us on real
tests. If there are other cases of this kind, we should handle them
separately: file an issue with a reproducer, investigate, and fix them.
I see no reason to track 'kill hung server' as a general task anymore.

Fixes #65
Fixes #157
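
A rough sketch of that stop sequence (hypothetical naming, not the actual AppServer.stop() code; inside test-run the waiting would use the cooperative gevent.sleep() rather than the blocking time.sleep()):

    import signal
    import time

    def stop_process(process, timeout=5):
        """Stop a subprocess.Popen-like process, escalating to SIGKILL."""
        if process.poll() is not None:
            return process.returncode          # already exited
        process.send_signal(signal.SIGTERM)    # ask politely first
        deadline = time.time() + timeout
        while process.poll() is None and time.time() < deadline:
            time.sleep(0.1)
        if process.poll() is None:
            process.send_signal(signal.SIGKILL)  # no reaction: force it
            process.wait()                       # reap the killed process
        return process.returncode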
@Totktonada
Member Author

Apart from the crash detector code, there is another place where we spin while acquiring the status of a process with time.sleep() / gevent.sleep(): the seek_wait() method. Noted here. I don't know whether it is broken in the same way as the crash detector, so let's check within the scope of this issue.
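
For reference, the general shape of such a spin wait (a hypothetical helper, not the actual seek_wait() code):

    import time
    import gevent

    def spin_wait(condition, delay=0.1, timeout=10):
        """Poll condition() until it returns True or the timeout expires."""
        deadline = time.time() + timeout
        while not condition():
            if time.time() > deadline:
                return False
            gevent.sleep(delay)  # cooperative yield between polls; a plain
                                 # time.sleep() here would block the hub
        return True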
