Stop instances properly #257

Totktonada · 2020-12-15T23:51:06Z

Attempt to get rid of the 'The daemon is already running' error and of hung instances after testing.

Wait for a process to terminate after a signal to get rid of the 'The daemon is already running' error.
Stop supplementary (non-default) instances after an app test.
Use SIGKILL when SIGTERM is not sufficient for a non-default instance of an app test (follows up Add the 'timeout' option into 'stop server' command and kill a server… #186).
Kill an app server if it hangs and a non-default instance fails.

I used the test cases from PR #244 and tweaked them to hang or fail instances at different points: before startup or after it. For example, hang the default app server and fail a supplementary one after startup.

Follows up PR #244.
Fixes #65.
Fixes #157.

While looking into this, I found several questionable situations or pieces of code: #252, #253, #254, #255, #256.

Otherwise, if the next test tries to start a new instance with the same name, it may fail with the 'The daemon is already running' error when the previous test fails or reaches --test-timeout.

The commit fixes the problem for 'core = tarantool' test suites. The next commit will resolve it for 'core = app' test suites.

See a test case in PR #244 (app/test-timeout.test.lua).

A few words about the implementation details. The killall_servers() method performs almost the same actions as stop_nondefault(), so I kept the latter and removed the former. However, there are differences:

- killall_servers() does not wait for termination of processes. This is why the change fixes the given problem.
- killall_servers() sends SIGKILL, while stop_nondefault() stops instances as usual: SIGTERM, wait 5 seconds at most, SIGKILL, wait for termination.

I'll bring back the instant SIGKILL in one of the following commits: I split the changes for ease of reading.

Part of #157
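For illustration, the usual stop sequence named above (SIGTERM, bounded wait, SIGKILL, wait for termination) can be sketched roughly as follows. This is not the test-run code: `stop_process()` and its `timeout` / `instant_kill` parameters are hypothetical names, and the sketch assumes Python 3 (`Popen.wait(timeout=...)`).

```python
import subprocess


def stop_process(proc, timeout=5.0, instant_kill=False):
    """Stop a child process and wait until it has terminated.

    Hypothetical helper for illustration only; not part of test-run.
    Returns the exit status of the process.
    """
    if proc.poll() is not None:
        # The process has already terminated and was reaped.
        return proc.returncode

    if instant_kill:
        # Skip the graceful phase (SIGKILL on POSIX).
        proc.kill()
    else:
        # Graceful phase: SIGTERM, then wait up to `timeout` seconds.
        proc.terminate()
        try:
            return proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # SIGTERM was not sufficient, so fall back to SIGKILL.
            proc.kill()

    # Always wait for actual termination, so the caller can safely
    # start a new process that reuses the same resources.
    return proc.wait()
```

The final `proc.wait()` is the point of the sketch: returning only after the process is gone is what prevents a following start from racing with a still-running instance.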

If a test fails or reaches --test-timeout, some non-default tarantool instances may be left running, so if the next test starts an instance with the same name, it may fail with the 'The daemon is already running' error.

This change follows the approach of the previous commit ('Wait until residual servers will be stopped').

See a test case in PR #244 (app-tap/test-timeout.test.lua).

Part of #65
Part of #157

This way it is easier to identify the cause of a particular message about stopping a server.

A supplementary instance that is started using the `test_run:cmd('start server foo')` command is called non-default. In contrast, the instance that executes the test's commands by default is called 'default'.

test-run stops non-default servers after each test.

test-run cleans up non-default servers (removes xlogs, snapshots) after a successful test, or after each test when --force is passed.

Part of #65
Part of #157
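The stop / cleanup policy described above can be sketched as follows. `StubServer` and `finalize_nondefault()` are hypothetical names used only for this illustration; they are not test-run's classes or functions.

```python
class StubServer:
    """Hypothetical stand-in for a non-default instance."""

    def __init__(self, name):
        self.name = name
        self.stopped = False
        self.cleaned = False

    def stop(self):
        self.stopped = True

    def cleanup(self):
        # In the real tool this step removes xlogs and snapshots.
        self.cleaned = True


def finalize_nondefault(servers, test_passed, force=False):
    """Stop every non-default server; clean up on success or --force."""
    # Stopping happens after each test, regardless of its result.
    for server in servers:
        server.stop()
    # Cleanup happens after a successful test, or always with --force.
    if test_passed or force:
        for server in servers:
            server.cleanup()
```

Under this reading, a failed test without --force leaves the instance files in place, while the processes are stopped in every case.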

This way we decrease the probability of reaching --no-output-timeout after --test-timeout. See the comment in LuaTest.execute() for details.

Part of #65
Part of #157

It is possible that the crash detector of a non-default instance calls `kill_current_test()` and the greenlet that executes the current test is killed. In that case the greenlet will not set the returncode. See #252 for a reproducer.

Part of #65
Part of #157

`send_signal()`, `terminate()` or `kill()` methods of a subprocess.Popen object raise OSError on Python 2 when the process does not exist and Python knows about it. See:

 | >>> import subprocess
 | >>> p = subprocess.Popen(['true'])
 | >>> p.poll()
 | 0
 | >>> p.kill()
 | Traceback (most recent call last):
 |   File "<stdin>", line 1, in <module>
 |   File "/usr/lib64/python2.7/subprocess.py", line 1279, in kill
 |     self.send_signal(signal.SIGKILL)
 |   File "/usr/lib64/python2.7/subprocess.py", line 1269, in send_signal
 |     os.kill(self.pid, sig)
 | OSError: [Errno 3] No such process

Python 3 does not raise the exception in this case. Let's protect ourselves from possible races / behaviour differences and silence the exception.

Part of #157
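A minimal sketch of silencing that exception is shown below. `safe_send_signal()` is a hypothetical name, not the function from the commit. The sketch also makes one narrowing assumption that the commit message does not state: it swallows only the 'No such process' case (ESRCH) and re-raises any other OSError.

```python
import errno
import signal
import subprocess


def safe_send_signal(proc, sig):
    """Send `sig` to `proc`, ignoring 'No such process'.

    Hypothetical helper for illustration only; not part of test-run.
    """
    try:
        proc.send_signal(sig)
    except OSError as e:
        # Python 2 raises OSError(ESRCH) for an already reaped
        # process; on Python 3 the call is usually a no-op here.
        if e.errno != errno.ESRCH:
            raise
```

With such a wrapper the caller does not need to care whether the process exited between the last `poll()` and the signal.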

See a test case in #252. The idea is the following: the default server (it is the app server) executes a script, which starts a non-default server and hangs. The non-default instance fails after some time (before --test-timeout), test-run's crash detector observes this and stops the test (kills the greenlet). We kill all other non-default servers in this case, but prior to this commit we did not kill the app server itself.

Regarding the implementation. I updated AppServer.stop() to follow the way TarantoolServer.stop() works: verbose logging, SIGKILL after a timeout, waiting for process termination. I reused this code in the finalization of an app test.

Introduced the AppTest.teardown() method. The idea is to have a method where all finalization after AppTest.execute() occurs. Maybe we'll refactor process management in the future, and the presence of an explicit finalization method will make things clearer.

I think it is a good point to close #65 and #157: most of the cases where tarantool instances are not terminated or killed are fixed. There is kill_old_server(), which only sends SIGTERM, but we should not find alive instances there after the 'Wait until residual servers will be stopped' commit. There is #256, but it is unlikely to hit us in real tests.

If there are other cases of this kind, we should handle them separately: file an issue with a reproducer, investigate and fix it. I see no reason to track 'kill hung server' as a general task anymore.

Fixes #65
Fixes #157
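The idea of an explicit finalization step can be illustrated with a small sketch. It is not the actual AppTest code: `StubAppServer`, `SketchAppTest` and `run()` are hypothetical names, and the use of try/finally to guarantee that teardown runs is an assumption made for this example.

```python
class StubAppServer:
    """Hypothetical stand-in for an app server process."""

    def __init__(self):
        self.running = True

    def stop(self):
        # A real implementation would send SIGTERM, fall back to
        # SIGKILL after a timeout and wait for termination.
        self.running = False


class SketchAppTest:
    """Illustrative test object with an explicit teardown step."""

    def __init__(self, server, body):
        self.server = server
        self.body = body  # callable standing in for the test script

    def execute(self):
        self.body()

    def teardown(self):
        # All finalization after execute() lives in one place.
        if self.server.running:
            self.server.stop()

    def run(self):
        try:
            self.execute()
        finally:
            # Runs even if execute() is interrupted by an exception.
            self.teardown()
```

Keeping the finalization in a single method means that an interrupted test still stops its app server, which is the situation the test case from #252 exercises.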

LeonidVas

LGTM.

Totktonada · 2020-12-19T16:25:15Z

Updated the test-run submodule in tarantool in 2.7.0-116-g6e6e7b29b, 2.6.1-101-g13ff2ff2b, 2.5.2-65-gcdc10c888, 1.10.8-44-g0e95810dd.

Totktonada added 7 commits December 8, 2020 10:56

Send SIGKILL right away for residual instances

2e5aebb


Totktonada added the bug label Dec 15, 2020

Totktonada requested review from avtikhon and LeonidVas December 15, 2020 23:51

Totktonada mentioned this pull request Dec 15, 2020

tarantoolctl: fix broken crash detector in test-run tarantool/tarantool#5587

Closed

avtikhon approved these changes Dec 16, 2020


LeonidVas approved these changes Dec 18, 2020


Totktonada merged commit 584e273 into master Dec 18, 2020

Totktonada deleted the Totktonada/stop-instances-properly branch December 18, 2020 22:46
