Stop instances properly #257

Merged
merged 7 commits into from
Dec 18, 2020

Totktonada
Member

An attempt to get rid of the 'The daemon is already running' error and of instances left hanging after testing.

  • Wait for a process to terminate after a signal to get rid of the 'The daemon is already running' error.
  • Stop supplementary (non-default) instances after an app test.
  • Use SIGKILL when SIGTERM is not sufficient for a non-default instance of an app test (follows up Add the 'timeout' option into 'stop server' command and kill a server… #186).
  • Kill an app server if it hangs and a non-default instance fails.

I use the test cases from PR #244 and tweak them to make instances hang or fail at different points: before or after startup. For example, hang the default app server and fail a supplementary one after startup.

Follows up PR #244.
Fixes #65.
Fixes #157.


While looking into this, I found several doubtful situations or pieces of code: #252, #253, #254, #255, #256.

Otherwise, if the next test tries to start a new instance with the
same name, it may fail with the 'The daemon is already running' error
when the previous test failed or reached --test-timeout.

The commit fixes the problem for 'core = tarantool' test suites. The
next commit will resolve it for 'core = app' test suites.

See a test case in PR #244 (app/test-timeout.test.lua).

A few words about the implementation details. The killall_servers()
method performs almost the same actions as stop_nondefault(), so I
kept the latter and removed the former. However, there are differences:

- killall_servers() does not wait for termination of processes. This is
  why the change fixes the given problem.
- killall_servers() sends SIGKILL, while stop_nondefault() stops
  instances as usual: SIGTERM, wait 5 seconds at most, SIGKILL, wait
  for termination. I'll return the instant SIGKILL in one of the
  following commits: I split the changes for ease of reading.
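
The 'SIGTERM, bounded wait, SIGKILL, wait for termination' sequence described above can be sketched roughly as follows. This is only an illustration, not test-run's code: `stop_process()` is a hypothetical helper, the sketch assumes Python 3 on a POSIX system (where `Popen.wait()` accepts a `timeout`), and the stubborn child is made up for the demonstration.

```python
import signal
import subprocess
import sys


def stop_process(process, timeout=5.0):
    """Hypothetical helper: SIGTERM, bounded wait, SIGKILL, final wait."""
    if process.poll() is not None:
        # The process is already gone, nothing to signal.
        return process.returncode
    process.terminate()  # SIGTERM on POSIX
    try:
        process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # SIGTERM was not sufficient: escalate and wait until the
        # process is really terminated, so its name can be reused.
        process.kill()
        process.wait()
    return process.returncode


# A made-up child that ignores SIGTERM, to exercise the SIGKILL path.
child_code = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)
stubborn = subprocess.Popen([sys.executable, '-c', child_code],
                            stdout=subprocess.PIPE)
stubborn.stdout.readline()  # wait until the handler is installed
returncode = stop_process(stubborn, timeout=0.5)
stubborn.stdout.close()
```

The final `wait()` is the point of the change: returning only after the process is reaped is what prevents a following start with the same name from racing with the dying instance.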

Part of #157
If a test fails or reaches --test-timeout, some non-default tarantool
instances may be left running, so if the next test starts an instance
with the same name, it may fail with the 'The daemon is already running'
error.

This change follows the approach of the previous commit ('Wait until
residual servers will be stopped').

See a test case in PR #244 (app-tap/test-timeout.test.lua).

Part of #65
Part of #157
This way it is easier to identify the cause of a particular message
about stopping a server.

A supplementary instance that is started using the `test_run:cmd('start
server foo')` command is called non-default. In contrast, the instance
that executes the test's commands by default is called 'default'.

test-run stops non-default servers after each test.

test-run cleans up non-default servers (removes xlogs, snapshots) after
a successful test or after each test when --force is passed.
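
The stop and clean-up rules above can be summarized in a small sketch. It is an illustration only: `FakeServer` and `finalize_nondefault()` are hypothetical names and do not mirror test-run's actual classes.

```python
class FakeServer(object):
    """Hypothetical stand-in for a non-default server."""

    def __init__(self, name):
        self.name = name
        self.running = True
        self.has_artifacts = True  # xlogs, snapshots

    def stop(self):
        self.running = False

    def cleanup(self):
        self.has_artifacts = False


def finalize_nondefault(servers, test_passed, force=False):
    # Non-default servers are stopped after each test.
    for server in servers:
        server.stop()
    # Artifacts are removed after a successful test, or after any
    # test when --force is passed.
    if test_passed or force:
        for server in servers:
            server.cleanup()


failed_run = [FakeServer('foo')]
finalize_nondefault(failed_run, test_passed=False)

forced_run = [FakeServer('foo')]
finalize_nondefault(forced_run, test_passed=False, force=True)
```

After a failed test without --force the server is stopped, but its xlogs and snapshots stay in place for investigation.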

Part of #65
Part of #157
This way we decrease the probability of reaching --no-output-timeout
after --test-timeout. See the comment in LuaTest.execute() for details.

Part of #65
Part of #157
It is possible that a crash detector of a non-default instance calls
`kill_current_test()` and the greenlet that executes the current test
is killed. In this case it will not set the returncode.

See #252 for a reproducer.

Part of #65
Part of #157
`send_signal()`, `terminate()` or `kill()` methods of a subprocess.Popen
object raise OSError on Python 2, when the process does not exist and
Python knows about it. See:

 | >>> import subprocess
 | >>> p = subprocess.Popen(['true'])
 | >>> p.poll()
 | 0
 | >>> p.kill()
 | Traceback (most recent call last):
 |   File "<stdin>", line 1, in <module>
 |   File "/usr/lib64/python2.7/subprocess.py", line 1279, in kill
 |     self.send_signal(signal.SIGKILL)
 |   File "/usr/lib64/python2.7/subprocess.py", line 1269, in send_signal
 |     os.kill(self.pid, sig)
 | OSError: [Errno 3] No such process

Python 3 does not raise the exception in this case. Let's protect
ourselves from possible races / behaviour differences and silence the
exception.
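
A minimal sketch of such a guard is shown below. `send_signal_quietly()` is a hypothetical name, and ignoring only ESRCH ('No such process') while re-raising other errors is an assumption of this sketch, not necessarily what the commit does.

```python
import errno
import signal
import subprocess
import sys


def send_signal_quietly(process, signum):
    """Hypothetical helper: send a signal, ignore 'No such process'."""
    try:
        process.send_signal(signum)
    except OSError as e:
        # Python 2 raises when the process is already reaped; a race
        # with process termination may lead to the same error.
        if e.errno != errno.ESRCH:
            raise


finished = subprocess.Popen([sys.executable, '-c', 'pass'])
finished.wait()
# Does not raise on either Python version.
send_signal_quietly(finished, signal.SIGKILL)
```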

Part of #157
See a test case in #252. The idea is the following: the default server
(it is the app server) executes a script, which starts a non-default
server and hangs. The non-default instance fails after some time (before
--test-timeout), test-run's crash detector observes this and stops the
test (kills the greenlet). We kill all other non-default servers in this
case, but prior to this commit we didn't kill the app server itself.

Regarding the implementation: I updated AppServer.stop() to follow the
way TarantoolServer.stop() works: verbose logging, SIGKILL after a
timeout, waiting for process termination. I reused this code in the
finalization of an app test.

Introduced the AppTest.teardown() method. The idea is to have a method
where all finalization after AppTest.execute() occurs. Maybe we'll
refactor process management in the future, and the presence of an
explicit finalization method will make things clearer.

I think it is a good point to close #65 and #157: most of the cases
where tarantool instances are not terminated or killed are fixed. There
is kill_old_server(), which only sends SIGTERM, but we should not find
alive instances there after the 'Wait until residual servers will be
stopped' commit. There is #256, but it is unlikely to hit us in real
tests. If there are other cases of this kind, we should handle them
separately: file an issue with a reproducer, investigate and fix them. I
see no reason to track 'kill hung server' as a general task anymore.

Fixes #65
Fixes #157

@LeonidVas LeonidVas left a comment

LGTM.

@Totktonada Totktonada merged commit 584e273 into master Dec 18, 2020
@Totktonada Totktonada deleted the Totktonada/stop-instances-properly branch December 18, 2020 22:46
@Totktonada
Member Author

Updated the test-run submodule in tarantool in 2.7.0-116-g6e6e7b29b, 2.6.1-101-g13ff2ff2b, 2.5.2-65-gcdc10c888, 1.10.8-44-g0e95810dd.

Labels
bug Something isn't working
3 participants