Stop instances properly #257
Merged
Conversation
Otherwise, if a previous test fails or reaches --test-timeout, a subsequent test that tries to start a new instance with the same name may fail with the 'The daemon is already running' error. This commit fixes the problem for 'core = tarantool' test suites; the next commit will resolve it for 'core = app' test suites. See a test case in PR #244 (app/test-timeout.test.lua).

A few words about the details of the implementation. The killall_servers() method performs almost the same actions as stop_nondefault(), so I kept the latter and removed the former. However, there are differences:

- killall_servers() does not wait for termination of the processes. This is why the change fixes the given problem.
- killall_servers() sends SIGKILL, while stop_nondefault() stops instances as usual: SIGTERM, wait 5 seconds at most, SIGKILL, wait for termination.

I'll return the instant SIGKILL in one of the following commits: I split the changes for ease of reading.

Part of #157
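The graceful stop sequence described above (SIGTERM, a bounded wait, then SIGKILL, then wait for termination) can be sketched as follows. This is a minimal illustration, not test-run's actual code; the `stop_process` name and the 5-second default are assumptions based on the commit message:

```python
import signal
import subprocess

def stop_process(process, timeout=5):
    """Stop a child process gracefully, escalating to SIGKILL on timeout.

    Illustrative sketch of the SIGTERM / wait / SIGKILL / wait sequence;
    not taken from test-run's sources.
    """
    process.terminate()  # send SIGTERM first, give the process a chance
    try:
        process.wait(timeout=timeout)  # wait at most `timeout` seconds
    except subprocess.TimeoutExpired:
        process.kill()   # escalate to SIGKILL
        process.wait()   # always wait: avoid leaving a zombie behind
    return process.returncode

# Usage: `sleep` reacts to SIGTERM, so it stops within the timeout.
p = subprocess.Popen(['sleep', '60'])
rc = stop_process(p, timeout=1)
```

Waiting after the kill is the important part for this PR: killall_servers() skipped the wait, which is exactly why residual instances could survive into the next test.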
If a test fails or reaches --test-timeout, some non-default tarantool instances may be left running, so if a subsequent test starts an instance with the same name, it may fail with the 'The daemon is already running' error. This change follows the approach of the previous commit ('Wait until residual servers will be stopped'). See a test case in PR #244 (app-tap/test-timeout.test.lua).

Part of #65
Part of #157
This way it is easier to identify the cause of a particular message about stopping a server.

A supplementary instance that is started using the `test_run:cmd('start server foo')` command is called non-default. In contrast, the instance that executes a test's commands by default is called 'default'. test-run stops non-default servers after each test. test-run cleans up non-default servers (removes xlogs and snapshots) after a successful test, or after each test when --force is passed.

Part of #65
Part of #157
The `send_signal()`, `terminate()`, and `kill()` methods of a subprocess.Popen object raise OSError on Python 2 when the process does not exist and Python knows about it. See:

```
>>> import subprocess
>>> p = subprocess.Popen(['true'])
>>> p.poll()
0
>>> p.kill()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/subprocess.py", line 1279, in kill
    self.send_signal(signal.SIGKILL)
  File "/usr/lib64/python2.7/subprocess.py", line 1269, in send_signal
    os.kill(self.pid, sig)
OSError: [Errno 3] No such process
```

Python 3 does not raise the exception in this case. Let's protect ourselves from possible races / behaviour differences and silence the exception.

Part of #157
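A minimal sketch of the protection described above, assuming the intent is to ignore only the "no such process" error (ESRCH) and re-raise everything else; the `safe_kill` name is illustrative, not test-run's API:

```python
import errno
import subprocess

def safe_kill(process):
    """Kill a child process, tolerating the case where it already exited.

    Illustrative helper: Python 2's Popen.kill() raises OSError with
    errno ESRCH for a dead child, while Python 3 silently does nothing.
    """
    try:
        process.kill()
    except OSError as exc:
        # ESRCH: the process no longer exists; nothing to do.
        if exc.errno != errno.ESRCH:
            raise

# Usage: the child has certainly exited, yet safe_kill() must not raise.
p = subprocess.Popen(['true'])
p.wait()
safe_kill(p)
```

Checking `errno` rather than swallowing every OSError keeps unrelated failures (e.g. permission errors) visible.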
See a test case in #252. The idea is the following: the default server (it is the app server) executes a script, which starts a non-default server and hangs. The non-default instance fails after some time (before --test-timeout), test-run's crash detector observes this and stops the test (kills the greenlet). We kill all other non-default servers in this case, but prior to this commit, we did not kill the app server itself.

Regarding the implementation. I updated AppServer.stop() to follow the way TarantoolServer.stop() works: verbose logging, use SIGKILL after a timeout, wait for process termination. And I reused this code in the finalization of an app test.

Introduced the AppTest.teardown() method. The idea is to have a method where all finalization after AppTest.execute() occurs. Maybe we'll refactor process management in the future, and the presence of an explicit finalization method will make things more clear.

I think it is a good point to close #65 and #157: most of the cases where tarantool instances are not terminated or killed are fixed. There is kill_old_server(), which only sends SIGTERM, but we should not find alive instances there after the 'Wait until residual servers will be stopped' commit. There is #256, but it is unlikely to hit us on real tests. If there are other cases of this kind, we should handle them separately: file an issue with a reproducer, investigate, and fix them. I see no reason to track 'kill hung server' as a general task anymore.

Fixes #65
Fixes #157
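The teardown idea above can be sketched like this. All names and the constructor are hypothetical simplifications, not test-run's real classes; the point is only that one explicit method finalizes everything started by execute(), including the default app server itself:

```python
class AppTest:
    """Illustrative sketch of an app test with an explicit teardown()."""

    def __init__(self, default_server, nondefault_servers):
        # The 'default' server runs the test script; 'non-default'
        # servers are supplementary instances started by the test.
        self.server = default_server
        self.nondefault = nondefault_servers

    def teardown(self):
        # Finalize everything after execute(): stop supplementary
        # instances first, then the default app server, so that no
        # child process outlives the test.
        for srv in self.nondefault:
            srv.stop()
        self.server.stop()
```

Concentrating finalization in one method means the crash detector and the normal exit path can share the same cleanup, which is the motivation given in the commit message.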
avtikhon
approved these changes
Dec 16, 2020
LeonidVas
approved these changes
Dec 18, 2020
LGTM.
Updated the test-run submodule in tarantool in 2.7.0-116-g6e6e7b29b, 2.6.1-101-g13ff2ff2b, 2.5.2-65-gcdc10c888, 1.10.8-44-g0e95810dd.
An attempt to get rid of the 'The daemon is already running' error and hung instances after testing.
I use the test cases from PR #244 and tweak them to hang or fail instances at different points: before or after startup. Say, hang a default app server and fail a supplementary one after startup.
Follows up PR #244.
Fixes #65.
Fixes #157.
While looking into this, I found several doubtful situations and pieces of code: #252, #253, #254, #255, #256.