Add the 'timeout' option into 'stop server' command and kill a server… #186

rtokarev · 2019-08-19T08:15:07Z

… with 'SIGKILL' if it doesn't finish before the timeout expires.

By default, the drop_cluster() routine uses the SIGTERM signal to stop the
replicas. We found that in some situations SIGTERM couldn't kill
all instances on OSX and some processes were left running. To avoid such
situations, we additionally need to send the SIGKILL signal to all instances
that have not finished before the timeout expires, so that all of them
are stopped.
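The escalation described above can be sketched as follows. This is only an illustration, assuming Python 3 and a process started via `subprocess.Popen`; the function name `stop_with_timeout` and the default timeout are hypothetical and are not taken from the patch.

```python
import signal
import subprocess


def stop_with_timeout(proc, timeout=5.0):
    """Ask `proc` to stop with SIGTERM; fall back to SIGKILL on timeout.

    Hypothetical helper, not the test-run implementation.
    Returns True if SIGTERM was enough, False if SIGKILL had to be sent.
    """
    if proc.poll() is not None:
        # The process has already exited, nothing to do.
        return True

    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        # SIGTERM was ignored or handling got stuck: SIGKILL cannot be
        # caught or ignored by the process.
        proc.kill()
        proc.wait()
        return False
```

Applied to every instance of a cluster, such a helper guarantees that no process outlives the stop call, at the cost of skipping a graceful shutdown for the stuck ones.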

avtikhon · 2020-11-27T05:48:27Z

The PR was successfully checked over multiple runs in CI:
https://gitlab.com/tarantool/tarantool/-/pipelines?page=1&scope=all&ref=avtikhon%2Fflaky_stable_min_prs

It also handles the following better than the related PRs:

Dropping the cluster: compared with PR "Use SIGKILL in drop_cluster() routine" #232, it fixes the issue in the core of the function, so the fix also applies to the Python calls that depend on it, whereas PR #232 fixes only the single call that stops the cluster in the drop_cluster() routine.
Hanging instances on OSX: it fixes for OSX the same issue that PR "Move tarantoolctl to test-run tool submodule and add replication_sync_timeout to it" #242 fixes for Linux only, because it stops the instances that hang on OSX, which tarantoolctl could not do in PR #242.

Patch LGTM.

Totktonada · 2020-11-29T02:23:22Z

Part of #157 (solves the problem for 'core = tarantool' test suites).

When there is no output from workers for a long time (10 seconds by default or 60 seconds when the --long argument is passed), test-run prints a warning and shows the number of lines in the temporary result file. This helps to understand on which statement a test hangs.

I reproduced the problem by mangling tarantool to ignore the SIGTERM and SIGINT signals and running a simple 'core = tarantool' test. The test passes successfully, but the worker gets stuck waiting for the tarantool server to stop. This particular case should be resolved in PR #186, but only because the timeout for stopping the server is less than the warning delay. This assumption looks fragile, especially if we want to make some of those timeouts / delays configurable. Let's handle the situation when the file does not exist.

Found while looking into https://github.com/tarantool/tarantool/issues/5573

Fixes #245
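Handling a missing temporary result file, as proposed above, might look like the sketch below. It assumes Python 3; the function names `count_result_lines` and `describe_progress` and the wording of the messages are hypothetical and are not taken from test-run.

```python
def count_result_lines(path):
    """Return the number of lines in `path`, or None if it does not exist.

    Hypothetical helper, not the test-run implementation.
    """
    try:
        with open(path, 'rb') as result_file:
            return sum(1 for _ in result_file)
    except FileNotFoundError:
        # The temporary result file is already gone (or was never
        # created); report that instead of failing.
        return None


def describe_progress(path):
    """Build the text for the 'no output' warning (hypothetical wording)."""
    lines = count_result_lines(path)
    if lines is None:
        return 'The result file {} does not exist'.format(path)
    return 'The result file {} contains {} lines'.format(path, lines)
```

Returning None rather than 0 keeps "the file is missing" distinguishable from "the file is empty", which matters when the warning is used to locate the statement a test hangs on.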

Totktonada · 2020-11-29T23:39:01Z

This PR would help us to overcome https://github.com/tarantool/tarantool/issues/5573, which hits us in testing from time to time.

Now we have only one place where a [QA Notice] message is printed. But we'll have more, and it is convenient to have one function for this purpose. The function should be used in places where we observe incorrect behaviour of tarantool or a test, but want to lower it to a warning and proceed further. Of course, there should be a solid reason to do so.

Part of tarantool#157
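A single function for such messages could be as small as the sketch below. The signature, the per-line prefixing and the default output stream are assumptions made for illustration; the actual qa_notice() in the patch may differ.

```python
import sys


def qa_notice(message, stream=None):
    """Print `message` with a '[QA Notice]' prefix on every line.

    Illustrative sketch only: the signature and formatting are
    assumptions, not the function added by the patch.
    """
    if stream is None:
        # Resolve the default at call time so that a redirected
        # sys.stdout is respected.
        stream = sys.stdout
    for line in message.splitlines():
        stream.write('[QA Notice] {}\n'.format(line))
    stream.flush()
```

Prefixing every line keeps multi-line notices easy to find with a plain text search over the CI log.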

If a tarantool instance does not handle SIGTERM correctly, it is a bug in tarantool. A testing system should reveal problems rather than hide them. However, this patch does exactly that, and I'll explain why.

Many of our tests start and stop instances: they are written to spot different problems, but not this one. If everything except hanging at stop is okay, there is no reason to fail the test. In this case we print a notice on the terminal and proceed further.

In the future, when all problems of this kind are resolved, we can interpret the presence of such notices as an error. For now, however, it would not add anything to the quality of our testing.

We had a problem of this kind ([1]) in the past and now have another one ([2]). Until it is resolved, it is worth handling the situation on the testing system side.

[1]: tarantool/tarantool#4127
[2]: https://github.com/tarantool/tarantool/issues/5573

Part of tarantool#157

Co-authored-by: Alexander Turenko <[email protected]>

Totktonada · 2020-11-30T01:41:45Z

Updated the patch:

Rebased on the latest master.
Added a patch with the qa_notice() function that I want to use here.
Issue a QA notice in case of a stuck tarantool.
Define the kill() function within the stop() one, because the warning becomes specific for the situation in stop().
Removed timeout parameter from stop(), because it is not used anywhere for now.
Wrote more about the motivation in the commit message and added myself as a co-author.
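The structure described in the list above, with kill() defined inside stop() so that the notice can refer to the specific situation, can be sketched with a timer callback. This is an assumption-laden illustration for Python 3: the `notice` parameter stands in for qa_notice(), the `timeout` parameter is kept only to make the sketch easy to try (the updated patch removes it from stop()), and the message text is invented.

```python
import signal
import threading


def stop(proc, timeout=5.0, notice=print):
    """Stop a `subprocess.Popen` process; send SIGKILL if it gets stuck.

    Hypothetical sketch, not the code from the patch.
    """
    def kill():
        # Called from the timer thread, and only if the timer was not
        # cancelled, that is, the process did not stop in time.
        if proc.poll() is None:
            notice('The server does not stop during {} seconds after '
                   'SIGTERM. Sending SIGKILL...'.format(timeout))
            proc.kill()

    proc.send_signal(signal.SIGTERM)
    timer = threading.Timer(timeout, kill)
    timer.start()
    try:
        # Block until the process exits, either on its own or after
        # kill() has fired.
        proc.wait()
    finally:
        timer.cancel()
    return proc.returncode
```

Because kill() is a closure, it has access to everything stop() knows about the instance, so the notice can include details that a generic kill helper would not have.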

Added the 'test-timeout' option, which breaks the test process with a kill signal if the test runs longer than the given number of seconds. By default it is equal to 110 seconds. This value should be bigger than 'replication-sync-timeout' (which is 100 seconds by default) and lower than 'no-output-timeout' (which is 120 seconds by default).

This timeout helps to avoid tests hanging until 'no-output-timeout' is reached, at which point the whole testing exits. Now, if a test hangs, 'test-timeout' helps to terminate the test processes. It gives the test-run worker a chance to restart the failed test or to continue with the tests in the worker queue.

Before this fix, tests hung as in [1] and [2]; now the same issues are resolved, as in [3] and [4] respectively.

To reproduce issues like [2], set 'test-timeout' to a value that is not enough to complete the test on the 'restart server ...' command, like:

    ./test-run.py replication/quorum.test.lua --test-timeout 5 \
        --no-output-timeout 10 --conf memtx

The fix resolves issue #157 together with PR #186, which helps to kill the instances when SIGTERM couldn't do it.

Part of #157

[1] - https://gitlab.com/tarantool/tarantool/-/jobs/835734706#L4968
[2] - https://gitlab.com/tarantool/tarantool/-/jobs/822649038#L4835
[3] - https://gitlab.com/tarantool/tarantool/-/jobs/874058059#L4993
[4] - https://gitlab.com/tarantool/tarantool/-/jobs/874058745#L5316
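The recommended ordering of the three timeouts can be written down as a small check. The sketch below assumes Python 3; only the default values (100, 110 and 120 seconds) come from the message above, while the function name `check_timeout_order` and the wording of the returned messages are hypothetical.

```python
def check_timeout_order(replication_sync_timeout=100,
                        test_timeout=110,
                        no_output_timeout=120):
    """Return a list of violated ordering recommendations (hypothetical).

    The expected order is:
    replication-sync-timeout < test-timeout < no-output-timeout.
    """
    problems = []
    if test_timeout <= replication_sync_timeout:
        problems.append(
            "'test-timeout' ({}) should be bigger than "
            "'replication-sync-timeout' ({})".format(
                test_timeout, replication_sync_timeout))
    if test_timeout >= no_output_timeout:
        problems.append(
            "'test-timeout' ({}) should be lower than "
            "'no-output-timeout' ({})".format(
                test_timeout, no_output_timeout))
    return problems
```

With the values from the reproduction command above (`--test-timeout 5 --no-output-timeout 10`) the check reports that 'test-timeout' is not bigger than the default 'replication-sync-timeout', which is exactly the misconfiguration used to trigger the hang on purpose.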

avtikhon · 2020-11-30T05:02:26Z

The patch was successfully checked at [1], LGTM:

[1] - https://gitlab.com/tarantool/tarantool/-/pipelines/222917874