Add the 'timeout' option into 'stop server' command and kill a server… #186
The PR was successfully checked on multiple runs in CI: It also better resolves the issues:
Patch LGTM.
Part of #157 (solves the problem for 'core = tarantool' test suites).
When there is no output from workers for a long time (10 seconds by default, or 60 seconds when the --long argument is passed), test-run prints a warning and shows the number of lines in the temporary result file. This helps to understand on which statement a test hangs.

I reproduced the problem by mangling tarantool to ignore the SIGTERM and SIGINT signals and running a simple 'core = tarantool' test. The test passes, but the worker gets stuck waiting for the tarantool server to stop.

This particular case should be resolved by PR #186, but only because the timeout for stopping the server is less than the warning delay. This assumption looks fragile, especially if we want to make some of those timeouts / delays configurable. Let's handle the situation when the file does not exist.

Found while looking into https://github.com/tarantool/tarantool/issues/5573

Fixes #245
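A minimal sketch of the idea in this commit message, assuming the warning code counts lines by opening the temporary result file directly. The function names `count_result_lines` and `describe_progress` are hypothetical and are not taken from test-run itself:

```python
def count_result_lines(path):
    """Return the number of lines in the file, or None if it does not exist."""
    try:
        with open(path, 'rb') as f:
            return sum(1 for _ in f)
    except FileNotFoundError:
        # The test may already have finished and removed its temporary
        # result file while the worker is still waiting for the server
        # to stop. Report that instead of raising.
        return None


def describe_progress(path):
    """Build the progress part of a 'no output' warning (hypothetical text)."""
    lines = count_result_lines(path)
    if lines is None:
        return 'result file {} does not exist'.format(path)
    return 'result file {} has {} lines'.format(path, lines)
```

Returning `None` rather than `0` keeps "the file is empty" and "the file is gone" distinguishable in the warning.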
This PR would help us overcome https://github.com/tarantool/tarantool/issues/5573, which hits us in testing from time to time.
Now we have only one place where a [QA Notice] message is printed. But we'll have more, and it is convenient to have one function for this purpose.

The function should be used in places where we observe incorrect behaviour of tarantool or a test, but want to lower it to a warning and proceed further. Of course, there should be a solid reason to do so.

Part of tarantool#157
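As an illustration only, such a helper could look like the sketch below. The names `format_qa_notice` and `qa_notice` and the exact output layout are assumptions, not the function added by the patch:

```python
import sys


def format_qa_notice(message):
    """Prefix every line of the message with the [QA Notice] marker."""
    return '\n'.join('[QA Notice] ' + line for line in message.splitlines())


def qa_notice(message, stream=None):
    """Write the notice to the given stream (stderr by default) and proceed."""
    stream = stream if stream is not None else sys.stderr
    stream.write(format_qa_notice(message) + '\n')
    stream.flush()
```

Keeping the marker in one place makes the notices easy to grep for later, when their presence may start to be treated as an error.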
0aa92c7 to 5d69357
If a tarantool instance does not handle SIGTERM correctly, it is a bug in tarantool. A testing system should reveal problems rather than hide them. However, this patch does hide one, and I'll explain why.

Many of our tests start and stop instances: they are written to spot different problems, but not this one. If everything except hanging at stop is okay, there is no reason to fail the test. In this case we print a notice on the terminal and proceed further.

In the future, when all problems of this kind are resolved, we can interpret the presence of such notices as an error. For now, however, it would not add anything to the quality of our testing.

We had a problem of this kind ([1]) in the past and now have another one ([2]). Until it is resolved, it is worth handling the situation on the testing system side.

[1]: tarantool/tarantool#4127
[2]: https://github.com/tarantool/tarantool/issues/5573

Part of tarantool#157

Co-authored-by: Alexander Turenko <[email protected]>
5d69357 to cf397eb
Updated the patch:
Added the 'test-timeout' option, which makes it possible to break the test process with a kill signal if the test runs longer than the given number of seconds. By default it is equal to 110 seconds. This value should be bigger than 'replication-sync-timeout' (which is 100 seconds by default) and lower than 'no-output-timeout' (which is 120 seconds by default).

This timeout helps to avoid tests hanging until 'no-output-timeout' is reached, at which point the whole testing exits. Now, if a test hangs, 'test-timeout' helps to exit the test processes. It gives the test-run worker a chance either to restart the failed test or to continue with the tests in the worker queue.

Before this fix, tests hung, as in [1] and [2]; now the same issues are resolved, as in [3] and [4] respectively.

To reproduce issues like [2], set 'test-timeout' too low to complete the test on a 'restart server ...' command, for example:

    ./test-run.py replication/quorum.test.lua --test-timeout 5 \
        --no-output-timeout 10 --conf memtx

The fix resolves issue #157 together with PR #186, which helps to kill the instances when SIGTERM could not do it.

Part of #157

[1] - https://gitlab.com/tarantool/tarantool/-/jobs/835734706#L4968
[2] - https://gitlab.com/tarantool/tarantool/-/jobs/822649038#L4835
[3] - https://gitlab.com/tarantool/tarantool/-/jobs/874058059#L4993
[4] - https://gitlab.com/tarantool/tarantool/-/jobs/874058745#L5316
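The ordering constraint between the three timeouts can be expressed as a small check. This is a sketch under the assumption that the values are plain numbers of seconds; the function `check_timeouts` is hypothetical and is not part of test-run:

```python
def check_timeouts(replication_sync_timeout=100, test_timeout=110,
                   no_output_timeout=120):
    """Return a list of violated constraints (empty if the order is sane).

    Defaults are the values named in the commit message above.
    """
    problems = []
    if test_timeout <= replication_sync_timeout:
        problems.append('test-timeout ({}) should be bigger than '
                        'replication-sync-timeout ({})'.format(
                            test_timeout, replication_sync_timeout))
    if test_timeout >= no_output_timeout:
        problems.append('test-timeout ({}) should be lower than '
                        'no-output-timeout ({})'.format(
                            test_timeout, no_output_timeout))
    return problems
```

With the defaults the list is empty. The reproducer values (`--test-timeout 5 --no-output-timeout 10`) deliberately violate the first constraint, which is why the test gets killed in the middle of a 'restart server ...' command.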
The patch was successfully checked at [1], LGTM.

[1] - https://gitlab.com/tarantool/tarantool/-/pipelines/222917874
LGTM.
Limit waiting for a tarantool process termination to 5 seconds. When this timeout is exceeded, print a warning to the terminal and send SIGKILL to the process.

We need to handle the situation with a stuck tarantool process on the testing system side to overcome a problem of this kind that appears on Mac OS (see #5573).

This changeset handles one particular case: stopping a tarantool instance that was either started to execute a 'core = tarantool' test suite or started from a test using the `test_run:cmd('start server foo')` command. It does not handle stopping a tarantool that was started to execute a 'core = app' test or started from a test directly using io.popen() or the built-in 'popen' module.

Related to #5573
Part of tarantool/test-run#157

The changeset: tarantool/test-run#186
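The stop sequence described above (SIGTERM, a bounded wait, then SIGKILL) can be sketched with the standard `subprocess` module. This is not the code from the patch: the function name `stop_process`, the notice text, and the use of `subprocess.Popen` are assumptions made for illustration, and the sketch is POSIX only.

```python
import signal
import subprocess
import sys


def stop_process(proc, timeout=5.0):
    """Stop a subprocess.Popen instance.

    Send SIGTERM and wait up to `timeout` seconds. If the process is
    still alive, print a notice and send SIGKILL.
    Return True if SIGKILL was needed, False otherwise.
    """
    if proc.poll() is not None:
        return False  # already exited, nothing to do

    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return False
    except subprocess.TimeoutExpired:
        sys.stderr.write(
            '[QA Notice] process {} did not stop within {} seconds '
            'after SIGTERM, sending SIGKILL\n'.format(proc.pid, timeout))
        proc.kill()   # SIGKILL on POSIX
        proc.wait()   # reap the process so no zombie is left
        return True
```

The boolean result lets the caller decide whether the forced kill should stay a warning or, later on, be treated as a test failure.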
Limit waiting for a tarantool process termination to 5 seconds. When this timeout is exceeded, print a warning to the terminal and send SIGKILL to the process.

We need to handle the situation with a stuck tarantool process on the testing system side to overcome a problem of this kind that appears on Mac OS (see #5573).

This changeset handles one particular case: stopping a tarantool instance that was either started to execute a 'core = tarantool' test suite or started from a test using the `test_run:cmd('start server foo')` command. It does not handle stopping a tarantool that was started to execute a 'core = app' test or started from a test directly using io.popen() or the built-in 'popen' module.

Related to #5573
Part of tarantool/test-run#157

The changeset: tarantool/test-run#186

(cherry picked from commit 24f57a3)
Updated the test-run submodule in tarantool in the following commits: 2.7.0-83-g24f57a353, 2.6.1-70-g5e660f7cd, 2.5.2-44-g85eb5eea6, 1.10.8-31-g1e01031e9.
Added the 'test-timeout' option, which makes it possible to break the test process with a kill signal if the test runs longer than the given number of seconds. By default it is equal to 110 seconds. This value should be bigger than 'replication-sync-timeout' (which is 100 seconds by default) and lower than 'no-output-timeout' (which is 120 seconds by default).

This timeout helps to avoid tests hanging until 'no-output-timeout' is reached, at which point the whole testing exits. Now, if a test hangs, 'test-timeout' helps to exit the test processes. It gives the test-run worker a chance either to restart the failed test or to continue with the tests in the worker queue.

Before this fix, tests hung, as in [1] and [2]; now the same issues are resolved, as in [3] and [4] respectively.

To reproduce issues like [2], set 'test-timeout' too low to complete the test on a 'restart server ...' command, for example:

    ./test-run.py replication/quorum.test.lua --test-timeout 5 \
        --no-output-timeout 10 --conf memtx

This commit finally resolves the problem of a stuck tarantool that is not terminated after SIGTERM. PR #186 fixes the first part: TarantoolServer. Now the AppServer case is handled too.

Fixes #157

[1] - https://gitlab.com/tarantool/tarantool/-/jobs/835734706#L4968
[2] - https://gitlab.com/tarantool/tarantool/-/jobs/822649038#L4835
[3] - https://gitlab.com/tarantool/tarantool/-/jobs/874058059#L4993
[4] - https://gitlab.com/tarantool/tarantool/-/jobs/874058745#L5316
… with
'SIGKILL' if it has not finished before the timeout expires.
By default, the drop_cluster() routine uses the SIGTERM signal to stop the
replicas. It was found that in some situations SIGTERM could not kill
all instances on OSX, and some processes were left running. To avoid such
situations, additionally send the SIGKILL signal to all instances
that have not finished before the timeout expires, so that all of them
are stopped.