-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add test_timeout to limit test run time #244
Conversation
a45a988
to
16ff53d
Compare
Made all changes as you suggested, please review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good idea.
16ff53d
to
258d81c
Compare
I rebased the PR on top of the current master and fixed the problem with mixed temporary result files. I still see problems around stopping non-default instances (both for core = 'tarantool' and core = 'app' tests). I'll looks around those problems and likely will fix them in the following PRs. However I'll investigate them first, only after this I'll able to say whether this PR is okay to push. |
TL;DR: I found several problems around killing and waiting of instances and I'll prepare fixes for them. I don't know whether it is better to let them be in a separate PR or add here: I'll think. However the current patch definitely doesn't close #157, since I'm able to get stuck instances. Anyway, the PR still on my side. Simple test cases: $ cat test/app-tap/test-timeout.test.lua
#!/usr/bin/env tarantool
local tap = require('tap')
local test_run = require('test_run').new()
test_run:cmd("create server foo with script='app-tap/debug/server.lua'")
test_run:cmd('start server foo')
local test = tap.test('test-timeout')
test:plan(1)
while true do end
test:ok('passed')
test_run:cmd('stop server foo')
test_run:cmd('cleanup server foo')
test_run:cmd('delete server foo')
$ cat test/app/test-timeout.test.lua
test_run = require('test_run').new()
SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'}
test_run:create_cluster(SERVERS, "replication", {args='2 0.4'})
test_run:wait_fullmesh(SERVERS)
while true do end
test_run:drop_cluster(SERVERS)
$ cat test/app/test-timeout.result
-- test-run result file version 2
test_run = require('test_run').new()
| ---
| ...
SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'}
| ---
| ...
test_run:create_cluster(SERVERS, "replication", {args='2 0.4'})
| ---
| ...
test_run:wait_fullmesh(SERVERS)
| ---
| ...
test_run:drop_cluster(SERVERS)
| ---
| ... Run app/test-timeout.test.lua: $ ./test/test-run.py --test-timeout 5 -j16 --force $(yes app/test-timeout.test.lua | head -n 100)
<...>
[002] [Instance "election_replica1" returns with non-zero exit code: 1]
<...>
[002] Starting instance election_replica1...
[002] The daemon is already running: PID 22915
[002] [ fail ]
$ ps ax | grep [t]arantool
<empty> The reason is lack of waiting of a process to stop in killall_servers(), so the diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index bdbc935..2d8ed34 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -335,37 +335,6 @@ class LuaTest(Test):
# one.
ts.stop_nondefault()
- def killall_servers(self, server, ts, crash_occured):
- """ kill all servers and crash detectors before stream swap """
- color_log('Kill all servers ...\n', schema='info')
- server_list = ts.servers.values() + [server, ]
- check_list = [s for s in server_list if 'process' in s.__dict__]
-
- # check that all servers stopped correctly
- for server in check_list:
- bad_returncode = server.process.returncode not in (None,
- 0,
- -signal.SIGKILL,
- -signal.SIGTERM)
- # if non-default server crashed but it was expected
- # don't kill the default server and crash detectors
- crash_occured = crash_occured or \
- bad_returncode and not server.crash_expected
-
- for server in check_list:
- server.process.poll()
-
- if crash_occured:
- # kill all servers and crash detectors on crash
- if server.process.returncode is None:
- server.process.kill()
- if server.crash_detector is not None:
- gevent.kill(server.crash_detector)
- elif server.process.returncode is not None:
- # join crash detectors of stopped servers
- if server.crash_detector is not None:
- save_join(server.crash_detector)
-
def execute(self, server):
super(LuaTest, self).execute(server)
cls_name = server.__class__.__name__.lower()
@@ -386,21 +355,19 @@ class LuaTest(Test):
lua = TestRunGreenlet(self.exec_loop, ts)
self.current_test_greenlet = lua
lua.start()
- crash_occured = True
try:
- crash_occured = save_join(lua, timeout=Options().args.test_timeout)
- self.killall_servers(server, ts, crash_occured)
+ save_join(lua, timeout=Options().args.test_timeout)
except KeyboardInterrupt:
# prevent tests greenlet from writing to the real stdout
lua.kill()
-
- ts.stop_nondefault()
raise
except TarantoolStartError as e:
color_stdout('\n[Instance "{0}"] Failed to start tarantool '
'instance "{1}"\n'.format(server.name, e.name),
schema='error')
server.kill_current_test()
+ finally:
+ ts.stop_nondefault()
class PythonTest(Test): Run app-tap/test-timeout.test.lua: $ ./test/test-run.py --test-timeout 5 -j16 --force $(yes app-tap/test-timeout.test.lua | head -n 100)
<...>
[016] app-tap/test-timeout.test.lua
[016] Test.run() received the following error:
[016] Traceback (most recent call last):
[016] File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/test.py", line 182, in run
[016] self.execute(server)
[016] File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/app_server.py", line 73, in execute
[016] if retval['returncode'] != 0:
[016] KeyError: 'returncode'
[016]
[016] [ fail ]
<...>
$ ps ax | grep [t]arantool
19320 pts/2 Sl 0:00 tarantool server.lua <running>
<..and 15 more like it..>
19543 pts/2 Rl 0:02 tarantool /home/alex/projects/tarantool-meta/r/t-3/test/app-tap/test-timeout.test.lua
<..and 49 more like it..> The reason is the same 'The daemon is already running' error, which can be found in logs. The following patch fixes the problem: diff --git a/lib/app_server.py b/lib/app_server.py
index a02200b..de485d2 100644
--- a/lib/app_server.py
+++ b/lib/app_server.py
@@ -70,7 +70,9 @@ class AppTest(Test):
except TarantoolStartError:
# A non-default server failed to start.
raise TestExecutionError
- if retval['returncode'] != 0:
+ finally:
+ ts.stop_nondefault()
+ if retval.get('returncode', None) != 0:
raise TestExecutionError |
Added 'test-timeout' option to be able to break the test process with kill signal if the test runs longer than this amount of seconds. By default it is equal to 110 seconds. This value should be bigger than 'replication-sync-timeout' (which is 100 seconds by default) and lower than 'no-output-timeout' (which is 120 seconds by default). This timeout helped to avoid of issues with hanging tests till reach of 'no-output-timeout' timeout, when overall testing exits. For now if the test hangs than 'test-timeout' timeout helps to exit the test processes. It gives the test-run worker chance to restart the failed test either continue tests in worker queue. Before this fix tests, hanged, like [1] and [2], for now the same issues resolved, like [3] and [4] appropriate. To reproduce the issues like [2], try to set 'test-timeout' not enough to complete the test on 'restart server ...' command, like: ./test-run.py replication/quorum.test.lua --test-timeout 5 \ --no-output-timeout 10 --conf memtx This commit implements terminating of stuck AppServer instances by SIGKILL. However there are still problems regarding stopping and waiting of non-default instances. They will be resolved in the following PRs / commits. See PR #244 for details. Part of #157 [1] - https://gitlab.com/tarantool/tarantool/-/jobs/835734706#L4968 [2] - https://gitlab.com/tarantool/tarantool/-/jobs/822649038#L4835 [3] - https://gitlab.com/tarantool/tarantool/-/jobs/874058059#L4993 [4] - https://gitlab.com/tarantool/tarantool/-/jobs/874058745#L5316
258d81c
to
72ca19a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with the patch.
As the tests above show, it does not resolve #157, so I changed the PR description and the commit message accordingly.
I'll send the corresponding fixes in a separate PR. I'll not update the submodule in tarantool after this patch. But I'll update it after the fixes.
Otherwise, if a next test will try to start a new instance with the same name, it may fail with the 'The daemon is already running' error, if a previous test fails or reaches --test-timeout. The commit fixes the problem for 'core = tarantool' test suites. The next commit will resolve it for 'core = app' test suites. See a test case in PR #244 (app/test-timeout.test.lua). A few words about details of the implementation. The killall_servers() method performs almost the same actions that stop_nondefault(), so I kept the latter and removed the former. However there are differences: - killall_servers() does not wait for termination of processes. This is why the change fixes given problem. - killall_servers() sends SIGKILL, while stop_nondefault() stop instances as usual: SIGTERM, wait 5 seconds at max, SIGKILL, wait for termination. I'll return instant SIGKILL in one of the following commits: I splitted the changes for ease reading. Part of #157
If a test fails or reaches --test-timeout, some non-default tarantool instances may be left running and so if a next test starts an instance with the same name, it may fail with the 'The daemon is already running' error. This change follows the approach of the previous commit ('Wait until residual servers will be stopped'). See a test case in PR #244 (app-tap/test-timeout.test.lua). Part of #65 Part of #157
Otherwise, if a next test will try to start a new instance with the same name, it may fail with the 'The daemon is already running' error, if a previous test fails or reaches --test-timeout. The commit fixes the problem for 'core = tarantool' test suites. The next commit will resolve it for 'core = app' test suites. See a test case in PR #244 (app/test-timeout.test.lua). A few words about details of the implementation. The killall_servers() method performs almost the same actions that stop_nondefault(), so I kept the latter and removed the former. However there are differences: - killall_servers() does not wait for termination of processes. This is why the change fixes given problem. - killall_servers() sends SIGKILL, while stop_nondefault() stop instances as usual: SIGTERM, wait 5 seconds at max, SIGKILL, wait for termination. I'll return instant SIGKILL in one of the following commits: I splitted the changes for ease reading. Part of #157
If a test fails or reaches --test-timeout, some non-default tarantool instances may be left running and so if a next test starts an instance with the same name, it may fail with the 'The daemon is already running' error. This change follows the approach of the previous commit ('Wait until residual servers will be stopped'). See a test case in PR #244 (app-tap/test-timeout.test.lua). Part of #65 Part of #157
Otherwise, if a next test will try to start a new instance with the same name, it may fail with the 'The daemon is already running' error, if a previous test fails or reaches --test-timeout. The commit fixes the problem for 'core = tarantool' test suites. The next commit will resolve it for 'core = app' test suites. See a test case in PR #244 (app/test-timeout.test.lua). A few words about details of the implementation. The killall_servers() method performs almost the same actions that stop_nondefault(), so I kept the latter and removed the former. However there are differences: - killall_servers() does not wait for termination of processes. This is why the change fixes given problem. - killall_servers() sends SIGKILL, while stop_nondefault() stop instances as usual: SIGTERM, wait 5 seconds at max, SIGKILL, wait for termination. I'll return instant SIGKILL in one of the following commits: I splitted the changes for ease reading. Part of #157
If a test fails or reaches --test-timeout, some non-default tarantool instances may be left running and so if a next test starts an instance with the same name, it may fail with the 'The daemon is already running' error. This change follows the approach of the previous commit ('Wait until residual servers will be stopped'). See a test case in PR #244 (app-tap/test-timeout.test.lua). Part of #65 Part of #157
Otherwise, if a next test will try to start a new instance with the same name, it may fail with the 'The daemon is already running' error, if a previous test fails or reaches --test-timeout. The commit fixes the problem for 'core = tarantool' test suites. The next commit will resolve it for 'core = app' test suites. See a test case in PR #244 (app/test-timeout.test.lua). A few words about details of the implementation. The killall_servers() method performs almost the same actions that stop_nondefault(), so I kept the latter and removed the former. However there are differences: - killall_servers() does not wait for termination of processes. This is why the change fixes given problem. - killall_servers() sends SIGKILL, while stop_nondefault() stop instances as usual: SIGTERM, wait 5 seconds at max, SIGKILL, wait for termination. I'll return instant SIGKILL in one of the following commits: I splitted the changes for ease reading. Part of #157
If a test fails or reaches --test-timeout, some non-default tarantool instances may be left running and so if a next test starts an instance with the same name, it may fail with the 'The daemon is already running' error. This change follows the approach of the previous commit ('Wait until residual servers will be stopped'). See a test case in PR #244 (app-tap/test-timeout.test.lua). Part of #65 Part of #157
Updated the test-run submodule in tarantool in 2.7.0-116-g6e6e7b29b, 2.6.1-101-g13ff2ff2b, 2.5.2-65-gcdc10c888, 1.10.8-44-g0e95810dd. |
Added 'test-timeout' option to be able to break the test process with
kill signal if the test runs longer than this amount of seconds. By
default it is equal to 110 seconds. This value should be bigger than
'replication-sync-timeout' (which is 100 seconds by default) and
lower than 'no-output-timeout' (which is 120 seconds by default).
This timeout helped to avoid of issues with hanging tests till reach
of 'no-output-timeout' timeout, when overall testing exits. For now
if the test hangs than 'test-timeout' timeout helps to exit the test
processes. It gives the test-run worker chance to restart the failed
test either continue tests in worker queue. Before this fix tests,
hanged, like [1] and [2], for now the same issues resolved, like [3]
and [4] appropriate.
To reproduce the issues like [2], try to set 'test-timeout' not enough
to complete the test on 'restart server ...' command, like:
./test-run.py replication/quorum.test.lua --test-timeout 5
--no-output-timeout 10 --conf memtx
This commit implements terminating of stuck AppServer instances by
SIGKILL. However there are still problems regarding stopping and waiting
of non-default instances. They will be resolved in the following
commits. See PR #244 for details.
Part of #157
[1] - https://gitlab.com/tarantool/tarantool/-/jobs/835734706#L4968
[2] - https://gitlab.com/tarantool/tarantool/-/jobs/822649038#L4835
[3] - https://gitlab.com/tarantool/tarantool/-/jobs/874058059#L4993
[4] - https://gitlab.com/tarantool/tarantool/-/jobs/874058745#L5316