Add test_timeout to limit test run time #244

avtikhon · 2020-11-26T13:24:15Z

Added 'test-timeout' option to be able to break the test process with
kill signal if the test runs longer than this amount of seconds. By
default it is equal to 110 seconds. This value should be bigger than
'replication-sync-timeout' (which is 100 seconds by default) and
lower than 'no-output-timeout' (which is 120 seconds by default).

This timeout helped to avoid of issues with hanging tests till reach
of 'no-output-timeout' timeout, when overall testing exits. For now
if the test hangs than 'test-timeout' timeout helps to exit the test
processes. It gives the test-run worker chance to restart the failed
test either continue tests in worker queue. Before this fix tests,
hanged, like [1] and [2], for now the same issues resolved, like [3]
and [4] appropriate.

To reproduce the issues like [2], try to set 'test-timeout' not enough
to complete the test on 'restart server ...' command, like:

./test-run.py replication/quorum.test.lua --test-timeout 5
--no-output-timeout 10 --conf memtx

This commit implements terminating of stuck AppServer instances by
SIGKILL. However there are still problems regarding stopping and waiting
of non-default instances. They will be resolved in the following
commits. See PR #244 for details.

Part of #157

[1] - https://gitlab.com/tarantool/tarantool/-/jobs/835734706#L4968
[2] - https://gitlab.com/tarantool/tarantool/-/jobs/822649038#L4835
[3] - https://gitlab.com/tarantool/tarantool/-/jobs/874058059#L4993
[4] - https://gitlab.com/tarantool/tarantool/-/jobs/874058745#L5316

lib/tarantool_server.py

lib/app_server.py

Totktonada · 2020-11-30T02:19:58Z

Together with PR #186 is should resolve #157. Please, reflect this fact in the commit message.

avtikhon · 2020-11-30T04:58:54Z

Made all changes as you suggested, please review.

lib/tarantool_server.py

LeonidVas

I think this is a good idea.

lib/app_server.py

lib/tarantool_server.py

Totktonada · 2020-12-06T10:27:20Z

I rebased the PR on top of the current master and fixed the problem with mixed temporary result files. I still see problems around stopping non-default instances (both for core = 'tarantool' and core = 'app' tests). I'll looks around those problems and likely will fix them in the following PRs. However I'll investigate them first, only after this I'll able to say whether this PR is okay to push.

Totktonada · 2020-12-07T02:41:12Z

TL;DR: I found several problems around killing and waiting of instances and I'll prepare fixes for them. I don't know whether it is better to let them be in a separate PR or add here: I'll think. However the current patch definitely doesn't close #157, since I'm able to get stuck instances.

Anyway, the PR still on my side.

Simple test cases:

$ cat test/app-tap/test-timeout.test.lua
#!/usr/bin/env tarantool

local tap = require('tap')
local test_run = require('test_run').new()

test_run:cmd("create server foo with script='app-tap/debug/server.lua'")
test_run:cmd('start server foo')

local test = tap.test('test-timeout')
test:plan(1)

while true do end

test:ok('passed')

test_run:cmd('stop server foo')
test_run:cmd('cleanup server foo')
test_run:cmd('delete server foo')

$ cat test/app/test-timeout.test.lua
test_run = require('test_run').new()

SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'}
test_run:create_cluster(SERVERS, "replication", {args='2 0.4'})
test_run:wait_fullmesh(SERVERS)

while true do end

test_run:drop_cluster(SERVERS)

$ cat test/app/test-timeout.result
-- test-run result file version 2
test_run = require('test_run').new()
 | ---
 | ...

SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'}
 | ---
 | ...
test_run:create_cluster(SERVERS, "replication", {args='2 0.4'})
 | ---
 | ...
test_run:wait_fullmesh(SERVERS)
 | ---
 | ...

test_run:drop_cluster(SERVERS)
 | ---
 | ...

Run app/test-timeout.test.lua:

$ ./test/test-run.py --test-timeout 5 -j16 --force $(yes app/test-timeout.test.lua | head -n 100)
<...>
[002] [Instance "election_replica1" returns with non-zero exit code: 1]
<...>
[002] Starting instance election_replica1...
[002] The daemon is already running: PID 22915
[002] [ fail ]

$ ps ax | grep [t]arantool
<empty>

The reason is lack of waiting of a process to stop in killall_servers(), so the The daemon is already running error occurs. However I don't see any reason to duplicate the logic of TestState.stop_nondefault(), which is called at end of successful execution of a test. So, I would fix it this way:

diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index bdbc935..2d8ed34 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -335,37 +335,6 @@ class LuaTest(Test):
         # one.
         ts.stop_nondefault()
 
-    def killall_servers(self, server, ts, crash_occured):
-        """ kill all servers and crash detectors before stream swap """
-        color_log('Kill all servers ...\n', schema='info')
-        server_list = ts.servers.values() + [server, ]
-        check_list = [s for s in server_list if 'process' in s.__dict__]
-
-        # check that all servers stopped correctly
-        for server in check_list:
-            bad_returncode = server.process.returncode not in (None,
-                                                               0,
-                                                               -signal.SIGKILL,
-                                                               -signal.SIGTERM)
-            # if non-default server crashed but it was expected
-            # don't kill the default server and crash detectors
-            crash_occured = crash_occured or \
-                bad_returncode and not server.crash_expected
-
-        for server in check_list:
-            server.process.poll()
-
-            if crash_occured:
-                # kill all servers and crash detectors on crash
-                if server.process.returncode is None:
-                    server.process.kill()
-                if server.crash_detector is not None:
-                    gevent.kill(server.crash_detector)
-            elif server.process.returncode is not None:
-                # join crash detectors of stopped servers
-                if server.crash_detector is not None:
-                    save_join(server.crash_detector)
-
     def execute(self, server):
         super(LuaTest, self).execute(server)
         cls_name = server.__class__.__name__.lower()
@@ -386,21 +355,19 @@ class LuaTest(Test):
         lua = TestRunGreenlet(self.exec_loop, ts)
         self.current_test_greenlet = lua
         lua.start()
-        crash_occured = True
         try:
-            crash_occured = save_join(lua, timeout=Options().args.test_timeout)
-            self.killall_servers(server, ts, crash_occured)
+            save_join(lua, timeout=Options().args.test_timeout)
         except KeyboardInterrupt:
             # prevent tests greenlet from writing to the real stdout
             lua.kill()
-
-            ts.stop_nondefault()
             raise
         except TarantoolStartError as e:
             color_stdout('\n[Instance "{0}"] Failed to start tarantool '
                          'instance "{1}"\n'.format(server.name, e.name),
                          schema='error')
             server.kill_current_test()
+        finally:
+            ts.stop_nondefault()
 
 
 class PythonTest(Test):

Run app-tap/test-timeout.test.lua:

$ ./test/test-run.py --test-timeout 5 -j16 --force $(yes app-tap/test-timeout.test.lua | head -n 100)
<...>
[016] app-tap/test-timeout.test.lua                                   
[016] Test.run() received the following error:
[016] Traceback (most recent call last):
[016]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/test.py", line 182, in run
[016]     self.execute(server)
[016]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/app_server.py", line 73, in execute
[016]     if retval['returncode'] != 0:
[016] KeyError: 'returncode'
[016] 
[016] [ fail ]
<...>

$ ps ax | grep [t]arantool
19320 pts/2    Sl     0:00 tarantool server.lua <running>
<..and 15 more like it..>
19543 pts/2    Rl     0:02 tarantool /home/alex/projects/tarantool-meta/r/t-3/test/app-tap/test-timeout.test.lua
<..and 49 more like it..>

The reason is the same 'The daemon is already running' error, which can be found in logs. The following patch fixes the problem:

diff --git a/lib/app_server.py b/lib/app_server.py
index a02200b..de485d2 100644
--- a/lib/app_server.py
+++ b/lib/app_server.py
@@ -70,7 +70,9 @@ class AppTest(Test):
         except TarantoolStartError:
             # A non-default server failed to start.
             raise TestExecutionError
-        if retval['returncode'] != 0:
+        finally:
+            ts.stop_nondefault()
+        if retval.get('returncode', None) != 0:
             raise TestExecutionError

Added 'test-timeout' option to be able to break the test process with kill signal if the test runs longer than this amount of seconds. By default it is equal to 110 seconds. This value should be bigger than 'replication-sync-timeout' (which is 100 seconds by default) and lower than 'no-output-timeout' (which is 120 seconds by default). This timeout helped to avoid of issues with hanging tests till reach of 'no-output-timeout' timeout, when overall testing exits. For now if the test hangs than 'test-timeout' timeout helps to exit the test processes. It gives the test-run worker chance to restart the failed test either continue tests in worker queue. Before this fix tests, hanged, like [1] and [2], for now the same issues resolved, like [3] and [4] appropriate. To reproduce the issues like [2], try to set 'test-timeout' not enough to complete the test on 'restart server ...' command, like: ./test-run.py replication/quorum.test.lua --test-timeout 5 \ --no-output-timeout 10 --conf memtx This commit implements terminating of stuck AppServer instances by SIGKILL. However there are still problems regarding stopping and waiting of non-default instances. They will be resolved in the following PRs / commits. See PR #244 for details. Part of #157 [1] - https://gitlab.com/tarantool/tarantool/-/jobs/835734706#L4968 [2] - https://gitlab.com/tarantool/tarantool/-/jobs/822649038#L4835 [3] - https://gitlab.com/tarantool/tarantool/-/jobs/874058059#L4993 [4] - https://gitlab.com/tarantool/tarantool/-/jobs/874058745#L5316

Totktonada

I'm okay with the patch.

As the tests above show, it does not resolve #157, so I changed the PR description and the commit message accordingly.

I'll send the corresponding fixes in a separate PR. I'll not update the submodule in tarantool after this patch. But I'll update it after the fixes.

Otherwise, if a next test will try to start a new instance with the same name, it may fail with the 'The daemon is already running' error, if a previous test fails or reaches --test-timeout. The commit fixes the problem for 'core = tarantool' test suites. The next commit will resolve it for 'core = app' test suites. See a test case in PR #244 (app/test-timeout.test.lua). A few words about details of the implementation. The killall_servers() method performs almost the same actions that stop_nondefault(), so I kept the latter and removed the former. However there are differences: - killall_servers() does not wait for termination of processes. This is why the change fixes given problem. - killall_servers() sends SIGKILL, while stop_nondefault() stop instances as usual: SIGTERM, wait 5 seconds at max, SIGKILL, wait for termination. I'll return instant SIGKILL in one of the following commits: I splitted the changes for ease reading. Part of #157

If a test fails or reaches --test-timeout, some non-default tarantool instances may be left running and so if a next test starts an instance with the same name, it may fail with the 'The daemon is already running' error. This change follows the approach of the previous commit ('Wait until residual servers will be stopped'). See a test case in PR #244 (app-tap/test-timeout.test.lua). Part of #65 Part of #157

Otherwise, if a next test will try to start a new instance with the same name, it may fail with the 'The daemon is already running' error, if a previous test fails or reaches --test-timeout. The commit fixes the problem for 'core = tarantool' test suites. The next commit will resolve it for 'core = app' test suites. See a test case in PR #244 (app/test-timeout.test.lua). A few words about details of the implementation. The killall_servers() method performs almost the same actions that stop_nondefault(), so I kept the latter and removed the former. However there are differences: - killall_servers() does not wait for termination of processes. This is why the change fixes given problem. - killall_servers() sends SIGKILL, while stop_nondefault() stop instances as usual: SIGTERM, wait 5 seconds at max, SIGKILL, wait for termination. I'll return instant SIGKILL in one of the following commits: I splitted the changes for ease reading. Part of #157

If a test fails or reaches --test-timeout, some non-default tarantool instances may be left running and so if a next test starts an instance with the same name, it may fail with the 'The daemon is already running' error. This change follows the approach of the previous commit ('Wait until residual servers will be stopped'). See a test case in PR #244 (app-tap/test-timeout.test.lua). Part of #65 Part of #157

Otherwise, if a next test will try to start a new instance with the same name, it may fail with the 'The daemon is already running' error, if a previous test fails or reaches --test-timeout. The commit fixes the problem for 'core = tarantool' test suites. The next commit will resolve it for 'core = app' test suites. See a test case in PR #244 (app/test-timeout.test.lua). A few words about details of the implementation. The killall_servers() method performs almost the same actions that stop_nondefault(), so I kept the latter and removed the former. However there are differences: - killall_servers() does not wait for termination of processes. This is why the change fixes given problem. - killall_servers() sends SIGKILL, while stop_nondefault() stop instances as usual: SIGTERM, wait 5 seconds at max, SIGKILL, wait for termination. I'll return instant SIGKILL in one of the following commits: I splitted the changes for ease reading. Part of #157

If a test fails or reaches --test-timeout, some non-default tarantool instances may be left running and so if a next test starts an instance with the same name, it may fail with the 'The daemon is already running' error. This change follows the approach of the previous commit ('Wait until residual servers will be stopped'). See a test case in PR #244 (app-tap/test-timeout.test.lua). Part of #65 Part of #157

Otherwise, if a next test will try to start a new instance with the same name, it may fail with the 'The daemon is already running' error, if a previous test fails or reaches --test-timeout. The commit fixes the problem for 'core = tarantool' test suites. The next commit will resolve it for 'core = app' test suites. See a test case in PR #244 (app/test-timeout.test.lua). A few words about details of the implementation. The killall_servers() method performs almost the same actions that stop_nondefault(), so I kept the latter and removed the former. However there are differences: - killall_servers() does not wait for termination of processes. This is why the change fixes given problem. - killall_servers() sends SIGKILL, while stop_nondefault() stop instances as usual: SIGTERM, wait 5 seconds at max, SIGKILL, wait for termination. I'll return instant SIGKILL in one of the following commits: I splitted the changes for ease reading. Part of #157

If a test fails or reaches --test-timeout, some non-default tarantool instances may be left running and so if a next test starts an instance with the same name, it may fail with the 'The daemon is already running' error. This change follows the approach of the previous commit ('Wait until residual servers will be stopped'). See a test case in PR #244 (app-tap/test-timeout.test.lua). Part of #65 Part of #157

Totktonada · 2020-12-19T16:25:12Z

Updated the test-run submodule in tarantool in 2.7.0-116-g6e6e7b29b, 2.6.1-101-g13ff2ff2b, 2.5.2-65-gcdc10c888, 1.10.8-44-g0e95810dd.

Totktonada reviewed Nov 30, 2020

View reviewed changes

lib/tarantool_server.py Outdated Show resolved Hide resolved

Totktonada reviewed Nov 30, 2020

View reviewed changes

lib/app_server.py Outdated Show resolved Hide resolved

Totktonada reviewed Nov 30, 2020

View reviewed changes

lib/app_server.py Outdated Show resolved Hide resolved

avtikhon force-pushed the avtikhon/add-test-timeout branch from a45a988 to 16ff53d Compare November 30, 2020 04:57

avtikhon requested a review from Totktonada November 30, 2020 05:06

Totktonada reviewed Nov 30, 2020

View reviewed changes

lib/tarantool_server.py Show resolved Hide resolved

Totktonada requested a review from LeonidVas November 30, 2020 05:21

avtikhon self-assigned this Nov 30, 2020

avtikhon added [3sp] labels Nov 30, 2020

LeonidVas approved these changes Dec 1, 2020

View reviewed changes

avtikhon requested a review from Totktonada December 1, 2020 15:10

Totktonada reviewed Dec 6, 2020

View reviewed changes

lib/app_server.py Outdated Show resolved Hide resolved

Totktonada reviewed Dec 6, 2020

View reviewed changes

lib/tarantool_server.py Show resolved Hide resolved

Totktonada force-pushed the avtikhon/add-test-timeout branch from 16ff53d to 258d81c Compare December 6, 2020 10:23

Totktonada mentioned this pull request Dec 6, 2020

Give a warning when configured timeouts look strange #250

Closed

Totktonada force-pushed the avtikhon/add-test-timeout branch from 258d81c to 72ca19a Compare December 7, 2020 16:32

Totktonada approved these changes Dec 7, 2020

View reviewed changes

Totktonada merged commit 1f6d7ba into master Dec 7, 2020

Totktonada deleted the avtikhon/add-test-timeout branch December 7, 2020 16:38

This was referenced Dec 15, 2020

Crash detector looks broken #252

Open

test-run does not kill server that hangs in an instance script #256

Closed

Totktonada mentioned this pull request Dec 15, 2020

Stop instances properly #257

Merged

avtikhon mentioned this pull request Feb 1, 2021

test: flaky issue that replicas can't be terminated with SIGTERM on OSX tarantool/tarantool-qa#32

Open

Totktonada mentioned this pull request May 13, 2021

Convert reproducers from PR comments to tests #274

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add test_timeout to limit test run time #244

Add test_timeout to limit test run time #244

avtikhon commented Nov 26, 2020 •

edited by Totktonada

Loading

Totktonada commented Nov 30, 2020

avtikhon commented Nov 30, 2020

LeonidVas left a comment

Totktonada commented Dec 6, 2020

Totktonada commented Dec 7, 2020

Totktonada left a comment

Totktonada commented Dec 19, 2020

Add test_timeout to limit test run time #244

Add test_timeout to limit test run time #244

Conversation

avtikhon commented Nov 26, 2020 • edited by Totktonada Loading

Totktonada commented Nov 30, 2020

avtikhon commented Nov 30, 2020

LeonidVas left a comment

Choose a reason for hiding this comment

Totktonada commented Dec 6, 2020

Totktonada commented Dec 7, 2020

Totktonada left a comment

Choose a reason for hiding this comment

Totktonada commented Dec 19, 2020

avtikhon commented Nov 26, 2020 •

edited by Totktonada

Loading