Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add test_timeout to limit test run time #244

Merged
merged 1 commit into from
Dec 7, 2020

Conversation

avtikhon
Copy link
Contributor

@avtikhon avtikhon commented Nov 26, 2020

Added 'test-timeout' option to be able to break the test process with
kill signal if the test runs longer than this amount of seconds. By
default it is equal to 110 seconds. This value should be bigger than
'replication-sync-timeout' (which is 100 seconds by default) and
lower than 'no-output-timeout' (which is 120 seconds by default).

This timeout helped to avoid of issues with hanging tests till reach
of 'no-output-timeout' timeout, when overall testing exits. For now
if the test hangs than 'test-timeout' timeout helps to exit the test
processes. It gives the test-run worker chance to restart the failed
test either continue tests in worker queue. Before this fix tests,
hanged, like [1] and [2], for now the same issues resolved, like [3]
and [4] appropriate.

To reproduce the issues like [2], try to set 'test-timeout' not enough
to complete the test on 'restart server ...' command, like:

./test-run.py replication/quorum.test.lua --test-timeout 5
--no-output-timeout 10 --conf memtx

This commit implements terminating of stuck AppServer instances by
SIGKILL. However there are still problems regarding stopping and waiting
of non-default instances. They will be resolved in the following
commits. See PR #244 for details.

Part of #157

[1] - https://gitlab.com/tarantool/tarantool/-/jobs/835734706#L4968
[2] - https://gitlab.com/tarantool/tarantool/-/jobs/822649038#L4835
[3] - https://gitlab.com/tarantool/tarantool/-/jobs/874058059#L4993
[4] - https://gitlab.com/tarantool/tarantool/-/jobs/874058745#L5316

@Totktonada
Copy link
Member

Together with PR #186 is should resolve #157. Please, reflect this fact in the commit message.

@avtikhon avtikhon force-pushed the avtikhon/add-test-timeout branch from a45a988 to 16ff53d Compare November 30, 2020 04:57
@avtikhon
Copy link
Contributor Author

Made all changes as you suggested, please review.

@avtikhon avtikhon requested a review from Totktonada November 30, 2020 05:06
@Totktonada Totktonada requested a review from LeonidVas November 30, 2020 05:21
@avtikhon avtikhon self-assigned this Nov 30, 2020
Copy link

@LeonidVas LeonidVas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea.

@avtikhon avtikhon requested a review from Totktonada December 1, 2020 15:10
@Totktonada Totktonada force-pushed the avtikhon/add-test-timeout branch from 16ff53d to 258d81c Compare December 6, 2020 10:23
@Totktonada
Copy link
Member

I rebased the PR on top of the current master and fixed the problem with mixed temporary result files. I still see problems around stopping non-default instances (both for core = 'tarantool' and core = 'app' tests). I'll looks around those problems and likely will fix them in the following PRs. However I'll investigate them first, only after this I'll able to say whether this PR is okay to push.

@Totktonada
Copy link
Member

TL;DR: I found several problems around killing and waiting of instances and I'll prepare fixes for them. I don't know whether it is better to let them be in a separate PR or add here: I'll think. However the current patch definitely doesn't close #157, since I'm able to get stuck instances.

Anyway, the PR still on my side.


Simple test cases:

$ cat test/app-tap/test-timeout.test.lua
#!/usr/bin/env tarantool

local tap = require('tap')
local test_run = require('test_run').new()

test_run:cmd("create server foo with script='app-tap/debug/server.lua'")
test_run:cmd('start server foo')

local test = tap.test('test-timeout')
test:plan(1)

while true do end

test:ok('passed')

test_run:cmd('stop server foo')
test_run:cmd('cleanup server foo')
test_run:cmd('delete server foo')

$ cat test/app/test-timeout.test.lua
test_run = require('test_run').new()

SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'}
test_run:create_cluster(SERVERS, "replication", {args='2 0.4'})
test_run:wait_fullmesh(SERVERS)

while true do end

test_run:drop_cluster(SERVERS)

$ cat test/app/test-timeout.result
-- test-run result file version 2
test_run = require('test_run').new()
 | ---
 | ...

SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'}
 | ---
 | ...
test_run:create_cluster(SERVERS, "replication", {args='2 0.4'})
 | ---
 | ...
test_run:wait_fullmesh(SERVERS)
 | ---
 | ...

test_run:drop_cluster(SERVERS)
 | ---
 | ...

Run app/test-timeout.test.lua:

$ ./test/test-run.py --test-timeout 5 -j16 --force $(yes app/test-timeout.test.lua | head -n 100)
<...>
[002] [Instance "election_replica1" returns with non-zero exit code: 1]
<...>
[002] Starting instance election_replica1...
[002] The daemon is already running: PID 22915
[002] [ fail ]

$ ps ax | grep [t]arantool
<empty>

The reason is lack of waiting of a process to stop in killall_servers(), so the The daemon is already running error occurs. However I don't see any reason to duplicate the logic of TestState.stop_nondefault(), which is called at end of successful execution of a test. So, I would fix it this way:

diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index bdbc935..2d8ed34 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -335,37 +335,6 @@ class LuaTest(Test):
         # one.
         ts.stop_nondefault()
 
-    def killall_servers(self, server, ts, crash_occured):
-        """ kill all servers and crash detectors before stream swap """
-        color_log('Kill all servers ...\n', schema='info')
-        server_list = ts.servers.values() + [server, ]
-        check_list = [s for s in server_list if 'process' in s.__dict__]
-
-        # check that all servers stopped correctly
-        for server in check_list:
-            bad_returncode = server.process.returncode not in (None,
-                                                               0,
-                                                               -signal.SIGKILL,
-                                                               -signal.SIGTERM)
-            # if non-default server crashed but it was expected
-            # don't kill the default server and crash detectors
-            crash_occured = crash_occured or \
-                bad_returncode and not server.crash_expected
-
-        for server in check_list:
-            server.process.poll()
-
-            if crash_occured:
-                # kill all servers and crash detectors on crash
-                if server.process.returncode is None:
-                    server.process.kill()
-                if server.crash_detector is not None:
-                    gevent.kill(server.crash_detector)
-            elif server.process.returncode is not None:
-                # join crash detectors of stopped servers
-                if server.crash_detector is not None:
-                    save_join(server.crash_detector)
-
     def execute(self, server):
         super(LuaTest, self).execute(server)
         cls_name = server.__class__.__name__.lower()
@@ -386,21 +355,19 @@ class LuaTest(Test):
         lua = TestRunGreenlet(self.exec_loop, ts)
         self.current_test_greenlet = lua
         lua.start()
-        crash_occured = True
         try:
-            crash_occured = save_join(lua, timeout=Options().args.test_timeout)
-            self.killall_servers(server, ts, crash_occured)
+            save_join(lua, timeout=Options().args.test_timeout)
         except KeyboardInterrupt:
             # prevent tests greenlet from writing to the real stdout
             lua.kill()
-
-            ts.stop_nondefault()
             raise
         except TarantoolStartError as e:
             color_stdout('\n[Instance "{0}"] Failed to start tarantool '
                          'instance "{1}"\n'.format(server.name, e.name),
                          schema='error')
             server.kill_current_test()
+        finally:
+            ts.stop_nondefault()
 
 
 class PythonTest(Test):

Run app-tap/test-timeout.test.lua:

$ ./test/test-run.py --test-timeout 5 -j16 --force $(yes app-tap/test-timeout.test.lua | head -n 100)
<...>
[016] app-tap/test-timeout.test.lua                                   
[016] Test.run() received the following error:
[016] Traceback (most recent call last):
[016]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/test.py", line 182, in run
[016]     self.execute(server)
[016]   File "/home/alex/projects/tarantool-meta/r/t-3/test-run/lib/app_server.py", line 73, in execute
[016]     if retval['returncode'] != 0:
[016] KeyError: 'returncode'
[016] 
[016] [ fail ]
<...>

$ ps ax | grep [t]arantool
19320 pts/2    Sl     0:00 tarantool server.lua <running>
<..and 15 more like it..>
19543 pts/2    Rl     0:02 tarantool /home/alex/projects/tarantool-meta/r/t-3/test/app-tap/test-timeout.test.lua
<..and 49 more like it..>

The reason is the same 'The daemon is already running' error, which can be found in logs. The following patch fixes the problem:

diff --git a/lib/app_server.py b/lib/app_server.py
index a02200b..de485d2 100644
--- a/lib/app_server.py
+++ b/lib/app_server.py
@@ -70,7 +70,9 @@ class AppTest(Test):
         except TarantoolStartError:
             # A non-default server failed to start.
             raise TestExecutionError
-        if retval['returncode'] != 0:
+        finally:
+            ts.stop_nondefault()
+        if retval.get('returncode', None) != 0:
             raise TestExecutionError

Added 'test-timeout' option to be able to break the test process with
kill signal if the test runs longer than this amount of seconds. By
default it is equal to 110 seconds. This value should be bigger than
'replication-sync-timeout' (which is 100 seconds by default) and
lower than 'no-output-timeout' (which is 120 seconds by default).

This timeout helped to avoid of issues with hanging tests till reach
of 'no-output-timeout' timeout, when overall testing exits. For now
if the test hangs than 'test-timeout' timeout helps to exit the test
processes. It gives the test-run worker chance to restart the failed
test either continue tests in worker queue. Before this fix tests,
hanged, like [1] and [2], for now the same issues resolved, like [3]
and [4] appropriate.

To reproduce the issues like [2], try to set 'test-timeout' not enough
to complete the test on 'restart server ...' command, like:

  ./test-run.py replication/quorum.test.lua --test-timeout 5 \
    --no-output-timeout 10 --conf memtx

This commit implements terminating of stuck AppServer instances by
SIGKILL. However there are still problems regarding stopping and waiting
of non-default instances. They will be resolved in the following PRs /
commits. See PR #244 for details.

Part of #157

[1] - https://gitlab.com/tarantool/tarantool/-/jobs/835734706#L4968
[2] - https://gitlab.com/tarantool/tarantool/-/jobs/822649038#L4835
[3] - https://gitlab.com/tarantool/tarantool/-/jobs/874058059#L4993
[4] - https://gitlab.com/tarantool/tarantool/-/jobs/874058745#L5316
@Totktonada Totktonada force-pushed the avtikhon/add-test-timeout branch from 258d81c to 72ca19a Compare December 7, 2020 16:32
Copy link
Member

@Totktonada Totktonada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with the patch.

As the tests above show, it does not resolve #157, so I changed the PR description and the commit message accordingly.

I'll send the corresponding fixes in a separate PR. I'll not update the submodule in tarantool after this patch. But I'll update it after the fixes.

@Totktonada Totktonada merged commit 1f6d7ba into master Dec 7, 2020
@Totktonada Totktonada deleted the avtikhon/add-test-timeout branch December 7, 2020 16:38
Totktonada added a commit that referenced this pull request Dec 15, 2020
Otherwise, if a next test will try to start a new instance with the
same name, it may fail with the 'The daemon is already running' error,
if a previous test fails or reaches --test-timeout.

The commit fixes the problem for 'core = tarantool' test suites. The
next commit will resolve it for 'core = app' test suites.

See a test case in PR #244 (app/test-timeout.test.lua).

A few words about details of the implementation. The killall_servers()
method performs almost the same actions that stop_nondefault(), so I
kept the latter and removed the former. However there are differences:

- killall_servers() does not wait for termination of processes. This is
  why the change fixes given problem.
- killall_servers() sends SIGKILL, while stop_nondefault() stop
  instances as usual: SIGTERM, wait 5 seconds at max, SIGKILL, wait for
  termination. I'll return instant SIGKILL in one of the following
  commits: I splitted the changes for ease reading.

Part of #157
Totktonada added a commit that referenced this pull request Dec 15, 2020
If a test fails or reaches --test-timeout, some non-default tarantool
instances may be left running and so if a next test starts an instance
with the same name, it may fail with the 'The daemon is already running'
error.

This change follows the approach of the previous commit ('Wait until
residual servers will be stopped').

See a test case in PR #244 (app-tap/test-timeout.test.lua).

Part of #65
Part of #157
avtikhon pushed a commit that referenced this pull request Dec 16, 2020
Otherwise, if a next test will try to start a new instance with the
same name, it may fail with the 'The daemon is already running' error,
if a previous test fails or reaches --test-timeout.

The commit fixes the problem for 'core = tarantool' test suites. The
next commit will resolve it for 'core = app' test suites.

See a test case in PR #244 (app/test-timeout.test.lua).

A few words about details of the implementation. The killall_servers()
method performs almost the same actions that stop_nondefault(), so I
kept the latter and removed the former. However there are differences:

- killall_servers() does not wait for termination of processes. This is
  why the change fixes given problem.
- killall_servers() sends SIGKILL, while stop_nondefault() stop
  instances as usual: SIGTERM, wait 5 seconds at max, SIGKILL, wait for
  termination. I'll return instant SIGKILL in one of the following
  commits: I splitted the changes for ease reading.

Part of #157
avtikhon pushed a commit that referenced this pull request Dec 16, 2020
If a test fails or reaches --test-timeout, some non-default tarantool
instances may be left running and so if a next test starts an instance
with the same name, it may fail with the 'The daemon is already running'
error.

This change follows the approach of the previous commit ('Wait until
residual servers will be stopped').

See a test case in PR #244 (app-tap/test-timeout.test.lua).

Part of #65
Part of #157
avtikhon pushed a commit that referenced this pull request Dec 16, 2020
Otherwise, if a next test will try to start a new instance with the
same name, it may fail with the 'The daemon is already running' error,
if a previous test fails or reaches --test-timeout.

The commit fixes the problem for 'core = tarantool' test suites. The
next commit will resolve it for 'core = app' test suites.

See a test case in PR #244 (app/test-timeout.test.lua).

A few words about details of the implementation. The killall_servers()
method performs almost the same actions that stop_nondefault(), so I
kept the latter and removed the former. However there are differences:

- killall_servers() does not wait for termination of processes. This is
  why the change fixes given problem.
- killall_servers() sends SIGKILL, while stop_nondefault() stop
  instances as usual: SIGTERM, wait 5 seconds at max, SIGKILL, wait for
  termination. I'll return instant SIGKILL in one of the following
  commits: I splitted the changes for ease reading.

Part of #157
avtikhon pushed a commit that referenced this pull request Dec 16, 2020
If a test fails or reaches --test-timeout, some non-default tarantool
instances may be left running and so if a next test starts an instance
with the same name, it may fail with the 'The daemon is already running'
error.

This change follows the approach of the previous commit ('Wait until
residual servers will be stopped').

See a test case in PR #244 (app-tap/test-timeout.test.lua).

Part of #65
Part of #157
Totktonada added a commit that referenced this pull request Dec 18, 2020
Otherwise, if a next test will try to start a new instance with the
same name, it may fail with the 'The daemon is already running' error,
if a previous test fails or reaches --test-timeout.

The commit fixes the problem for 'core = tarantool' test suites. The
next commit will resolve it for 'core = app' test suites.

See a test case in PR #244 (app/test-timeout.test.lua).

A few words about details of the implementation. The killall_servers()
method performs almost the same actions that stop_nondefault(), so I
kept the latter and removed the former. However there are differences:

- killall_servers() does not wait for termination of processes. This is
  why the change fixes given problem.
- killall_servers() sends SIGKILL, while stop_nondefault() stop
  instances as usual: SIGTERM, wait 5 seconds at max, SIGKILL, wait for
  termination. I'll return instant SIGKILL in one of the following
  commits: I splitted the changes for ease reading.

Part of #157
Totktonada added a commit that referenced this pull request Dec 18, 2020
If a test fails or reaches --test-timeout, some non-default tarantool
instances may be left running and so if a next test starts an instance
with the same name, it may fail with the 'The daemon is already running'
error.

This change follows the approach of the previous commit ('Wait until
residual servers will be stopped').

See a test case in PR #244 (app-tap/test-timeout.test.lua).

Part of #65
Part of #157
@Totktonada
Copy link
Member

Updated the test-run submodule in tarantool in 2.7.0-116-g6e6e7b29b, 2.6.1-101-g13ff2ff2b, 2.5.2-65-gcdc10c888, 1.10.8-44-g0e95810dd.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants