test-run left running subprocesses alive after exiting by timeout #382

ligurio · 2023-05-04T20:31:33Z

Steps to reproduce:

create a test in luatest format (I suspect the same will happen with other supported tests)
our new test requires 25 sec for running
run test with luatest, everything is ok
run test with test-run and set timeout "--no-output-timeout" to 15 sec

Expected result

test-run exits by timeout, because "--no-output-timeout" is less than test duration, and kills all subprocesses

Actual result

test-run exits by timeout and left luatest process alive

See log:

[1] ~/sources/MRG/tarantool$ cat test/box-luatest/sample_test.lua 
local t = require('luatest')
local g = t.group('test-run')

g.test_example_1 = function() os.execute("sleep 10") end


[1] ~/sources/MRG/tarantool$ luatest -c test/box-luatest/sample_test.lua 
Tarantool version is 2.11.0-rc2-0-g2ae0c94a2
Running with --shuffle group:6896
.
Ran 1 tests in 10.003 seconds, 1 succeeded, 0 failed

[0] ~/sources/MRG/tarantool$ ./test/test-run.py --debug --verbose --no-output-timeout=15 box-luatest/sample_test.lua                        

<snipped>

WORKR TEST                                            PARAMS          RESULT
---------------------------------------------------------------------------------
No output during 10 seconds. Will abort after 15 seconds without output. List of workers not reporting the status:
- 001_box-luatest [box-luatest/sample_test.lua, None] at /tmp/t/001_box-luatest/sample.result:0
No output during 20 seconds. Will abort after 15 seconds without output. List of workers not reporting the status:
- 001_box-luatest [box-luatest/sample_test.lua, None] at /tmp/t/001_box-luatest/sample.result:0
Test hung! Result content mismatch:
[File does not exist: box-luatest/sample.result]

[Main process] No output from workers. It seems that we hang. Send SIGKILL to workers; exiting...
---------------------------------------------------------------------------------
Top 10 tests by occupied memory (RSS, Mb):
*   19.8 box-luatest/sample_test.lua                                  

(Tests quicker than 0.1 seconds may be missed.)

---------------------------------------------------------------------------------
Top 10 longest tests (seconds):
---------------------------------------------------------------------------------
* undone: 1
[0] ~/sources/MRG/tarantool$ ps ax | grep tarantool
3238276 pts/6    Sl     0:00 tarantool /home/sergeyb/sources/MRG/tarantool/test-run/lib/luatest/bin/luatest -c -v test/box-luatest/sample_test.lua -o tap --shuffle none
3238301 pts/6    S+     0:00 grep tarantool

Bug was found on debugging etcd-client tests in https://github.com/tarantool/tarantool-ee/pull/444.

The text was updated successfully, but these errors were encountered:

Totktonada · 2023-05-04T21:53:28Z

I looked over the luatest_server.py code and it seems the root of the problem is that it doesn't follow --test-timeout. We should add a timeout to the .communicate() call, when test-run is launched under Python 3 (compare: py2, py3).

(Another option is to use threading.Timer like in app_server.py.)

The same change has sense for unittest_server.py too.

How test-run handles different timeouts

test-run has a carcade of increasing timeouts.

Timeouts for activities within a test.
- Server stop timeout (hardcoded to 5 seconds after SIGTERM, then SIGKILL and wait for termination). Applicable for tarantool and app tests¹, which start utility servers.
- --server-start-timeout (90 seconds). Applicable for tarantool and app tests.
--test-timeout (110 seconds). Applicable for tarantool, app tests.
--no-output-timeout (120 seconds). It 'stops the world', so we can say that it is applicable for all test types.

The idea here is to catch a problem as close as possible to its root.

--no-output-timeout is a kind of heuristical -- if nothing occurs during two minutes, something is likely going wrong.

This is the last resort timeout. It means that test-run is constructed with an idea that it should never occur. Hitting --no-output-timeout likely means a problem in test-run itself.

In the given case the problem is lack of --test-timeout handling in luatest_server.py.

Since the --no-output-timeout is the 'never occur' thing we just do what is possible -- kill the workers and exit with a report.

It seems move valuable to fix the cause of hitting --no-output-timeout rather than working on improving its actions.

However I have an idea here, which shouldn't be sophisticated: we can move each worker into its own process group/session and kill the entire group at hitting the timeout.

However, it has its own consequences: shell's job control will not see the workers, so, for example, Ctrl+Z will not pause them. (It can be handled by re-sending a signal from the main process to the workers, though.)

There is a description of test-run's test types in https://github.com/tarantool/luatest/issues/191. ↩

The implementation is borrowed from `lib/app_server.py`. Closes #382

Totktonada · 2023-05-04T22:16:28Z

I proposed a fix in PR #384. Sergey, please, take a look.

Remember to add something like --test-timeout=10 to your reproducer (and change sleep 10 to, say, sleep 25).

The implementation is borrowed from `lib/app_server.py`. Closes #382

Totktonada added a commit that referenced this issue May 4, 2023

Follow test timeout for luatest

99fec66

The implementation is borrowed from `lib/app_server.py`. Closes #382

Totktonada mentioned this issue May 4, 2023

Follow test timeout for luatest #384

Merged

This was referenced May 4, 2023

test-run expects test output on stdout and not on stderr #383

Open

Hanged out test is not reported as failed #388

Open

ylobankov closed this as completed in #384 May 10, 2023

ylobankov pushed a commit that referenced this issue May 10, 2023

Follow test timeout for luatest

f67d523

The implementation is borrowed from `lib/app_server.py`. Closes #382

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test-run left running subprocesses alive after exiting by timeout #382

test-run left running subprocesses alive after exiting by timeout #382

ligurio commented May 4, 2023

Totktonada commented May 4, 2023 •

edited

Loading

Totktonada commented May 4, 2023

test-run left running subprocesses alive after exiting by timeout #382

test-run left running subprocesses alive after exiting by timeout #382

Comments

ligurio commented May 4, 2023

Steps to reproduce:

Expected result

Actual result

Totktonada commented May 4, 2023 • edited Loading

Footnotes

Totktonada commented May 4, 2023

Totktonada commented May 4, 2023 •

edited

Loading