Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test-run left running subprocesses alive after exiting by timeout #382

Closed
ligurio opened this issue May 4, 2023 · 2 comments · Fixed by #384
Closed

test-run left running subprocesses alive after exiting by timeout #382

ligurio opened this issue May 4, 2023 · 2 comments · Fixed by #384

Comments

@ligurio
Copy link
Member

ligurio commented May 4, 2023

Steps to reproduce:

  • create a test in luatest format (I suspect the same will happen with other supported tests)
  • our new test requires 25 sec for running
  • run test with luatest, everything is ok
  • run test with test-run and set timeout "--no-output-timeout" to 15 sec

Expected result

test-run exits by timeout, because "--no-output-timeout" is less than test duration, and kills all subprocesses

Actual result

test-run exits by timeout and left luatest process alive

See log:

[1] ~/sources/MRG/tarantool$ cat test/box-luatest/sample_test.lua 
local t = require('luatest')
local g = t.group('test-run')

g.test_example_1 = function() os.execute("sleep 10") end


[1] ~/sources/MRG/tarantool$ luatest -c test/box-luatest/sample_test.lua 
Tarantool version is 2.11.0-rc2-0-g2ae0c94a2
Running with --shuffle group:6896
.
Ran 1 tests in 10.003 seconds, 1 succeeded, 0 failed

[0] ~/sources/MRG/tarantool$ ./test/test-run.py --debug --verbose --no-output-timeout=15 box-luatest/sample_test.lua                        

<snipped>

WORKR TEST                                            PARAMS          RESULT
---------------------------------------------------------------------------------
No output during 10 seconds. Will abort after 15 seconds without output. List of workers not reporting the status:
- 001_box-luatest [box-luatest/sample_test.lua, None] at /tmp/t/001_box-luatest/sample.result:0
No output during 20 seconds. Will abort after 15 seconds without output. List of workers not reporting the status:
- 001_box-luatest [box-luatest/sample_test.lua, None] at /tmp/t/001_box-luatest/sample.result:0
Test hung! Result content mismatch:
[File does not exist: box-luatest/sample.result]

[Main process] No output from workers. It seems that we hang. Send SIGKILL to workers; exiting...
---------------------------------------------------------------------------------
Top 10 tests by occupied memory (RSS, Mb):
*   19.8 box-luatest/sample_test.lua                                  

(Tests quicker than 0.1 seconds may be missed.)

---------------------------------------------------------------------------------
Top 10 longest tests (seconds):
---------------------------------------------------------------------------------
* undone: 1
[0] ~/sources/MRG/tarantool$ ps ax | grep tarantool
3238276 pts/6    Sl     0:00 tarantool /home/sergeyb/sources/MRG/tarantool/test-run/lib/luatest/bin/luatest -c -v test/box-luatest/sample_test.lua -o tap --shuffle none
3238301 pts/6    S+     0:00 grep tarantool

Bug was found on debugging etcd-client tests in https://github.com/tarantool/tarantool-ee/pull/444.

@Totktonada
Copy link
Member

Totktonada commented May 4, 2023

I looked over the luatest_server.py code and it seems the root of the problem is that it doesn't follow --test-timeout. We should add a timeout to the .communicate() call, when test-run is launched under Python 3 (compare: py2, py3).

(Another option is to use threading.Timer like in app_server.py.)

The same change has sense for unittest_server.py too.

How test-run handles different timeouts

test-run has a carcade of increasing timeouts.

  • Timeouts for activities within a test.
    • Server stop timeout (hardcoded to 5 seconds after SIGTERM, then SIGKILL and wait for termination). Applicable for tarantool and app tests1, which start utility servers.
    • --server-start-timeout (90 seconds). Applicable for tarantool and app tests.
  • --test-timeout (110 seconds). Applicable for tarantool, app tests.
  • --no-output-timeout (120 seconds). It 'stops the world', so we can say that it is applicable for all test types.

The idea here is to catch a problem as close as possible to its root.

--no-output-timeout is a kind of heuristical -- if nothing occurs during two minutes, something is likely going wrong.

This is the last resort timeout. It means that test-run is constructed with an idea that it should never occur. Hitting --no-output-timeout likely means a problem in test-run itself.

In the given case the problem is lack of --test-timeout handling in luatest_server.py.

Since the --no-output-timeout is the 'never occur' thing we just do what is possible -- kill the workers and exit with a report.

It seems move valuable to fix the cause of hitting --no-output-timeout rather than working on improving its actions.

However I have an idea here, which shouldn't be sophisticated: we can move each worker into its own process group/session and kill the entire group at hitting the timeout.

However, it has its own consequences: shell's job control will not see the workers, so, for example, Ctrl+Z will not pause them. (It can be handled by re-sending a signal from the main process to the workers, though.)

Footnotes

  1. There is a description of test-run's test types in https://github.com/tarantool/luatest/issues/191.

Totktonada added a commit that referenced this issue May 4, 2023
The implementation is borrowed from `lib/app_server.py`.

Closes #382
@Totktonada
Copy link
Member

I proposed a fix in PR #384. Sergey, please, take a look.

Remember to add something like --test-timeout=10 to your reproducer (and change sleep 10 to, say, sleep 25).

ylobankov pushed a commit that referenced this issue May 10, 2023
The implementation is borrowed from `lib/app_server.py`.

Closes #382
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants