Add the 'timeout' option into 'stop server' command and kill a server… #186

Merged (2 commits) on Dec 4, 2020

rtokarev
Contributor

@rtokarev rtokarev commented Aug 19, 2019

… with 'SIGKILL' if it does not finish before the timeout expires.

By default, the drop_cluster() routine uses the SIGTERM signal to stop
the replicas. In some situations on OSX, SIGTERM could not kill all
instances and some processes were left running. To avoid such
situations, SIGKILL is additionally sent to every instance that has not
finished before the timeout expires, so that all of them are stopped.
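
The escalation described above can be illustrated with a minimal sketch. It is not the code from this PR: the `stop_process` name and the use of Python 3's `subprocess` module are assumptions made for the example, and the 5-second default only mirrors the value mentioned later in this thread.

```python
import subprocess
import sys


def stop_process(proc, timeout=5.0):
    """Ask a child process to exit; force-kill it if it does not comply.

    Hypothetical helper. Returns True if SIGKILL was needed.
    """
    proc.terminate()                    # SIGTERM on POSIX
    try:
        proc.wait(timeout=timeout)      # give the process time to exit
        return False
    except subprocess.TimeoutExpired:
        proc.kill()                     # SIGKILL on POSIX, cannot be ignored
        proc.wait()                     # reap the process, leave no zombie
        return True


if __name__ == "__main__":
    # A child that ignores SIGTERM stands in for a stuck instance.
    child_code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    child = subprocess.Popen([sys.executable, "-c", child_code],
                             stdout=subprocess.PIPE)
    child.stdout.readline()             # wait until SIGTERM is ignored
    print(stop_process(child, timeout=0.5))
    child.stdout.close()
```

Returning a flag lets the caller decide whether to report the forced kill, which matters if a stuck instance should be surfaced rather than silently hidden.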

avtikhon added a commit that referenced this pull request Nov 25, 2020
avtikhon added a commit that referenced this pull request Nov 27, 2020
@avtikhon
Contributor

avtikhon commented Nov 27, 2020

The PR was successfully checked on multiple runs in CI:
https://gitlab.com/tarantool/tarantool/-/pipelines?page=1&scope=all&ref=avtikhon%2Fflaky_stable_min_prs

It also resolves the following issues better than the alternatives:

  1. Dropping a cluster: PR #232 ("Use SIGKILL in drop_cluster() routine") fixes the issue only for the single call that stops the cluster in the drop_cluster() routine, while this PR fixes it in the core of the function, so the fix also applies to the Python calls that depend on it.
  2. Hanging instances on OSX: PR #242 ("Move tarantoolctl to test-run tool submodule and add replication_sync_timeout to it") fixes the same issue for Linux only, while this PR stops the instances that hang on OSX, which tarantoolctl could not do in PR #242.

Patch LGTM.

@avtikhon avtikhon requested a review from Totktonada November 27, 2020 05:58
@avtikhon avtikhon self-assigned this Nov 27, 2020
@Totktonada
Member

Part of #157 (solves the problem for 'core = tarantool' test suites).

Totktonada added a commit that referenced this pull request Nov 29, 2020
When there is no output from workers for a long time (10 seconds by
default or 60 seconds when the --long argument is passed), test-run
prints a warning and shows the number of lines in the temporary result
file. It is useful for understanding on which statement a test hangs.

I reproduced the problem by patching tarantool to ignore the SIGTERM
and SIGINT signals and running a simple 'core = tarantool' test. The
test passes successfully, but the worker gets stuck waiting for the
tarantool server to stop.

This particular case should be resolved in PR #186, but only because the
timeout for stopping the server is less than the warning delay. This
assumption looks fragile, especially if we want to make some of those
timeouts / delays configurable. Let's handle the situation when the file
does not exist.

Found while looking into
https://github.com/tarantool/tarantool/issues/5573

Fixes #245
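
Handling a missing temporary result file, as the commit message above proposes, could look roughly like the following. This is only a sketch under stated assumptions: `count_lines`, `no_output_warning`, and the wording of the warning are hypothetical and are not taken from test-run.

```python
import os
import tempfile


def count_lines(path):
    """Return the number of lines in `path`, or None if it does not exist."""
    try:
        with open(path, "rb") as f:
            return sum(1 for _ in f)
    except FileNotFoundError:
        return None


def no_output_warning(test_name, result_path, seconds):
    """Build the warning text for a test that has been silent too long."""
    lines = count_lines(result_path)
    if lines is None:
        where = "result file does not exist"
    else:
        where = "{} lines in result file".format(lines)
    return "No output during {} seconds: {} ({})".format(
        seconds, test_name, where)


if __name__ == "__main__":
    # The result file is deliberately never created here.
    tmp_dir = tempfile.mkdtemp()
    result_path = os.path.join(tmp_dir, "example.result")
    print(no_output_warning("example.test.lua", result_path, 10))
```

Catching the exception instead of checking for existence first avoids a race in which the file disappears between the check and the read.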
@Totktonada
Member

This PR would help us overcome https://github.com/tarantool/tarantool/issues/5573, which hits us in testing from time to time.

For now there is only one place where a [QA Notice] message is printed,
but there will be more, so it is convenient to have one function for
this purpose.

The function should be used in places where we observe incorrect
behaviour of tarantool or a test, but want to lower it to a warning and
proceed further. Of course, there should be a solid reason to do so.

Part of tarantool#157
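
A single helper of the kind described above might be sketched as follows. The `[QA Notice]` tag and the `qa_notice` name come from this thread; the signature, the output layout, and the choice of stderr are assumptions made for illustration and may differ from the real function.

```python
import io
import sys


def qa_notice(*lines, stream=None):
    """Write a '[QA Notice]' block to `stream` (stderr by default).

    Hypothetical layout: a tag line, a blank line, then the message lines.
    The text is also returned so that callers can log it elsewhere.
    """
    stream = sys.stderr if stream is None else stream
    text = "\n[QA Notice]\n\n" + "\n".join(lines) + "\n\n"
    stream.write(text)
    stream.flush()
    return text


if __name__ == "__main__":
    buf = io.StringIO()
    qa_notice("It took more than 5 seconds to stop the server.",
              "Sending SIGKILL.", stream=buf)
    print(buf.getvalue())
```

Keeping the formatting in one place means every notice has the same tag, which makes such messages easy to find in CI logs.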
@Totktonada Totktonada force-pushed the kill_server_after_timeout branch from 0aa92c7 to 5d69357 Compare November 30, 2020 01:36
If a tarantool instance does not handle SIGTERM correctly, it is a bug
in tarantool. A testing system should reveal problems rather than hide
them. However, this patch does exactly that, and I'll explain why.

Many of our tests start and stop instances: they are written to spot
different problems, but not this one. If everything except hanging at
stop is okay, there is no reason to fail the test. We print a notice on
the terminal in this case and proceed further.

In the future, when all problems of this kind are resolved, we can
interpret the presence of such notices as an error. For now, however,
it would not add anything to the quality of our testing.

We had a problem of this kind ([1]) in the past and now have another one
([2]). Until it is resolved, it is worth handling the situation on the
testing system side.

[1]: tarantool/tarantool#4127
[2]: https://github.com/tarantool/tarantool/issues/5573

Part of tarantool#157

Co-authored-by: Alexander Turenko <[email protected]>
@Totktonada Totktonada force-pushed the kill_server_after_timeout branch from 5d69357 to cf397eb Compare November 30, 2020 01:37
@Totktonada
Member

Updated the patch:

  • Rebased on the latest master.
  • Added a patch with the qa_notice() function that I want to use here.
  • Issue a QA notice in case of a stuck tarantool.
  • Defined the kill() function within the stop() one, because the warning is now specific to the situation in stop().
  • Removed the timeout parameter from stop(), because it is not used anywhere for now.
  • Wrote more about the motivation in the commit message and added myself to co-authors.

@Totktonada Totktonada requested a review from LeonidVas November 30, 2020 01:43
avtikhon added a commit that referenced this pull request Nov 30, 2020
Added the 'test-timeout' option, which stops the test process with a
kill signal if the test runs longer than the given number of seconds. By
default it is 110 seconds. This value should be greater than
'replication-sync-timeout' (100 seconds by default) and lower than
'no-output-timeout' (120 seconds by default).

This timeout keeps a hanging test from running until
'no-output-timeout' is reached, at which point the whole testing exits.
Now, if a test hangs, 'test-timeout' terminates the test processes. It
gives the test-run worker a chance to restart the failed test or to
continue with the tests in the worker queue. Before this fix tests hung,
as in [1] and [2]; now the same issues are resolved, as in [3] and [4]
respectively.

To reproduce issues like [2], set 'test-timeout' too low for the test to
complete on the 'restart server ...' command, for example:
  ./test-run.py replication/quorum.test.lua --test-timeout 5 \
    --no-output-timeout 10 --conf memtx

Together with PR #186, which kills the instances that SIGTERM could not
stop, the fix resolves issue #157.

Part of #157

[1] - https://gitlab.com/tarantool/tarantool/-/jobs/835734706#L4968
[2] - https://gitlab.com/tarantool/tarantool/-/jobs/822649038#L4835
[3] - https://gitlab.com/tarantool/tarantool/-/jobs/874058059#L4993
[4] - https://gitlab.com/tarantool/tarantool/-/jobs/874058745#L5316
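
The ordering that the commit message requires between the three timeouts can be written down as a small check. This is an illustrative sketch only: `check_timeouts` is a hypothetical helper, not part of test-run, and the defaults are the values quoted in the message above.

```python
def check_timeouts(replication_sync_timeout=100,
                   test_timeout=110,
                   no_output_timeout=120):
    """Return a list of violations of the expected timeout ordering.

    Expected: replication_sync_timeout < test_timeout < no_output_timeout.
    """
    problems = []
    if test_timeout <= replication_sync_timeout:
        problems.append(
            "test-timeout ({}) should be greater than "
            "replication-sync-timeout ({})".format(
                test_timeout, replication_sync_timeout))
    if test_timeout >= no_output_timeout:
        problems.append(
            "test-timeout ({}) should be lower than "
            "no-output-timeout ({})".format(
                test_timeout, no_output_timeout))
    return problems


if __name__ == "__main__":
    # Values from the reproduction command above.
    for problem in check_timeouts(test_timeout=5, no_output_timeout=10):
        print(problem)
```

With the values from the reproduction command, only the first condition is violated, which is the point of that command: the test is killed before 'restart server ...' can complete.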
@avtikhon
Contributor

The patch was successfully checked at [1], LGTM:

[1] - https://gitlab.com/tarantool/tarantool/-/pipelines/222917874

@avtikhon avtikhon self-requested a review November 30, 2020 05:04
@LeonidVas LeonidVas left a comment

LGTM.

@Totktonada Totktonada merged commit 67bd47d into tarantool:master Dec 4, 2020
Totktonada added a commit to tarantool/tarantool that referenced this pull request Dec 4, 2020
Limit waiting for a tarantool process termination to 5 seconds. When
this timeout is exceeded, print a warning to the terminal and send
SIGKILL to the process.

We need to handle the situation with a stuck tarantool process on the
testing system side to overcome a problem of this kind that appears on
Mac OS (see #5573).

This changeset handles one particular case: stopping a tarantool
instance that was either started to execute a 'core = tarantool' test
suite or started from a test using the `test_run:cmd('start server
foo')` command. It does not handle stopping a tarantool that was started
to execute a 'core = app' test or started from a test directly using
io.popen() or the built-in 'popen' module.

Related to #5573
Part of tarantool/test-run#157
The changeset: tarantool/test-run#186
@Totktonada
Member

Updated the test-run submodule in tarantool in the following commits: 2.7.0-83-g24f57a353, 2.6.1-70-g5e660f7cd, 2.5.2-44-g85eb5eea6, 1.10.8-31-g1e01031e9.

Totktonada pushed a commit that referenced this pull request Dec 6, 2020
Added the 'test-timeout' option, which stops the test process with a
kill signal if the test runs longer than the given number of seconds. By
default it is 110 seconds. This value should be greater than
'replication-sync-timeout' (100 seconds by default) and lower than
'no-output-timeout' (120 seconds by default).

This timeout keeps a hanging test from running until
'no-output-timeout' is reached, at which point the whole testing exits.
Now, if a test hangs, 'test-timeout' terminates the test processes. It
gives the test-run worker a chance to restart the failed test or to
continue with the tests in the worker queue. Before this fix tests hung,
as in [1] and [2]; now the same issues are resolved, as in [3] and [4]
respectively.

To reproduce issues like [2], set 'test-timeout' too low for the test to
complete on the 'restart server ...' command, for example:
  ./test-run.py replication/quorum.test.lua --test-timeout 5 \
    --no-output-timeout 10 --conf memtx

This commit finally resolves the problem of a stuck tarantool that is
not terminated after SIGTERM. PR #186 fixes the first part,
TarantoolServer. Now the AppServer case is handled too.

Fixes #157

[1] - https://gitlab.com/tarantool/tarantool/-/jobs/835734706#L4968
[2] - https://gitlab.com/tarantool/tarantool/-/jobs/822649038#L4835
[3] - https://gitlab.com/tarantool/tarantool/-/jobs/874058059#L4993
[4] - https://gitlab.com/tarantool/tarantool/-/jobs/874058745#L5316
4 participants