Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set instances restart to SIGKILL by default #235

Conversation

avtikhon
Copy link
Contributor

@avtikhon avtikhon commented Nov 10, 2020

Found that some of the instances could not be killed with SIGTERM signal
and it really needed SIGKILL signal to be sure that the instance would
be terminated. Otherwise some tests may fail, like:

  No output during 120 seconds. Will abort after 120 seconds without
output. List of workers not reporting the status:
  - 001_engine [engine/json.test.lua, memtx] at
var/001_engine/json.result:598
  Test hung! Result content mismatch:
  --- engine/json.result	Fri Nov  6 08:35:45 2020
  +++ var/001_engine/json.result	Mon Nov  9 03:48:40 2020
  @@ -596,248 +596,3 @@
   - ok
   ...
   test_run:cmd("restart server default")
  -engine = test_run:get_cfg('engine')
  ----
  -…

Test run uses 'tarantoolctl' tool for commands in format "restart server
default", and 'tarantoolctl' had only single way to kill it using
SIGTERM signal. To implement ability to use it with SIGKILL was added
additional non-mandatory option to stop and restart 'tarantoolctl'
commands, like:

tarantoolctl stop [--signal=SIGKILL]
tarantoolctl restart [--signal=SIGKILL]

This options didn't change the backward compatibility of 'tarantoolctl'
tool, but gave test-run ability to stop and restart instances with
SIGKILL signal:

tarantoolctl restart --signal=9

It helped to fix the issue with hanging instances in the needed tests.

@avtikhon avtikhon requested a review from Totktonada November 10, 2020 13:28
@avtikhon avtikhon self-assigned this Nov 10, 2020
@avtikhon avtikhon force-pushed the avtikhon/gh-5510-restart-sigkill branch 2 times, most recently from f42fa88 to d344a7b Compare November 11, 2020 05:05
Found that some of the instances could not be killed with SIGTERM signal
and it really needed SIGKILL signal to be sure that the instance would
be terminated. Otherwise some tests may fail, like:

  No output during 120 seconds. Will abort after 120 seconds without
output. List of workers not reporting the status:
  - 001_engine [engine/json.test.lua, memtx] at
var/001_engine/json.result:598
  Test hung! Result content mismatch:
  --- engine/json.result	Fri Nov  6 08:35:45 2020
  +++ var/001_engine/json.result	Mon Nov  9 03:48:40 2020
  @@ -596,248 +596,3 @@
   - ok
   ...
   test_run:cmd("restart server default")
  -engine = test_run:get_cfg('engine')
  ----
  -…

Test run uses 'tarantoolctl' tool for commands in format "restart server
default", and 'tarantoolctl' had only single way to kill it using
SIGTERM signal. To implement ability to use it with SIGKILL was added
additional non-mandatory option to stop and restart 'tarantoolctl'
commands, like:

  tarantoolctl stop <instance> [--signal=SIGKILL]
  tarantoolctl restart <instance> [--signal=SIGKILL]

This options didn't change the backward compatibility of 'tarantoolctl'
tool, but gave test-run ability to stop and restart instances with
SIGKILL signal:

  tarantoolctl restart <instance> --signal=9

It helped to fix the issue with hanging instances in the needed tests.
@avtikhon avtikhon force-pushed the avtikhon/gh-5510-restart-sigkill branch from d344a7b to 591b576 Compare November 21, 2020 05:10
@avtikhon avtikhon force-pushed the avtikhon/use_kill_after_term branch from d3bfc60 to daed116 Compare November 21, 2020 05:34
@Totktonada
Copy link
Member

If something is wrong with tarantoolctl start, please, provide information, what exactly. We should not blindly change it to tarantoolctl restart without any understanding of the underlying problem. It just counter-intuitive: doing a restart in a method that is named as 'start'. You don't ever leave a comment here: just like it is normal and obvious to do 'restart --signal=9' in a start() method. Argh.

Regarding SIGKILL, the appropriate fix was implemented in PR #186: the commit links exact problems in tarantool and it is obvious how the commit changes the behaviour of the testing system. The testing system issues a warning, when meet the bad behaviour of tarantool, not just 'fix' tarantool's behaviour. Here you changed the place, where old server was not killed for some reason, but the new server with the same script wants to be started in the same directory (on the same worker). Since PR #186 resolves the main problem with stopping stuck instances, the only way to step into the problem is to miss a stop command and try to start a new server. Are you have a case of this kind?

Anyway, it should be implemented just like PR #186: try SIGTERM, wait, issue a warning and kill by SIGKILL. Maybe even it worth to give a warning if we found any old server: it should not occur if a test is written in a right way. Or, maybe, even fail the test that attempts to do this strange sequence of steps. We need a real case of this behaviour to decide.

Past days the kill_old_server() method was more actual: test-run often miss tarantool's and don't stop them; test/var directory was not removed before each testing.

I'll close the PR, because I don't see any prove that a problem exists. Feel free to reopen, when you'll have a reproducer. Don't miss the suggestions above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants