Set instances restart to SIGKILL by default #235

avtikhon · 2020-11-10T13:28:45Z

Found that some of the instances could not be killed with SIGTERM signal
and it really needed SIGKILL signal to be sure that the instance would
be terminated. Otherwise some tests may fail, like:

  No output during 120 seconds. Will abort after 120 seconds without
output. List of workers not reporting the status:
  - 001_engine [engine/json.test.lua, memtx] at
var/001_engine/json.result:598
  Test hung! Result content mismatch:
  --- engine/json.result	Fri Nov  6 08:35:45 2020
  +++ var/001_engine/json.result	Mon Nov  9 03:48:40 2020
  @@ -596,248 +596,3 @@
   - ok
   ...
   test_run:cmd("restart server default")
  -engine = test_run:get_cfg('engine')
  ----
  -…

Test run uses 'tarantoolctl' tool for commands in format "restart server
default", and 'tarantoolctl' had only single way to kill it using
SIGTERM signal. To implement ability to use it with SIGKILL was added
additional non-mandatory option to stop and restart 'tarantoolctl'
commands, like:

tarantoolctl stop [--signal=SIGKILL]
tarantoolctl restart [--signal=SIGKILL]

This options didn't change the backward compatibility of 'tarantoolctl'
tool, but gave test-run ability to stop and restart instances with
SIGKILL signal:

tarantoolctl restart --signal=9

It helped to fix the issue with hanging instances in the needed tests.

Found that some of the instances could not be killed with SIGTERM signal and it really needed SIGKILL signal to be sure that the instance would be terminated. Otherwise some tests may fail, like: No output during 120 seconds. Will abort after 120 seconds without output. List of workers not reporting the status: - 001_engine [engine/json.test.lua, memtx] at var/001_engine/json.result:598 Test hung! Result content mismatch: --- engine/json.result Fri Nov 6 08:35:45 2020 +++ var/001_engine/json.result Mon Nov 9 03:48:40 2020 @@ -596,248 +596,3 @@ - ok ... test_run:cmd("restart server default") -engine = test_run:get_cfg('engine') ---- -… Test run uses 'tarantoolctl' tool for commands in format "restart server default", and 'tarantoolctl' had only single way to kill it using SIGTERM signal. To implement ability to use it with SIGKILL was added additional non-mandatory option to stop and restart 'tarantoolctl' commands, like: tarantoolctl stop <instance> [--signal=SIGKILL] tarantoolctl restart <instance> [--signal=SIGKILL] This options didn't change the backward compatibility of 'tarantoolctl' tool, but gave test-run ability to stop and restart instances with SIGKILL signal: tarantoolctl restart <instance> --signal=9 It helped to fix the issue with hanging instances in the needed tests.

Totktonada · 2020-12-06T03:04:01Z

If something is wrong with tarantoolctl start, please, provide information, what exactly. We should not blindly change it to tarantoolctl restart without any understanding of the underlying problem. It just counter-intuitive: doing a restart in a method that is named as 'start'. You don't ever leave a comment here: just like it is normal and obvious to do 'restart --signal=9' in a start() method. Argh.

Regarding SIGKILL, the appropriate fix was implemented in PR #186: the commit links exact problems in tarantool and it is obvious how the commit changes the behaviour of the testing system. The testing system issues a warning, when meet the bad behaviour of tarantool, not just 'fix' tarantool's behaviour. Here you changed the place, where old server was not killed for some reason, but the new server with the same script wants to be started in the same directory (on the same worker). Since PR #186 resolves the main problem with stopping stuck instances, the only way to step into the problem is to miss a stop command and try to start a new server. Are you have a case of this kind?

Anyway, it should be implemented just like PR #186: try SIGTERM, wait, issue a warning and kill by SIGKILL. Maybe even it worth to give a warning if we found any old server: it should not occur if a test is written in a right way. Or, maybe, even fail the test that attempts to do this strange sequence of steps. We need a real case of this behaviour to decide.

Past days the kill_old_server() method was more actual: test-run often miss tarantool's and don't stop them; test/var directory was not removed before each testing.

I'll close the PR, because I don't see any prove that a problem exists. Feel free to reopen, when you'll have a reproducer. Don't miss the suggestions above.

avtikhon requested a review from Totktonada November 10, 2020 13:28

avtikhon self-assigned this Nov 10, 2020

avtikhon force-pushed the avtikhon/gh-5510-restart-sigkill branch 2 times, most recently from f42fa88 to d344a7b Compare November 11, 2020 05:05

avtikhon force-pushed the avtikhon/gh-5510-restart-sigkill branch from d344a7b to 591b576 Compare November 21, 2020 05:10

avtikhon force-pushed the avtikhon/use_kill_after_term branch from d3bfc60 to daed116 Compare November 21, 2020 05:34

Totktonada closed this Dec 6, 2020

avtikhon deleted the avtikhon/gh-5510-restart-sigkill branch December 6, 2020 11:12

Totktonada mentioned this pull request May 13, 2021

tarantoolctl: need to implement ability to stop instances with signal from options tarantool/tarantool#5510

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Set instances restart to SIGKILL by default #235

Set instances restart to SIGKILL by default #235

avtikhon commented Nov 10, 2020 •

edited

Loading

Totktonada commented Dec 6, 2020

Set instances restart to SIGKILL by default #235

Set instances restart to SIGKILL by default #235

Conversation

avtikhon commented Nov 10, 2020 • edited Loading

Totktonada commented Dec 6, 2020

avtikhon commented Nov 10, 2020 •

edited

Loading