Stop replica with SIGKILL if SIGTERM failed #231

Closed
wants to merge 2 commits

Conversation

@avtikhon (Contributor) commented Nov 4, 2020

Fixed 2 issues:

  1. Stop replica with SIGKILL if SIGTERM failed

    Found that if the previous test leaves behind a process of the created
    cluster replica, then the next run of the same test fails on its
    recreation [1]:

      [047] replication/election_qsync_stress.test.lua      vinyl
      [047]
      [047] [Instance "election_replica2" returns with non-zero exit code: 1]
      [047]
      [047] Last 15 lines of Tarantool Log file [Instance
      ...
      [047] 2020-11-05 13:19:25.941 [29831] main/114/applier/unix/:/private/tmp/tnt/047_replication/election_replica3.sock box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
      [047] 2020-11-05 13:19:25.941 [29831] main/103/election_replica2 box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
      [047] 2020-11-05 13:19:25.941 [29831] main/103/election_replica2 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
      [047] 2020-11-05 13:19:25.941 [29831] main/103/election_replica2 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
      [047] [ fail ]
      [047] Test "replication/election_qsync_stress.test.lua", conf: "vinyl"
      [047]         from "fragile" list failed with results file checksum: "133676d72249c570f7124440150a8790", rerunning with server restart ...
      [047] replication/election_qsync_stress.test.lua      vinyl           [ fail ]
      [047] Test "replication/election_qsync_stress.test.lua", conf: "vinyl"
      [047]         from "fragile" list failed with results file checksum: "133676d72249c570f7124440150a8790", rerunning with server restart ...
      [047] replication/election_qsync_stress.test.lua      vinyl           [ fail ]
      ...
    

    To fix the issue, the replica should be killed with SIGKILL if the
    SIGTERM signal did not terminate its process (see the escalation
    sketch after the second item below).

    [1] - https://gitlab.com/tarantool/tarantool/-/jobs/831786472#L5060

  2. Use restart instead of start on instance creation

    Found that if the previous test leaves behind a process of the created
    cluster replica, then the next run of the same test fails on its
    recreation:

      [101] replication/ddl.test.lua                        memtx
      [101]
      [101] [Instance "ddl3" returns with non-zero exit code: 1]
      [101]
      [101] Last 15 lines of Tarantool Log file [Instance "ddl3"][/tmp/tnt/101_replication/ddl3.log]:
      ...
      [101] 2020-11-03 10:31:10.589 [6557] main/115/applier/cluster@unix/:/private/tmp/tnt/101_replication/autobootstrap4.sock
           box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
      [101] 2020-11-03 10:31:10.589 [6557] main/103/ddl3 box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
      [101] 2020-11-03 10:31:10.589 [6557] main/103/ddl3 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
      [101] 2020-11-03 10:31:10.589 [6557] main/103/ddl3 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
      [101] [ fail ]
      [101] Test "replication/ddl.test.lua", conf: "memtx"
      [101]         from "fragile" list failed with results file checksum: "a006d40205b9a67ddbbb8206b4e1764c", rerunning with server restart ...
      [101] replication/ddl.test.lua                        memtx           [ fail ]
      [101] Test "replication/ddl.test.lua", conf: "memtx"
      [101]         from "fragile" list failed with results file checksum: "a3962e843889def7f61d6f1f71461bf1", rerunning with server restart ...
      [101] replication/ddl.test.lua                        memtx           [ fail ]
      [101] Test "replication/ddl.test.lua", conf: "memtx"
      [101]         from "fragile" list failed with results file checksum: "a3962e843889def7f61d6f1f71461bf1", rerunning with server restart ...
      ...
      [101] [Instance "ddl1" returns with non-zero exit code: 1]
      [101]
      [101] Last 15 lines of Tarantool Log file [Instance "ddl1"][/tmp/tnt/101_replication/ddl1.log]:
      [101] The daemon is already running: PID 6553
      [101] Starting instance ddl1...
      [101] The daemon is already running: PID 6553
      [101] Starting instance ddl1...
      ...
    

    Found that test-run uses a 'tarantoolctl start' call to start the
    instance, but if the previous test failed, then on rerun the instance
    should be stopped before being started again. To fix this, the call
    was changed to 'tarantoolctl restart' (a minimal sketch follows right
    below).
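
For illustration, a minimal sketch of the second fix's idea follows; this is not the actual test-run change, and the helper name and arguments are made up:

    import subprocess

    def start_instance(instance_name, ctl='tarantoolctl'):
        # Hypothetical helper: 'restart' first stops a stale daemon left over
        # from a failed previous run, whereas 'start' fails with "The daemon
        # is already running" if such a process still exists.
        subprocess.check_call([ctl, 'restart', instance_name])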

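Likewise, the first fix boils down to the following kind of escalation. This is a hedged sketch assuming a subprocess.Popen handle and Python 3 (Popen.wait() has no timeout argument in Python 2); it is not the PR's actual diff:

    import signal
    import subprocess

    def stop_process(proc, term_timeout=5.0):
        # Ask politely with SIGTERM first, then escalate to SIGKILL if the
        # process is still alive after the timeout.
        proc.send_signal(signal.SIGTERM)
        try:
            proc.wait(timeout=term_timeout)
        except subprocess.TimeoutExpired:
            proc.kill()   # sends SIGKILL on POSIX
            proc.wait()   # reap the child to avoid leaving a zombie
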
@avtikhon avtikhon requested a review from Totktonada November 4, 2020 12:39
@avtikhon avtikhon force-pushed the avtikhon/use_kill_after_term branch 4 times, most recently from 1ae048b to d3bfc60 on November 5, 2020 11:19
Comment on lines +819 to +820
self.signal_default = signal.SIGTERM
self.signal_kill = signal.SIGKILL
Member
I don't see a reason for omitting direct usage of signal.SIGTERM and signal.SIGKILL.

Comment on lines +985 to +995
# check if the process died, otherwise if SIGTERM failed use SIGKILL
try:
    self.process.send_signal(0)
except OSError:
    pass
else:
    if signal == self.signal_default:
        signal = self.signal_kill
    color_log('Sending signal {0} ({1}) to process {2}\n'.format(
        signal, signame(signal), self.process.pid))
    self.process.send_signal(signal)
Member

  1. I don't understand how you expect me to review this patch. There are patches here that are also in other PRs (and I don't see changes to the same place in the code, at least not here). Should I look only at the commit whose title matches the PR title? I guess so. But I don't understand why the other five commits are here.

First, the same as for PR #228, PR #229 and PR #230:

  1. A normal termination fails. This is unlikely to be the expected situation, so a test that asks to stop the instance should receive an error in this case; at least a warning should be printed to the terminal. It should not be a silent attempt to 'fix' incorrect behaviour of tarantool or of a test: that is not what a testing system should do.

Second, I'm not sure that the provided analysis is correct:

  1. I ran the same test many times and did not see any instances that were not successfully killed afterwards by 'killall tarantool' (which sends SIGTERM).
  2. When the test hangs, I see in 'ps ax' one or two replicasets, where one of the instances is in orphan mode. Nothing like 'a stuck instance from an old replicaset + 3 from the new one'. Or maybe it is '1 stuck from the old + 2 from the new'? Anyway, I need some proof of your theory, e.g. more logging in test-run that would reveal the problem.

Third, I don't understand how it is supposed to work.

  1. self.wait_stop() waits for the process using self.process.wait(), which should not return until the process has terminated (according to the docs). How is it possible that it returned but the process is still alive? If you have proof that this really happens, please share it.
  2. send_signal(0) will or will not raise OSError depending on whether the process was already reaped by poll(), wait(), communicate() or the like (I verified this on Linux). This behaviour is not specified at all in the docs for Python 2. Moreover, Python 3 fixes this inconsistency and its docs say 'Do nothing if the process completed' (I verified this too, and it is so). Even if we rely on the unspecified and possibly platform-specific behaviour of Python 2, the code will silently break after the upcoming switch to Python 3 (it is in work again, AFAIK). I guess you confused <Popen_object>.send_signal() with os.kill(); see the sketch after this list.
  3. There is @rtokarev's patch (PR #186, 'Add the 'timeout' option into 'stop server' command and kill a server…') with the same idea in mind, but it at least looks workable in theory.
  4. Okay, maybe I am missing something obvious? I applied the patch (only this one, without the others in the PR) and the testing hangs just like before. See below: I need a method to verify the changes.
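
To make point 2 concrete, here is a hedged sketch of the usual liveness probe, os.kill(pid, 0); it is not code from this PR. Note that it also reports an unreaped zombie as alive:

    import errno
    import os

    def pid_is_alive(pid):
        # Signal 0 performs error checking only; no signal is delivered.
        try:
            os.kill(pid, 0)
        except OSError as err:
            if err.errno == errno.ESRCH:   # no such process
                return False
            if err.errno == errno.EPERM:   # exists, but owned by another user
                return True
            raise
        return True

Popen.send_signal(0), by contrast, is documented in Python 3 to do nothing once the child has completed, so it cannot serve as such a probe.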

And the last:

  1. It seems that your way of revealing root causes and verifying changes is not sufficient. Let's define a process for local testing of test-run changes of this kind (regarding flaky test failures), which will prove that:
    • the problem exists;
    • the exact problem is shown by some logs;
    • the problem is resolved after the patch (and the logs prove this too).

Sasha, please take the 8th point very seriously. If you don't provide proof that you correctly spotted a problem and actually fixed it, I'll be obliged to do it as part of the review. That way, any time available for reviewing your patches will run out soon (because I have other tasks with their own deadlines).

Of course, in theory I could just verify the correctness of the code without trying to understand whether it will solve the problem given as the reason for the change. However, if everything I wrote above is more or less true (and I haven't missed something very obvious), I have to ask for proof or find it myself. Sorry for such tight criteria, but I don't see any other way to proceed with patches of this kind. This is not a case where there is an obvious way to verify the changes, so developing a verification method is an integral part of the task.
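
As a possible starting point for such a verification process, a local loop along these lines could be used; the script path, test name and process matching here are assumptions, not an agreed procedure:

    import subprocess

    # Run the suspected flaky test repeatedly and report leftover tarantool
    # processes after each run, to confirm or refute the stale-replica theory.
    for attempt in range(100):
        subprocess.call(['./test-run.py',
                         'replication/election_qsync_stress.test.lua'])
        try:
            leftover = subprocess.check_output(['pgrep', '-l', 'tarantool'])
        except subprocess.CalledProcessError:
            leftover = b''  # pgrep exits non-zero when nothing matches
        if leftover:
            print('attempt {}: leftover processes:\n{}'.format(
                attempt, leftover.decode()))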

@avtikhon avtikhon force-pushed the avtikhon/use_kill_after_term branch from d3bfc60 to daed116 on November 21, 2020 05:34
@Totktonada (Member)

Superseded by PR #186.

@Totktonada Totktonada closed this Dec 6, 2020
@avtikhon avtikhon deleted the avtikhon/use_kill_after_term branch December 6, 2020 11:11