Stop replica with SIGKILL if SIGTERM failed #231
Conversation
1ae048b to d3bfc60
self.signal_default = signal.SIGTERM
self.signal_kill = signal.SIGKILL
I don't see a reason to avoid using signal.SIGTERM and signal.SIGKILL directly.
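A minimal sketch of the direct usage the reviewer seems to have in mind, with the intermediate attributes dropped; the Server class and method names here are illustrative, not the actual test-run code:

import signal

class Server:
    def __init__(self, process):
        # process is a subprocess.Popen object (illustrative wrapper).
        self.process = process

    def stop(self):
        # Use the constant directly instead of self.signal_default.
        self.process.send_signal(signal.SIGTERM)

    def kill(self):
        # Use the constant directly instead of self.signal_kill.
        self.process.send_signal(signal.SIGKILL)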
# check if the process died, otherwise if SIGTERM failed use SIGKILL
try:
    self.process.send_signal(0)
except OSError:
    pass
else:
    if signal == self.signal_default:
        signal = self.signal_kill
    color_log('Sending signal {0} ({1}) to process {2}\n'.format(
        signal, signame(signal), self.process.pid))
    self.process.send_signal(signal)
- I don't understand how you expect me to review this patch. There are patches here that belong to the other PRs (and I don't see changes to the same place in the code, at least not here). Should I look only at the commit whose title matches the PR title? I guess so. But I don't understand why the other five commits are here.

First, the same as for PR #228, PR #229 and PR #230:

- A normal termination fails. That is unlikely to be the expected situation, so a test that asks to stop the instance should receive an error in this case. At least a warning should be printed to the terminal. It should not be a silent attempt to 'fix' incorrect behaviour of tarantool or of a test: that is not what a testing system should do.

Second, I'm not sure that the provided analysis is correct:

- I ran the same test many times and I didn't see any instances that would not be successfully killed by killall tarantool afterwards (which sends SIGTERM).
- When the test hangs, I see in ps ax one or two replica sets, where one of the instances is in orphan mode. Nothing like 'a stuck instance from an old replica set + 3 from the new one'. Or maybe it is '1 stuck from the old + 2 from the new'? Anyway, I need some proof of your theory, e.g. more logging in test-run that would reveal the problem.
Third, I don't understand how it is supposed to work.

- self.wait_stop() waits for the process using self.process.wait(), which should not return until the process is terminated (according to the docs). How is it possible that it returned, but the process is still alive? If you have found proof that this really happens, please share it.
- send_signal(0) will or will not raise OSError depending on whether the process was already reaped by poll(), wait(), communicate() or the like (I verified this on Linux). This behaviour is not specified at all in the docs for Python 2. Moreover, Python 3 fixes this inconsistency and its docs say 'Do nothing if the process completed' (I verified this too, it is so). Even if we rely on the unspecified and possibly platform-specific behaviour of Python 2, the code will silently break after the upcoming switch to Python 3 (it is in work again, AFAIK). I guess you are confusing <Popen_object>.send_signal() with os.kill() (see the sketch after this comment).
- There is @rtokarev's patch (PR Add the 'timeout' option into 'stop server' command and kill a server… #186) with the same idea in mind, but it at least looks workable in theory.
- Okay, maybe I am missing something obvious? I applied the patch (only this one, without the others in the PR) and the testing hangs just like before. See below: I need a method to verify the changes.
And the last:
- It seems your way of revealing the root of a problem and of verifying changes is not sufficient. Let's define a process for local testing of test-run changes of this kind (regarding flaky test failures), which will prove that:
  - the problem exists;
  - the exact problem is shown by some logs;
  - the problem is resolved after the patch (and the logs prove this too).
Sasha, please take the 8th point very seriously. If you don't give proof that you correctly spotted a problem and actually fixed it, I'll be obligated to do it as part of the review. That way, any time available for reviewing your patches will run out soon (because I have other tasks with their own deadlines).
Of course, in theory I could just verify the correctness of the code without attempting to understand whether it will solve the problem that is given as the reason for the change. However, if everything I wrote above is more or less true (and I haven't missed something very obvious), I should ask for proofs or find them myself. Sorry for such tight criteria, but I don't see any other way to proceed with patches of this kind. This is not something where there is an obvious way to verify the changes, so developing a verification method is an integral part of the task.
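A minimal sketch of the liveness check and escalation being discussed above, assuming a subprocess.Popen child; the helper name and timeout value are illustrative, not the actual test-run code. Popen.poll() is the documented way to check whether the child has exited, and os.kill(pid, 0) is the documented way to probe an arbitrary PID, unlike Popen.send_signal(0), whose behaviour differs between Python 2 and Python 3.

import signal
import time


def stop_with_escalation(process, timeout=5.0):
    # Hypothetical helper: ask the child to terminate with SIGTERM and
    # escalate to SIGKILL if it is still alive after the timeout.
    process.send_signal(signal.SIGTERM)

    deadline = time.time() + timeout
    while time.time() < deadline:
        # poll() returns None while the child is still running; this is
        # specified behaviour, unlike send_signal(0) on Python 2.
        if process.poll() is not None:
            return process.returncode
        time.sleep(0.1)

    # The child did not exit in time: force-kill and reap it.
    process.send_signal(signal.SIGKILL)
    return process.wait()

This is only one possible shape of the fix; @rtokarev's PR #186, mentioned above, takes a similar timeout-based approach.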
Found that if the previous test leaves a process of the created cluster
replica, then the next run of the same test fails on its recreation:

[101] replication/ddl.test.lua memtx
[101]
[101] [Instance "ddl3" returns with non-zero exit code: 1]
[101]
[101] Last 15 lines of Tarantool Log file [Instance "ddl3"][/tmp/tnt/101_replication/ddl3.log]:
...
[101] 2020-11-03 10:31:10.589 [6557] main/115/applier/cluster@unix/:/private/tmp/tnt/101_replication/autobootstrap4.sock box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
[101] 2020-11-03 10:31:10.589 [6557] main/103/ddl3 box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
[101] 2020-11-03 10:31:10.589 [6557] main/103/ddl3 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
[101] 2020-11-03 10:31:10.589 [6557] main/103/ddl3 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
[101] [ fail ]
[101] Test "replication/ddl.test.lua", conf: "memtx"
[101] from "fragile" list failed with results file checksum: "a006d40205b9a67ddbbb8206b4e1764c", rerunning with server restart ...
[101] replication/ddl.test.lua memtx [ fail ]
[101] Test "replication/ddl.test.lua", conf: "memtx"
[101] from "fragile" list failed with results file checksum: "a3962e843889def7f61d6f1f71461bf1", rerunning with server restart ...
[101] replication/ddl.test.lua memtx [ fail ]
[101] Test "replication/ddl.test.lua", conf: "memtx"
[101] from "fragile" list failed with results file checksum: "a3962e843889def7f61d6f1f71461bf1", rerunning with server restart ...
...
[101] [Instance "ddl1" returns with non-zero exit code: 1]
[101]
[101] Last 15 lines of Tarantool Log file [Instance "ddl1"][/tmp/tnt/101_replication/ddl1.log]:
[101] The daemon is already running: PID 6553
[101] Starting instance ddl1...
[101] The daemon is already running: PID 6553
[101] Starting instance ddl1...
...

Found that test-run uses a 'tarantoolctl start' call to start the
instance, but if the previous test failed, then on rerun the instance
should be stopped before it is started again. To fix this, the call
was changed to 'tarantoolctl restart'.
Found that if the previous test leaves a process of the created cluster
replica, then the next run of the same test fails on its recreation [1]:

[047] replication/election_qsync_stress.test.lua vinyl
[047]
[047] [Instance "election_replica2" returns with non-zero exit code: 1]
[047]
[047] Last 15 lines of Tarantool Log file [Instance ...
[047] 2020-11-05 13:19:25.941 [29831] main/114/applier/unix/:/private/tmp/tnt/047_replication/election_replica3.sock box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
[047] 2020-11-05 13:19:25.941 [29831] main/103/election_replica2 box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
[047] 2020-11-05 13:19:25.941 [29831] main/103/election_replica2 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
[047] 2020-11-05 13:19:25.941 [29831] main/103/election_replica2 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
[047] [ fail ]
[047] Test "replication/election_qsync_stress.test.lua", conf: "vinyl"
[047] from "fragile" list failed with results file checksum: "133676d72249c570f7124440150a8790", rerunning with server restart ...
[047] replication/election_qsync_stress.test.lua vinyl [ fail ]
[047] Test "replication/election_qsync_stress.test.lua", conf: "vinyl"
[047] from "fragile" list failed with results file checksum: "133676d72249c570f7124440150a8790", rerunning with server restart ...
[047] replication/election_qsync_stress.test.lua vinyl [ fail ]
...

To fix the issue, the replica should be killed with SIGKILL if SIGTERM
didn't kill its process.

[1] - https://gitlab.com/tarantool/tarantool/-/jobs/831786472#L5060
d3bfc60 to daed116
Superseded by PR #186.
Fixed 2 issues:
Stop replica with SIGKILL if SIGTERM failed
Found that if the previous test leaves a process of the created cluster
replica, then the next run of the same test fails on its recreation [1]:
To fix the issue, the replica should be killed with SIGKILL if
SIGTERM didn't kill its process.
[1] - https://gitlab.com/tarantool/tarantool/-/jobs/831786472#L5060
Use restart instead of start on instance creation
Found that if the previous test leaves a process of the created cluster
replica, then the next run of the same test fails on its recreation.
Found that test-run uses a 'tarantoolctl start' call to start the
instance, but if the previous test failed, then on rerun the instance
should be stopped before it is started again. To fix this, the call
was changed to 'tarantoolctl restart'.
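A minimal sketch of the kind of call change described above, assuming the instance is launched through tarantoolctl via subprocess; the helper name is illustrative, not the actual test-run code:

import subprocess


def start_instance(instance_name):
    # 'restart' stops a stale daemon left over from a failed previous
    # run (if any) before starting a fresh one, whereas a plain 'start'
    # fails with 'The daemon is already running'.
    subprocess.check_call(['tarantoolctl', 'restart', instance_name])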