Stop replica with SIGKILL if SIGTERM failed #231

Closed
wants to merge 2 commits

Conversation

@avtikhon (Contributor) commented Nov 4, 2020

Fixed 2 issues:

  1. Stop replica with SIGKILL if SIGTERM failed

    Found that if the previous test leaves behind a process of the created
    cluster replica, then the next run of the same test fails on its
    recreation [1]:

      [047] replication/election_qsync_stress.test.lua      vinyl
      [047]
      [047] [Instance "election_replica2" returns with non-zero exit code: 1]
      [047]
      [047] Last 15 lines of Tarantool Log file [Instance
      ...
      [047] 2020-11-05 13:19:25.941 [29831] main/114/applier/unix/:/private/tmp/tnt/047_replication/election_replica3.sock box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
      [047] 2020-11-05 13:19:25.941 [29831] main/103/election_replica2 box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
      [047] 2020-11-05 13:19:25.941 [29831] main/103/election_replica2 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
      [047] 2020-11-05 13:19:25.941 [29831] main/103/election_replica2 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
      [047] [ fail ]
      [047] Test "replication/election_qsync_stress.test.lua", conf: "vinyl"
      [047]         from "fragile" list failed with results file checksum: "133676d72249c570f7124440150a8790", rerunning with server restart ...
      [047] replication/election_qsync_stress.test.lua      vinyl           [ fail ]
      [047] Test "replication/election_qsync_stress.test.lua", conf: "vinyl"
      [047]         from "fragile" list failed with results file checksum: "133676d72249c570f7124440150a8790", rerunning with server restart ...
      [047] replication/election_qsync_stress.test.lua      vinyl           [ fail ]
      ...
    

    To fix the issue, the replica should be killed with SIGKILL if the
    SIGTERM signal did not terminate its process (see the escalation
    sketch after the second item below).

    [1] - https://gitlab.com/tarantool/tarantool/-/jobs/831786472#L5060

  2. Use restart instead of start on instance creation

    Found that if the previous test leaves behind a process of the created
    cluster replica, then the next run of the same test fails on its
    recreation:

      [101] replication/ddl.test.lua                        memtx
      [101]
      [101] [Instance "ddl3" returns with non-zero exit code: 1]
      [101]
      [101] Last 15 lines of Tarantool Log file [Instance "ddl3"][/tmp/tnt/101_replication/ddl3.log]:
      ...
      [101] 2020-11-03 10:31:10.589 [6557] main/115/applier/cluster@unix/:/private/tmp/tnt/101_replication/autobootstrap4.sock
           box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
      [101] 2020-11-03 10:31:10.589 [6557] main/103/ddl3 box.cc:183 E> ER_READONLY: Can't modify data because this instance is in read-only mode.
      [101] 2020-11-03 10:31:10.589 [6557] main/103/ddl3 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
      [101] 2020-11-03 10:31:10.589 [6557] main/103/ddl3 F> can't initialize storage: Can't modify data because this instance is in read-only mode.
      [101] [ fail ]
      [101] Test "replication/ddl.test.lua", conf: "memtx"
      [101]         from "fragile" list failed with results file checksum: "a006d40205b9a67ddbbb8206b4e1764c", rerunning with server restart ...
      [101] replication/ddl.test.lua                        memtx           [ fail ]
      [101] Test "replication/ddl.test.lua", conf: "memtx"
      [101]         from "fragile" list failed with results file checksum: "a3962e843889def7f61d6f1f71461bf1", rerunning with server restart ...
      [101] replication/ddl.test.lua                        memtx           [ fail ]
      [101] Test "replication/ddl.test.lua", conf: "memtx"
      [101]         from "fragile" list failed with results file checksum: "a3962e843889def7f61d6f1f71461bf1", rerunning with server restart ...
      ...
      [101] [Instance "ddl1" returns with non-zero exit code: 1]
      [101]
      [101] Last 15 lines of Tarantool Log file [Instance "ddl1"][/tmp/tnt/101_replication/ddl1.log]:
      [101] The daemon is already running: PID 6553
      [101] Starting instance ddl1...
      [101] The daemon is already running: PID 6553
      [101] Starting instance ddl1...
      ...
    

    Found that test-run uses a 'tarantoolctl start' call to start the
    instance, but if the previous test failed, then on rerun the instance
    should be stopped before being started again. To fix this, the call
    was changed to 'tarantoolctl restart' (a minimal sketch follows right
    below).
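
For illustration, a minimal sketch of the second fix's idea follows; this is not the actual test-run change, and the helper name and arguments are made up:

    import subprocess

    def start_instance(instance_name, ctl='tarantoolctl'):
        # Hypothetical helper: 'restart' first stops a stale daemon left over
        # from a failed previous run, whereas 'start' fails with "The daemon
        # is already running" if such a process still exists.
        subprocess.check_call([ctl, 'restart', instance_name])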

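Likewise, the first fix boils down to the following kind of escalation. This is a hedged sketch assuming a subprocess.Popen handle and Python 3 (Popen.wait() has no timeout argument in Python 2); it is not the PR's actual diff:

    import signal
    import subprocess

    def stop_process(proc, term_timeout=5.0):
        # Ask politely with SIGTERM first, then escalate to SIGKILL if the
        # process is still alive after the timeout.
        proc.send_signal(signal.SIGTERM)
        try:
            proc.wait(timeout=term_timeout)
        except subprocess.TimeoutExpired:
            proc.kill()   # sends SIGKILL on POSIX
            proc.wait()   # reap the child to avoid leaving a zombie
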
@avtikhon avtikhon requested a review from Totktonada November 4, 2020 12:39
@avtikhon avtikhon force-pushed the avtikhon/use_kill_after_term branch 4 times, most recently from 1ae048b to d3bfc60 on November 5, 2020 11:19
Comment on lines +819 to +820
self.signal_default = signal.SIGTERM
self.signal_kill = signal.SIGKILL
Member
I don't see a reason for omitting direct usage of signal.SIGTERM and signal.SIGKILL.

Comment on lines +985 to +995
# check if the process died, otherwise if SIGTERM failed use SIGKILL
try:
    self.process.send_signal(0)
except OSError:
    pass
else:
    if signal == self.signal_default:
        signal = self.signal_kill
    color_log('Sending signal {0} ({1}) to process {2}\n'.format(
        signal, signame(signal), self.process.pid))
    self.process.send_signal(signal)
Member

  1. I don't understand how you expect me to review this patch. There are patches here that are also in other PRs (and I don't see changes to the same place in the code, at least not here). Should I look only at the commit whose title matches the PR title? I guess so. But I don't understand why the other five commits are here.

First, the same as for PR #228, PR #229 and PR #230:

  1. A normal termination fails. This is unlikely to be the expected situation, so a test that asks to stop the instance should receive an error in this case; at least a warning should be printed to the terminal. It should not be a silent attempt to 'fix' incorrect behaviour of tarantool or of a test: that is not what a testing system should do.

Second, I'm not sure that the provided analysis is correct:

  1. I ran the same test many times and did not see any instances that were not successfully killed afterwards by 'killall tarantool' (which sends SIGTERM).
  2. When the test hangs, I see in 'ps ax' one or two replicasets, where one of the instances is in orphan mode. Nothing like 'a stuck instance from an old replicaset + 3 from the new one'. Or maybe it is '1 stuck from the old + 2 from the new'? Anyway, I need some proof of your theory, e.g. more logging in test-run that would reveal the problem.

Third, I don't understand how it is supposed to work.

  1. self.wait_stop() waits for the process using self.process.wait(), which should not return until the process has terminated (according to the docs). How is it possible that it returned but the process is still alive? If you have proof that this really happens, please share it.
  2. send_signal(0) will or will not raise OSError depending on whether the process was already reaped by poll(), wait(), communicate() or the like (I verified this on Linux). This behaviour is not specified at all in the docs for Python 2. Moreover, Python 3 fixes this inconsistency and its docs say 'Do nothing if the process completed' (I verified this too, and it is so). Even if we rely on the unspecified and possibly platform-specific behaviour of Python 2, the code will silently break after the upcoming switch to Python 3 (it is in work again, AFAIK). I guess you confused <Popen_object>.send_signal() with os.kill(); see the sketch after this list.
  3. There is @rtokarev's patch (PR #186, 'Add the 'timeout' option into 'stop server' command and kill a server…') with the same idea in mind, but it at least looks workable in theory.
  4. Okay, maybe I am missing something obvious? I applied the patch (only this one, without the others in the PR) and the testing hangs just like before. See below: I need a method to verify the changes.
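
To make point 2 concrete, here is a hedged sketch of the usual liveness probe, os.kill(pid, 0); it is not code from this PR. Note that it also reports an unreaped zombie as alive:

    import errno
    import os

    def pid_is_alive(pid):
        # Signal 0 performs error checking only; no signal is delivered.
        try:
            os.kill(pid, 0)
        except OSError as err:
            if err.errno == errno.ESRCH:   # no such process
                return False
            if err.errno == errno.EPERM:   # exists, but owned by another user
                return True
            raise
        return True

Popen.send_signal(0), by contrast, is documented in Python 3 to do nothing once the child has completed, so it cannot serve as such a probe.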

And the last:

  1. It seems that your way of revealing root causes and verifying changes is not sufficient. Let's define a process for local testing of test-run changes of this kind (regarding flaky test failures), which will prove that:
    • the problem exists;
    • the exact problem is shown by some logs;
    • the problem is resolved after the patch (and the logs prove this too).

Sasha, please take the 8th point very seriously. If you don't provide proof that you correctly spotted a problem and actually fixed it, I'll be obliged to do it as part of the review. That way, any time available for reviewing your patches will run out soon (because I have other tasks with their own deadlines).

Of course, in theory I could just verify the correctness of the code without trying to understand whether it will solve the problem given as the reason for the change. However, if everything I wrote above is more or less true (and I haven't missed something very obvious), I have to ask for proof or find it myself. Sorry for such tight criteria, but I don't see any other way to proceed with patches of this kind. This is not a case where there is an obvious way to verify the changes, so developing a verification method is an integral part of the task.
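
As a possible starting point for such a verification process, a local loop along these lines could be used; the script path, test name and process matching here are assumptions, not an agreed procedure:

    import subprocess

    # Run the suspected flaky test repeatedly and report leftover tarantool
    # processes after each run, to confirm or refute the stale-replica theory.
    for attempt in range(100):
        subprocess.call(['./test-run.py',
                         'replication/election_qsync_stress.test.lua'])
        try:
            leftover = subprocess.check_output(['pgrep', '-l', 'tarantool'])
        except subprocess.CalledProcessError:
            leftover = b''  # pgrep exits non-zero when nothing matches
        if leftover:
            print('attempt {}: leftover processes:\n{}'.format(
                attempt, leftover.decode()))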

@avtikhon avtikhon force-pushed the avtikhon/use_kill_after_term branch from d3bfc60 to daed116 on November 21, 2020 05:34
@Totktonada (Member)

Superseded by PR #186.

@Totktonada Totktonada closed this Dec 6, 2020
@avtikhon avtikhon deleted the avtikhon/use_kill_after_term branch December 6, 2020 11:11