-
Notifications
You must be signed in to change notification settings - Fork 9.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix panic when restart after removeMember #13479
fix panic when restart after removeMember #13479
Conversation
PTAL @ptabor |
@serathius @ptabor This issue can't be reproduced on 3.5.1, so no need to get it included in the plans for v3.5.2. Please see my comments issuecomment-973757868 and issuecomment-975024314. But I agree that it is indeed an issue in the main branch (3.6). |
@@ -339,24 +339,29 @@ func (t *Transport) RemoveAllPeers() { | |||
|
|||
// the caller of this function must have the peers mutex. | |||
func (t *Transport) removePeer(id types.ID) { | |||
if peer, ok := t.peers[id]; ok { | |||
peer, ok := t.peers[id] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'd better add a comment something like below:
etcd may remove a member again on startup due to WAL files replaying.
Please also squash the two commits and rebase this PR. |
// etcd may remove a member again on startup due to WAL files replaying. // (see: etcd-io#13479)
Three submits are inserted with a lot of Push, Squash failed, and re-raised a PR to integrate commit. |
I used the ETCD3.5 version to stably reproduce this problem, and the process is divided into the following steps:
use this code quickly do an auto snapshot: Do snapshot now #13474
#13466
The logic of the overall error is basically clear. The key is to take a snapshot before deleting the node.
Then the biggest problem is in the above code. If the index of the snapshot is less than the ConsistentIndex, start with the old backend. The old backend has the latest member information (including the deleted node information), but the WAL playback is based on the snapshotIndex. Play back at the starting position, so the removeMember raft changeConfig will be played back.