Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix panic when restart after removeMember #13479

Closed

Conversation

yangxuanjia
Copy link
Contributor

@yangxuanjia yangxuanjia commented Nov 15, 2021

I used the ETCD3.5 version to stably reproduce this problem, and the process is divided into the following steps:

  1. Establish a 3 node cluster A, B, C
  2. Write a few keys casually
  3. Make an automatic snapshot for each node
    use this code quickly do an auto snapshot: Do snapshot now #13474
  4. removeMember a node C
  5. Restart node A, -----> Panic

#13466

The logic of the overall error is basically clear. The key is to take a snapshot before deleting the node.
Then the biggest problem is in the above code. If the index of the snapshot is less than the ConsistentIndex, start with the old backend. The old backend has the latest member information (including the deleted node information), but the WAL playback is based on the snapshotIndex. Play back at the starting position, so the removeMember raft changeConfig will be played back.

@serathius serathius mentioned this pull request Dec 6, 2021
9 tasks
@serathius
Copy link
Member

PTAL @ptabor

@ahrtr
Copy link
Member

ahrtr commented Dec 17, 2021

PTAL @ptabor

@serathius @ptabor This issue can't be reproduced on 3.5.1, so no need to get it included in the plans for v3.5.2. Please see my comments issuecomment-973757868 and issuecomment-975024314.

But I agree that it is indeed an issue in the main branch (3.6).

@serathius serathius added this to the etcd-v3.6 milestone Dec 21, 2021
@@ -339,24 +339,29 @@ func (t *Transport) RemoveAllPeers() {

// the caller of this function must have the peers mutex.
func (t *Transport) removePeer(id types.ID) {
if peer, ok := t.peers[id]; ok {
peer, ok := t.peers[id]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd better add a comment something like below:
etcd may remove a member again on startup due to WAL files replaying.

@ahrtr
Copy link
Member

ahrtr commented Jan 26, 2022

Please also squash the two commits and rebase this PR.

// etcd may remove a member again on startup due to WAL files replaying.
// (see: etcd-io#13479)
@yangxuanjia
Copy link
Contributor Author

Three submits are inserted with a lot of Push, Squash failed, and re-raised a PR to integrate commit.
#13645

@ahrtr ahrtr removed this from the etcd-v3.6 milestone Jul 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants