storage: panic from raftgroup.Campaign call to becomeLeader #41588

Closed
peymanh opened this issue Oct 15, 2019 · 9 comments

Labels: A-kv-replication (Relating to Raft, consensus, and coordination.)

peymanh commented Oct 15, 2019

Hi, we have 3 container replicas of CockroachDB on 3 different servers and manage them with Docker Swarm. Today, after deploying the latest changes to the system, our second container suddenly got stuck in the "assigned to running" state.

I tried to debug the problem by looking at the container logs. This is the output of the docker service logs roach2 --tail 100 -f command:

[screenshot of the container log output showing the panic]

Environment:

  • CockroachDB version: 2.1.5
  • Server OS: Ubuntu
  • Client app: cockroach sql

tbg (Member) commented Oct 16, 2019

Hi @peymanh, what was it that you did in

Today, after deploying the latest changes to the system,

Was this updating the version of CockroachDB, or deploying your app again? Am I understanding correctly that this node will crash in the same way every time you attempt to restart it?

Technical analysis so far:

The error indicates that this node has a single-member replica that is asked to campaign, but that this replica's raft group is improperly configured (I think its r.id does not match the entry it has for itself in r.prs, if it even has one).

I'd be interested in taking a look at the data directory for this node, if this is at all possible (I assume this contains production data).

tbg (Member) commented Oct 16, 2019

@bdarnell does this error ring a bell? v2.1.5 is far from the most recent 2.1 release, but I don't remember us fixing anything related to this bug. Note that v2.1.5 already has this code:

func (r *Replica) maybeCampaignOnWakeLocked(ctx context.Context) {
	// Raft panics if a node that is not currently a member of the
	// group tries to campaign. That happens primarily when we apply
	// preemptive snapshots.
	if _, currentMember := r.mu.state.Desc.GetReplicaDescriptorByID(r.mu.replicaID); !currentMember {
		return
	}
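
For illustration, here is a minimal, hypothetical reproduction of the failure mode that this guard exists for. It is not taken from this cluster, and it assumes the etcd/raft package as vendored around CockroachDB 2.1 (imported as github.com/coreos/etcd/raft at the time); current etcd/raft releases refuse to campaign a node that is not a member rather than panicking.

package main

import (
    "context"
    "log"
    "time"

    "github.com/coreos/etcd/raft"
)

func main() {
    cfg := &raft.Config{
        ID:              2, // this node's raft ID
        ElectionTick:    10,
        HeartbeatTick:   1,
        Storage:         raft.NewMemoryStorage(),
        MaxSizePerMsg:   1 << 20,
        MaxInflightMsgs: 256,
    }
    // Bootstrap a group whose membership does NOT include ID 2: the analogue
    // of r.id missing from r.prs in the analysis above.
    n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})
    defer n.Stop()

    // An explicit campaign is what maybeCampaignOnWakeLocked guards against;
    // that era of raft did not re-check membership on this path. The lone
    // self-vote satisfies the single-voter quorum, becomeLeader runs, and the
    // lookup of this node's own (missing) progress entry fails with a nil
    // pointer dereference, which is the shape of the panic in the original
    // report.
    if err := n.Campaign(context.Background()); err != nil {
        log.Fatal(err)
    }

    // The panic fires on the raft node's internal goroutine, not inside
    // Campaign itself, so give it a moment to process the message.
    time.Sleep(time.Second)
}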

peymanh (Author) commented Oct 16, 2019

@tbg

Hi @peymanh, what was it that you did in

Today, after deploying the latest changes to the system,

Was this updating the version of CockroachDB, or deploying your app again? Am I understanding correctly that this node will crash in the same way every time you attempt to restart it?

First, we deployed our app again and our second replica of CockroachDB went mad :(
In order to solve the problem, we thought it was due to the old version of the image, so we changed it to 2.1.9. I even rebooted our second server. But changing the version didn't help either, so the problem exists in the newest version too.

Because of the healthcheck we configured for the container, it keeps trying to restart itself, but it never manages to become healthy.

tbg changed the title from "runtime error: invalid memory address or nil pointer dereference on container replica" to "storage: panic from raftgroup.Campaign call to becomeLeader" on Oct 16, 2019

tbg (Member) commented Oct 16, 2019

Thanks @peymanh. You're not able to (privately) share the data directory with us, is that correct?

To "unblock" your issue, I would suggest taking a backup of the data dir (i.e. just copy it somewhere while the node is not running and not attempting to start). Then you can "reset" n2 by decommissioning it first (steps at https://www.cockroachlabs.com/docs/stable/remove-nodes.html#step-2-mark-the-dead-node-as-decommissioned) and then removing it physically (i.e. reset the data dir for that node). You should then be able to add it back to the cluster (it will get a new NodeID), and your cluster should become healthy again.

We can then discuss next steps. I would like to see the output of ./cockroach debug range-descriptors /your/data/dir/for/that/node. You'll want to send this privately as the keys printed there can leak some private information. My email is [email protected].
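
A condensed sketch of those steps, with hypothetical host names, node ID, and paths (assuming an insecure cluster; add certificate flags otherwise):

# 1. With the roach2 container stopped, back up its data directory.
cp -a /path/to/roach2-data /path/to/roach2-data.backup

# 2. From a healthy node, mark the dead node (NodeID 2 here) as decommissioned.
cockroach node decommission 2 --insecure --host=roach1

# 3. Run the range-descriptors dump against the backed-up copy and share it privately.
./cockroach debug range-descriptors /path/to/roach2-data.backup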

bdarnell (Contributor) commented Oct 16, 2019

@bdarnell does this error ring a bell? v2.1.5 is far from the most recent 2.1 release, but I don't remember us fixing something related to this bug.

We've seen it in the distant past (#20629, fixed before the 2.0 release) but I can't think of a more recent occurrence.

The fact that it's r1 makes me wonder about bootstrapping issues. Could the cluster have bootstrapped twice due to a missing --join flag or a repeated cockroach init? (that should have resulted in a clean error earlier in the process, but maybe there were gaps in the safety checks in 2.1. I think we made this more robust in 19.1)
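
For reference, a bootstrap pattern that avoids a double init would look roughly like this (hypothetical host names; the point is that every node gets the full --join list and cockroach init runs exactly once, against a single node):

# On each of the three nodes (other flags omitted):
cockroach start --insecure --join=roach1,roach2,roach3

# Exactly once, against any one node, after all three have started:
cockroach init --insecure --host=roach1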

tbg (Member) commented Oct 16, 2019

Hmm, that's a good point. @peymanh we'll at least need to see the logs for n2 leading up to the first instance of the crash. Could you submit that to https://support.cockroachlabs.com/hc/en-us? Please mention this issue in your message. Thank you.

peymanh (Author) commented Oct 22, 2019

To "unblock" your issue, I would suggest taking a backup of the data dir (i.e. just copy it somewhere while the node is not running and not attempting to start). Then you can "reset" n2 by decommissioning it first (steps at https://www.cockroachlabs.com/docs/stable/remove-nodes.html#step-2-mark-the-dead-node-as-decommissioned) and then removing it physically (i.e. reset the data dir for that node). You should then be able to add it back to the cluster (it will get a new NodeID), and your cluster should become healthy again.

@tbg We took a backup of the data directory and then ran the decommissioning command on n1 (one of our live nodes). The problem is that after hours of waiting, the decommissioning procedure never finished and just kept printing dots.

Is there any solution other than decommissioning? We are thinking of removing our second container with docker service rm, removing its volume with docker volume prune, and finally deploying the stack again. Would that be another way to unblock our issue?

tbg (Member) commented Oct 22, 2019

Decommissioning only works when it can move the replicas to a new home. You indicated that you initially had three servers, now one is down. So you have two running but need three running. If you re-add the down node without its data directory things should normalize (and the decommissioning should finish).
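
In concrete terms, for the Docker Swarm setup described above (service, volume, and stack names are placeholders):

# Remove the crashing replica's service and wipe its data volume
# (a backup of the data directory was already taken).
docker service rm roach2
docker volume rm roach2-data

# Redeploy the stack; the node rejoins with a new NodeID, and with three
# nodes live again the stuck decommission should be able to make progress.
docker stack deploy -c docker-compose.yml mystack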

lunevalex added the A-kv-replication (Relating to Raft, consensus, and coordination.) label on Jul 28, 2020
lunevalex (Collaborator) commented

Going to close this based on the last comment from @tbg
