storage: panic from raftgroup.Campaign call to becomeLeader #41588

Closed
peymanh opened this issue Oct 15, 2019 · 9 comments

Labels: A-kv-replication (Relating to Raft, consensus, and coordination.)

peymanh commented Oct 15, 2019

Hi, we have 3 container replicas of CockroachDB on 3 different servers and manage them with Docker Swarm. Today, after deploying the latest changes to the system, our second container suddenly got stuck in the "assigned to running" state.

I tried to debug the problem by looking at the container logs. This is the output of the docker service logs roach2 --tail 100 -f command:

[screenshot of the container log output showing the panic]

Environment:

  • CockroachDB version: 2.1.5
  • Server OS: Ubuntu
  • Client app: cockroach sql

tbg (Member) commented Oct 16, 2019

Hi @peymanh, what was it that you did in

Today, after deploying the latest changes to the system,

Was this updating the version of CockroachDB, or deploying your app again? Am I understanding correctly that this node will crash in the same way every time you attempt to restart it?

Technical analysis so far:

The error indicates that this node has a single-member replica that is asked to campaign, but that this replica's raft group is improperly configured (I think its r.id does not match the entry it has for itself in r.prs, if it even has one).

I'd be interested in taking a look at the data directory for this node, if this is at all possible (I assume this contains production data).

tbg (Member) commented Oct 16, 2019

@bdarnell does this error ring a bell? v2.1.5 is far from the most recent 2.1 release, but I don't remember us fixing anything related to this bug. Note that v2.1.5 already has this code:

func (r *Replica) maybeCampaignOnWakeLocked(ctx context.Context) {
	// Raft panics if a node that is not currently a member of the
	// group tries to campaign. That happens primarily when we apply
	// preemptive snapshots.
	if _, currentMember := r.mu.state.Desc.GetReplicaDescriptorByID(r.mu.replicaID); !currentMember {
		return
	}
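
For illustration, here is a minimal, hypothetical reproduction of the failure mode that this guard exists for. It is not taken from this cluster, and it assumes the etcd/raft package as vendored around CockroachDB 2.1 (imported as github.com/coreos/etcd/raft at the time); current etcd/raft releases refuse to campaign a node that is not a member rather than panicking.

package main

import (
    "context"
    "log"
    "time"

    "github.com/coreos/etcd/raft"
)

func main() {
    cfg := &raft.Config{
        ID:              2, // this node's raft ID
        ElectionTick:    10,
        HeartbeatTick:   1,
        Storage:         raft.NewMemoryStorage(),
        MaxSizePerMsg:   1 << 20,
        MaxInflightMsgs: 256,
    }
    // Bootstrap a group whose membership does NOT include ID 2: the analogue
    // of r.id missing from r.prs in the analysis above.
    n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})
    defer n.Stop()

    // An explicit campaign is what maybeCampaignOnWakeLocked guards against;
    // that era of raft did not re-check membership on this path. The lone
    // self-vote satisfies the single-voter quorum, becomeLeader runs, and the
    // lookup of this node's own (missing) progress entry fails with a nil
    // pointer dereference, which is the shape of the panic in the original
    // report.
    if err := n.Campaign(context.Background()); err != nil {
        log.Fatal(err)
    }

    // The panic fires on the raft node's internal goroutine, not inside
    // Campaign itself, so give it a moment to process the message.
    time.Sleep(time.Second)
}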

peymanh (Author) commented Oct 16, 2019

@tbg

Hi @peymanh, what was it that you did in

Today, after deploying the latest changes to the system,

Was this updating the version of CockroachDB, or deploying your app again? Am I understanding correctly that this node will crash in the same way every time you attempt to restart it?

First, we deployed our app again and our second replica of CockroachDB went mad :(
In order to solve the problem, we thought it was due to the old version of the image, so we changed it to 2.1.9. I even rebooted our second server. But changing the version didn't help either, so the problem exists in the newest version too.

Because of the healthcheck we configured for the container, it keeps trying to restart itself, but it never manages to become healthy.

tbg changed the title from "runtime error: invalid memory address or nil pointer dereference on container replica" to "storage: panic from raftgroup.Campaign call to becomeLeader" on Oct 16, 2019

tbg (Member) commented Oct 16, 2019

Thanks @peymanh. You're not able to (privately) share the data directory with us, is that correct?

To "unblock" your issue, I would suggest taking a backup of the data dir (i.e. just copy it somewhere while the node is not running and not attempting to start). Then you can "reset" n2 by decommissioning it first (steps at https://www.cockroachlabs.com/docs/stable/remove-nodes.html#step-2-mark-the-dead-node-as-decommissioned) and then removing it physically (i.e. reset the data dir for that node). You should then be able to add it back to the cluster (it will get a new NodeID), and your cluster should become healthy again.

We can then discuss next steps. I would like to see the output of ./cockroach debug range-descriptors /your/data/dir/for/that/node. You'll want to send this privately as the keys printed there can leak some private information. My email is [email protected].
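
A condensed sketch of those steps, with hypothetical host names, node ID, and paths (assuming an insecure cluster; add certificate flags otherwise):

# 1. With the roach2 container stopped, back up its data directory.
cp -a /path/to/roach2-data /path/to/roach2-data.backup

# 2. From a healthy node, mark the dead node (NodeID 2 here) as decommissioned.
cockroach node decommission 2 --insecure --host=roach1

# 3. Run the range-descriptors dump against the backed-up copy and share it privately.
./cockroach debug range-descriptors /path/to/roach2-data.backup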

bdarnell (Contributor) commented Oct 16, 2019

@bdarnell does this error ring a bell? v2.1.5 is far from the most recent 2.1 release, but I don't remember us fixing something related to this bug.

We've seen it in the distant past (#20629, fixed before the 2.0 release) but I can't think of a more recent occurrence.

The fact that it's r1 makes me wonder about bootstrapping issues. Could the cluster have bootstrapped twice due to a missing --join flag or a repeated cockroach init? (that should have resulted in a clean error earlier in the process, but maybe there were gaps in the safety checks in 2.1. I think we made this more robust in 19.1)
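
For reference, a bootstrap pattern that avoids a double init would look roughly like this (hypothetical host names; the point is that every node gets the full --join list and cockroach init runs exactly once, against a single node):

# On each of the three nodes (other flags omitted):
cockroach start --insecure --join=roach1,roach2,roach3

# Exactly once, against any one node, after all three have started:
cockroach init --insecure --host=roach1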

tbg (Member) commented Oct 16, 2019

Hmm, that's a good point. @peymanh we'll at least need to see the logs for n2 leading up to the first instance of the crash. Could you submit that to https://support.cockroachlabs.com/hc/en-us? Please mention this issue in your message. Thank you.

peymanh (Author) commented Oct 22, 2019

To "unblock" your issue, I would suggest taking a backup of the data dir (i.e. just copy it somewhere while the node is not running and not attempting to start). Then you can "reset" n2 by decommissioning it first (steps at https://www.cockroachlabs.com/docs/stable/remove-nodes.html#step-2-mark-the-dead-node-as-decommissioned) and then removing it physically (i.e. reset the data dir for that node). You should then be able to add it back to the cluster (it will get a new NodeID), and your cluster should become healthy again.

@tbg We took a backup of the data directory and then ran the decommissioning command on n1 (one of our live nodes). The problem is that after hours of waiting, the decommissioning procedure never finished and just kept printing dots.

Is there any solution other than decommissioning? We are thinking of removing our second container with docker service rm, removing its volume with docker volume prune, and finally deploying the stack again. Would that be another way to unblock our issue?

tbg (Member) commented Oct 22, 2019

Decommissioning only works when it can move the replicas to a new home. You indicated that you initially had three servers, now one is down. So you have two running but need three running. If you re-add the down node without its data directory things should normalize (and the decommissioning should finish).
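
In concrete terms, for the Docker Swarm setup described above (service, volume, and stack names are placeholders):

# Remove the crashing replica's service and wipe its data volume
# (a backup of the data directory was already taken).
docker service rm roach2
docker volume rm roach2-data

# Redeploy the stack; the node rejoins with a new NodeID, and with three
# nodes live again the stuck decommission should be able to make progress.
docker stack deploy -c docker-compose.yml mystack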

lunevalex added the A-kv-replication (Relating to Raft, consensus, and coordination.) label on Jul 28, 2020
lunevalex (Collaborator) commented

Going to close this based on the last comment from @tbg
