storage: panic from raftgroup.Campaign call to becomeLeader #41588
Comments
Hi @peymanh, what was it that you did in […]? Was this updating the version of CockroachDB, or deploying your app again? Am I understanding correctly that this node will crash in the same way every time you attempt to restart it?

Technical analysis so far: the error indicates that this node has a single-member replica that is asked to campaign, but that this replica's raft group is improperly configured (I think its […]). I'd be interested in taking a look at the data directory for this node, if this is at all possible (I assume this contains production data).
@bdarnell does this error ring a bell? v2.1.5 is far from the most recent 2.1 release, but I don't remember us fixing something related to this bug. Note that v2.1.5 already has this code:

```go
func (r *Replica) maybeCampaignOnWakeLocked(ctx context.Context) {
	// Raft panics if a node that is not currently a member of the
	// group tries to campaign. That happens primarily when we apply
	// preemptive snapshots.
	if _, currentMember := r.mu.state.Desc.GetReplicaDescriptorByID(r.mu.replicaID); !currentMember {
		return
	}
	// ...
}
```
First we deployed our app again and our second replica of Cockroach went mad :( Due to the […]
Thanks @peymanh. You're not able to (privately) share the data directory with us, is that correct? To "unblock" your issue, I would suggest taking a backup of the data dir (i.e. just copy it somewhere while the node is not running and not attempting to start). Then you can "reset" n2 by decommissioning it first (steps at https://www.cockroachlabs.com/docs/stable/remove-nodes.html#step-2-mark-the-dead-node-as-decommissioned) and then removing it physically (i.e. reset the data dir for that node). You should then be able to add it back to the cluster (it will get a new NodeID), and your cluster should become healthy again. We can then discuss next steps. I would like to see the output of […]
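For what it's worth, a rough sketch of those steps from a shell, assuming an insecure cluster, that the broken node is node ID 2, and that its store lives under /cockroach/cockroach-data (the node ID, host names, and paths are placeholders, not taken from this cluster):

```sh
# With the broken node stopped, back up its data directory first.
cp -a /cockroach/cockroach-data /cockroach/cockroach-data.bak

# From any live node, mark the dead node as decommissioned
# (node ID 2 is assumed here; check `cockroach node status` for the real ID).
cockroach node decommission 2 --insecure --host=<any-live-node>

# Watch decommissioning progress and replica counts.
cockroach node status --decommission --insecure --host=<any-live-node>

# Once decommissioned, wipe the old store so the node can rejoin with a fresh NodeID.
rm -rf /cockroach/cockroach-data
```

Adjust the flags for your deployment (e.g. certificates instead of --insecure).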
We've seen it in the distant past (#20629, fixed before the 2.0 release) but I can't think of a more recent occurrence. The fact that it's […]
Hmm, that's a good point. @peymanh we'll at least need to see the logs for n2 leading up to the first instance of the crash. Could you submit that to https://support.cockroachlabs.com/hc/en-us? Please mention this issue in your message. Thank you.
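One possible way to capture those logs for the support request, assuming the service name roach2 used earlier in this thread (the output file names are arbitrary):

```sh
# Dump the full service log for the broken node to a file that can be attached
# to the support ticket.
docker service logs roach2 --timestamps > roach2.log 2>&1

# A debug bundle gathered from any live node can also be helpful.
cockroach debug zip ./debug.zip --insecure --host=<any-live-node>
```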
@tbg We took a backup of the data directory and then went to use the decommissioning command on n1 (one of our alive nodes). The problem is that after hours of waiting, the decommissioning procedure never ended and just kept printing dot after dot. Is there any other solution than decommissioning? We are thinking of removing our second instance container by […]
Decommissioning only works when it can move the replicas to a new home. You indicated that you initially had three servers; now one is down. So you have two running but need three. If you re-add the down node without its data directory, things should normalize (and the decommissioning should finish).
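Under docker swarm, re-adding the node might look roughly like the following; the service name roach2 comes from this thread, but the volume name is a guess, so check your stack definition before running anything:

```sh
# On the host running the broken replica: with the stuck container removed,
# delete its old data volume so the node starts empty (volume name is assumed).
docker volume rm roach2-data

# Force swarm to recreate the task; the node rejoins the cluster under a new NodeID.
docker service update --force roach2

# From any live node, confirm that all three nodes are reported as live again.
cockroach node status --insecure --host=<any-live-node>
```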
Going to close this based on the last comment from @tbg.
Hi, we have 3 container replicas of cockroachdb on 3 different servers and manage them with docker swarm. Today, after deploying the latest changes to the system, our second container suddenly got stuck in the "assigned to running" state.
I tried to debug the problem by looking at the container logs. This is the result of the `docker service logs roach2 --tail 100 -f` command:
Environment:
- CockroachDB v2.1.5, three nodes on docker swarm
- Client: cockroach sql