storage: panic in raft.campaign
#20629
This appears to be related to table creation: it happens more often in multi-table tests (bank-multitable, g2, sequential), although it has been seen in the single-table register test. Nemeses shouldn't matter here because it happens before the nemeses start. In each instance I've examined, it occurs on … The stack trace indicates that at the time we initialize our raft group, raft believes the group has one member, and it's not us. This is presumably some sort of race between the table-creation splits and the initial upreplication.
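To make that failure mode concrete, here is a minimal Go sketch (with made-up types, not CockroachDB's or etcd/raft's actual code) of the invariant the stack trace points at: the raft configuration recovered from disk must contain the replica's own ID before that replica can campaign.

```go
package main

import "fmt"

// Hypothetical types for illustration only.
type replicaID uint64

type raftConfig struct {
	members map[replicaID]bool // replica IDs in the group's configuration
}

// campaign panics if the local replica is not part of the configuration it is
// trying to lead, mirroring the class of panic seen in raft.campaign.
func campaign(self replicaID, cfg raftConfig) {
	if !cfg.members[self] {
		panic(fmt.Sprintf("replica %d is not a member of %v", self, cfg.members))
	}
	fmt.Printf("replica %d campaigns for leadership\n", self)
}

func main() {
	// State as reconstructed after the crash: one member, and it's not us.
	cfg := raftConfig{members: map[replicaID]bool{1: true}}
	campaign(2, cfg) // panics, mirroring the reported crash
}
```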
Ah, there may not be a nemesis, but there's this restart. So we're killing the cluster immediately after creating a database (which creates the split here). The "initiating a split at" message is seen in the logs right before node 1 is killed in some of the instances, but not all, because jepsen uses …

I'll trace through one of the panics. We see node 2 applying many preemptive snapshots, suggesting that the initial splits all happened on node 1 before the cluster became replicated:
As the initial splitting and rebalancing finishes, node 2 gains the lease of r17 (Table/20-Max), removes a replica on node 5, then initiates a split at Table/50. It is then killed (potentially losing log messages) and restarts:
Node 5 was successfully removed and nothing else happened in its logs; I don't think it is relevant to this story.
r17 and r31 should initially have n1, n2, and n3 as members. Panics later occur on n2 and n3. After the restart, n1 gets the lease for r17 (note that this is after the split finished):
This is the only mention of r17 or r31 on any node after the restart until the panics.
Found it!

cockroach/pkg/storage/replica_raftstorage.go, lines 64 to 67 in fae9323

But after a split, the …

cockroach/pkg/storage/store.go, line 1831 in fae9323

We need to either move the call to …
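A condensed sketch of that window, using stand-in types (rangeID, batch, and the helpers below are hypothetical, not the real store code): the batch that applies the split trigger commits first, and the RHS HardState is only written afterwards in splitPostApply, so a crash in between leaves an RHS with no HardState.

```go
package main

import (
	"errors"
	"fmt"
)

type rangeID int

// batch models an atomic storage write.
type batch struct {
	writes []string
}

func (b *batch) put(key string) { b.writes = append(b.writes, key) }
func (b *batch) commit() error  { fmt.Println("commit:", b.writes); return nil }

// hardStates stands in for the per-range HardState that raft needs on startup.
var hardStates = map[rangeID]bool{}

// applySplitBuggy mirrors the pre-fix ordering: the batch applying the split
// trigger commits first, creating the right-hand range on disk, and only
// afterwards does splitPostApply write the RHS HardState. A crash between the
// two steps leaves an RHS with no HardState, and restarting cannot repair it.
func applySplitBuggy(rhs rangeID, crashBeforePostApply bool) error {
	b := &batch{}
	b.put(fmt.Sprintf("range-descriptor/%d", rhs))
	if err := b.commit(); err != nil {
		return err
	}
	if crashBeforePostApply {
		return errors.New("process killed: RHS exists on disk but has no HardState")
	}
	hardStates[rhs] = true // what splitPostApply would do
	return nil
}

func main() {
	if err := applySplitBuggy(31, true); err != nil {
		fmt.Println(err)
	}
	fmt.Println("HardState for r31 present:", hardStates[31])
}
```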
Good find! I think it should be straightforward to make the synthesis a pre-apply trigger similar to the one for SSTable ingestion.
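Reusing the stand-in batch type from the sketch above, this is roughly what the proposed pre-apply approach would look like: synthesize the HardState into the same batch that applies the split trigger, so the RHS either appears on disk with a HardState or not at all. The helper names are hypothetical, not the actual implementation.

```go
package main

import "fmt"

// Stand-in types, as in the previous sketch; names are hypothetical.
type rangeID int

type batch struct{ writes []string }

func (b *batch) put(key string) { b.writes = append(b.writes, key) }
func (b *batch) commit() error  { fmt.Println("commit:", b.writes); return nil }

// applySplitWithPreApplyTrigger sketches the proposed fix: the HardState is
// synthesized into the same batch that applies the split trigger, so a crash
// at any point either persists both writes or neither.
func applySplitWithPreApplyTrigger(rhs rangeID) error {
	b := &batch{}
	b.put(fmt.Sprintf("range-descriptor/%d", rhs))
	b.put(fmt.Sprintf("hard-state/%d", rhs)) // pre-apply: HardState rides in the same batch
	return b.commit()                        // atomic with the split trigger
}

func main() {
	if err := applySplitWithPreApplyTrigger(31); err != nil {
		fmt.Println(err)
	}
}
```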
Prior to this change, an ill-timed crash (between applying the raft command and calling splitPostApply) would leave the replica in a persistently broken state (no HardState). Found via jepsen.

Fixes cockroachdb#20629
Fixes cockroachdb#20494

Release note (bugfix): Fixed a replica corruption that could occur if a process crashed in the middle of a range split.
From #20494 (this has been seen in jepsen runs with multiple nemeses including strobe-skews and majority-ring):