storage: consistency check failure during import #36861
We don't have the diff because I messed something up during recent refactors. Running now, and then I'll bring the cluster up to recover the diff.
Currently the range seems stats-consistent and also passed enqueuing through the full consistency checker. The replicas are still n2, n4, n5 (replica IDs 1, 3, 4), so nothing has shifted. I will dig some more, but this could mean that we were looking at a false positive of sorts, or a temporary inconsistency that resolved itself. I'm also not so sure that I messed up something regarding the consistency checker diff.

What's really weird is that typically, the queue first computes a checksum without the diff. Then, when that has finished and found mismatching SHAs, it computes another round, this time with a diff, prints that, and fatals. In the logs, we're supposed to see the log message from line 226.

cockroach/pkg/storage/replica_consistency.go Lines 224 to 231 in a078fd2
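For reference, here is a minimal sketch of the two-phase flow just described. The helper names are hypothetical; this is an illustration, not the actual replica_consistency.go code.

```go
package main

import "log"

// checkResult stands in for the outcome of one consistency-check round.
type checkResult struct {
	shasMatch bool
	diff      string
}

// runCheck stands in for the checksum computation across replicas; withDiff
// controls whether the expensive key-by-key diff is also collected.
func runCheck(withDiff bool) checkResult {
	// ... compute and compare per-replica checksums ...
	return checkResult{shasMatch: false, diff: "<replica diff>"}
}

func checkConsistency(withDiff bool) {
	res := runCheck(withDiff)
	if res.shasMatch {
		return
	}
	if !withDiff {
		// First pass found mismatching SHAs: log (this is the message we
		// expected to find in the cluster logs) and re-run with a diff.
		log.Printf("consistency check failed, re-running with diff")
		checkConsistency(true /* withDiff */)
		return
	}
	// Second pass: print the diff and terminate the node.
	log.Fatalf("consistency check failed with diff:\n%s", res.diff)
}

func main() { checkConsistency(false /* withDiff */) }
```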
I don't see that in the (combined) logs from the cluster. But the stack trace shows that we are in the recursive call on line 229. A clue is that the fatal error is in its own, new log file. There aren't any past log files. It looks like the logging messed something up; we effectively know nothing about n5 (nathan-tpcc-geo:7) before the fatal; the fatal message is the very first thing we have.
I was going to suggest that the diff that was printed was so large that it forced a log file rotation and triggered a bug through that, but r590 is actually basically empty -- in Nathan's RangeStatus it has 12 system keys and that's it; now it has 16. The diff couldn't have been large. And even if it were, I don't know of any bugs regarding log rotation, especially not ones that would completely wipe out all prior log files.

cockroach/pkg/util/log/clog.go Lines 1102 to 1110 in d0f758a
I even checked Nathan's
I ran the following for
and it gets, maybe, more interesting. The ones for n=2,5 are identical. The one for n=7 is very different, which makes some sense since it's only applied up to log position 23, which is presumably where it crashed (22 is the consistency computation at GMT Tuesday, April 16, 2019 1:58:45 AM, 23 an empty entry). (The other two diffs are at 28.) Interestingly, at log 23 (on that node) there's tons of data. On the other nodes, a range deletion at index 27 (reproposed at 28) at time 1555379948806436077 (= GMT Tuesday, April 16, 2019 1:59:08 AM) nuked the range empty (save for some sys keys).

raft.2.txt

What I've done so far seems to show that … I don't know exactly what the range deletions were caused by, but my assumption would be that the node crashing failed the import, and that the import tried to clean up by nuking the keyspace empty. This happened within a minute of the crash, so we must expect inconsistencies like that to be wiped out on the regular (though usually we'll at least get the diff). This in turn indicates that we'll want to disable this behavior in our nightly tests. Or, even better, we need to take a cluster-wide RocksDB checkpoint. I know @andreimatei was looking at this; this seems high priority now.

On the other hand, since the resulting stats are consistent, this indicates that whatever the diff would have been likely didn't include the MVCCStats key. I'm adding a little TODO list in the top post to summarize the action items/things to understand. Unfortunately, however, it doesn't seem like we're ever going to be able to find out what the reported inconsistency actually was in this case.
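As an aside, here is a rough sketch of what a checkpoint-on-failure hook could look like. Everything in it (the checkpointer interface, the CreateCheckpoint signature, the directory layout) is a hypothetical stand-in rather than the actual CockroachDB engine API; the underlying idea is simply that RocksDB checkpoints are mostly hard links and therefore cheap to take at fatal time, before any cleanup can destroy the evidence.

```go
package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// checkpointer is a hypothetical stand-in for an engine that can produce a
// cheap, point-in-time on-disk checkpoint.
type checkpointer interface {
	CreateCheckpoint(dir string) error
}

// saveCheckpointBeforeFatal captures the engine state so that a subsequent
// fatal (or an import cleanup nuking the keyspace) cannot destroy the evidence.
func saveCheckpointBeforeFatal(eng checkpointer, auxDir string, rangeID int64) (string, error) {
	dir := filepath.Join(auxDir, "checkpoints", fmt.Sprintf("r%d-%d", rangeID, time.Now().Unix()))
	if err := eng.CreateCheckpoint(dir); err != nil {
		return "", err
	}
	return dir, nil
}

// fakeEngine only prints what a real engine would do.
type fakeEngine struct{}

func (fakeEngine) CreateCheckpoint(dir string) error {
	fmt.Println("checkpoint created at", dir)
	return nil
}

func main() {
	if dir, err := saveCheckpointBeforeFatal(fakeEngine{}, "/mnt/data1/cockroach/auxiliary", 590); err == nil {
		fmt.Println("preserved state in", dir)
	}
}
```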
@ajkr this one's shaping up to be another puzzler -- it's similar to that private issue regarding the time series in that the stats don't seem to have diverged.
Oh no... I think I might understand why the log file is missing:

cockroach/pkg/util/log/clog.go Lines 1318 to 1335 in d0f758a
I'd say it's plausible then to assume that we wrote a large diff and then shot ourselves in the foot by GC'ing it right away.
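To spell that hypothesis out with a sketch (hypothetical logic, not the actual clog.go code): if size-based GC deletes the oldest files of a log group until the total size is back under a limit, then a single enormous file holding the diff can cause everything older than the freshly rotated file, including the diff itself, to be deleted immediately.

```go
package main

import (
	"fmt"
	"sort"
)

type logFile struct {
	name string
	size int64 // bytes
}

// gcLogFiles deletes the oldest files until the group's total size is at most
// maxGroupSize, always keeping the newest file (the one being written to).
func gcLogFiles(files []logFile, maxGroupSize int64) []logFile {
	// Names sort chronologically in this sketch.
	sort.Slice(files, func(i, j int) bool { return files[i].name < files[j].name })
	var total int64
	for _, f := range files {
		total += f.size
	}
	i := 0
	for total > maxGroupSize && i < len(files)-1 {
		total -= files[i].size
		i++
	}
	return files[i:]
}

func main() {
	files := []logFile{
		{"cockroach.log.001", 5 << 20},   // ordinary log file
		{"cockroach.log.002", 500 << 20}, // the file holding the giant consistency diff
		{"cockroach.log.003", 1 << 10},   // freshly rotated file containing only the fatal
	}
	// Only cockroach.log.003 survives: the diff and all prior history are gone.
	fmt.Println(gcLogFiles(files, 100<<20))
}
```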
We've repeatedly wanted this to preserve state when finding replica inconsistencies. See, for example: cockroachdb#36861 Release note: None
A few runs, nothing so far. More tomorrow.
Please pick up #36867 as well, or we may get a repro that isn't actionable, like we did before.
@nvanbenschoten or @andreimatei could either of you copy the
I already killed the cluster 😬
You sure?
Oh, I guess it got recreated! Nvm then. Probably wasn't ever going to look at it again anyway.
Luckily I just hit this again on a new version of … I wasn't trying to reproduce this, which is why I didn't have any debug info enabled. All future attempts will be run with #36867 and an increased …
The only similarity I see is that, like in the first case, this consistency check failure was hit almost immediately after a split. In this case, we see the following entries:
Notice that term 6, index 11 is the first raft log term/position combo allowed. That indicates that this range had just split off from its parent range.
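For context, a small sketch of why term 6, index 11 is diagnostic. The constant values below reflect my understanding of the synthesized post-split truncated state at the time (index 10, term 5) and should be treated as assumptions, not as the actual CockroachDB constants.

```go
package main

import "fmt"

// A freshly created range starts its Raft log from a synthesized truncated
// state, so the first entry it can ever append sits right past these values
// (with the term bumped past raftInitialLogTerm by the first election).
const (
	raftInitialLogIndex = 10
	raftInitialLogTerm  = 5
)

// looksFreshlySplit reports whether the first entry present in a range's Raft
// log is the very first entry the range could ever have appended.
func looksFreshlySplit(firstIndex uint64) bool {
	return firstIndex == raftInitialLogIndex+1
}

func main() {
	fmt.Println(looksFreshlySplit(11)) // true: term 6, index 11 as seen above
	fmt.Println(looksFreshlySplit(42)) // false: the log has prior history
}
```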
I got a bunch of repros (6 before I stopped the charade). I was running 8-node clusters across 4 GCP zones, on 4-CPU machines.
More looking tomorrow.
andrei-1555535390-11-n8cpu4-geo

(n1,s1):2:

(n4,s4):3:
The checkpoints worked like a charm. On n4 (the leaseholder with the weird stats) the Raft log reveals... nothing. It's recently been truncated, so all there is is the queue activity. So let's look at what's actually in r871 on n4:
See something? The data actually in the replica has stats matching those of the follower. The leaseholder persistently has the stats messed up, so this looks like a true divergence.

Here are the logs for r871 (before the inconsistency): https://gist.github.com/tbg/779bd8813ea8dec2d45579ca9d3e8c2d

There's nothing related to n1 and n4, but a little while before the crash we remove the replica on n5, which results in this message:
This suggests that n5 was messed up similarly to n1.
andrei-1555535390-13-n8cpu4-geo

On n6 (leaseholder) the stats agree with recomputation, so there's tons of data. The Raft log is basically only log truncations and the consistency checks again. On n1 (follower), the range stats key is

i.e. very very empty. And guess what? It also agrees with recomputation (here's the dump of the data in it).
This looks like there are two different failure modes already. In the first case, the leaseholder (and probably another follower, since removed) had the same data, but diverging stats. Tentatively speaking, it seems as if the leaseholder had at least forgotten to add the stats for the data it ingested at some point, though the fact that its stats are negative indicates that more might be going on. In the second case, both replicas are internally consistent (i.e. stats match data) but one replica is missing tons of stuff. This looks more classically like what you'd expect if a replica "skipped" a raft command erroneously, similar to #35424.

It's frustrating not to have the Raft logs; I think we'd really need them to tell us what's up. @andreimatei can you disable the raft log queue and see if you can get a repro that way?
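To restate the two failure modes as checks, using a simplified, hypothetical stand-in for MVCCStats and made-up numbers:

```go
package main

import "fmt"

// mvccStats is a simplified stand-in for MVCCStats; only a few fields are
// modeled, and the values used below are made up for illustration.
type mvccStats struct {
	LiveBytes, LiveCount, KeyCount, ValCount int64
}

// internallyConsistent: do the stats persisted by a replica match a fresh
// recomputation over that replica's own data? (Failure mode 1: same data on
// all replicas, but the leaseholder's persisted stats have diverged.)
func internallyConsistent(persisted, recomputed mvccStats) bool {
	return persisted == recomputed
}

// replicasAgree: do two replicas actually hold the same data, approximated
// here by their recomputed stats? (Failure mode 2: each replica is internally
// consistent, but one is missing lots of data.)
func replicasAgree(recomputedA, recomputedB mvccStats) bool {
	return recomputedA == recomputedB
}

func main() {
	followerRecomputed := mvccStats{LiveBytes: 1 << 20, LiveCount: 4000, KeyCount: 4012, ValCount: 4012}
	leaseholderRecomputed := followerRecomputed // same data on disk
	leaseholderPersisted := mvccStats{LiveBytes: -208, LiveCount: -3, KeyCount: 12, ValCount: 12}

	fmt.Println(internallyConsistent(leaseholderPersisted, leaseholderRecomputed)) // false: mode 1
	fmt.Println(replicasAgree(leaseholderRecomputed, followerRecomputed))          // true
}
```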
BTW, in Nathan's original issue, it seems like yet another failure mode: the stats were the same, but one node was missing data (and this was then swept under the rug by the range deletions).
For another addendum, there's no clear argument that the inconsistencies immediately followed splits. In Andrei's first repro, the split happens a minute before the inconsistency is detected. Shortly after the split we see the (then) leaseholder n5 send a snapshot to n6. That snapshot is basically empty. n5 is later removed, which results in the inconsistent log line claiming to have removed a large negative number of keys. This number is just the (in-mem) stats of the range, so it got really messed up. This points at a split, as nothing else would've generated a negative delta. @andreimatei let's try flipping this assertion on

cockroach/pkg/storage/replica_raft.go Line 2021 in 1cd61f9
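This is not the actual assertion at that line, but the flavor of sanity check being discussed, sketched with a simplified stats struct and made-up values: deltas applied by individual commands may legitimately be negative (deletions), but the absolute counters must never go below zero, and a negative value is a strong hint that the accumulated deltas (e.g. from a bad split) no longer describe the on-disk data.

```go
package main

import "fmt"

// rangeStats is a simplified stand-in for the absolute (non-delta) portion of
// a range's MVCC stats.
type rangeStats struct {
	KeyCount, ValCount, LiveCount, IntentCount int64
}

// sane reports whether the absolute counters are non-negative.
func sane(s rangeStats) bool {
	return s.KeyCount >= 0 && s.ValCount >= 0 && s.LiveCount >= 0 && s.IntentCount >= 0
}

func main() {
	fmt.Println(sane(rangeStats{KeyCount: 16, ValCount: 16, LiveCount: 16}))          // true
	fmt.Println(sane(rangeStats{KeyCount: -4012, ValCount: -4012, LiveCount: -4000})) // false: "removed a negative number of keys"
}
```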
In the second repro, I can't find anything suspicious-looking; there are empty-looking snapshots right after the split, so if those are to be trusted, the range starts out empty. It's worth pointing out that since …

Need more repros! This problem is way worse than I anticipated, but since it repros so easily I doubt that we'll have trouble tracking it down given enough time. BTW @andreimatei can you always share the SHAs that you're running? Right now I don't even know if it's master or release-19.1.
I found Andrei's branch, so I'm gonna start a repro cycle of my own.
Up and running. My branches are: crdb binary at https://github.com/tbg/cockroach/tree/repro/import-geo-36861, and the command
No consistency check failure in sight.
Heh, @tbg actually picked up on the need for this invariant back when we introduced this inlining code: 9170399#diff-0f4e61b240de001f1b97a87fc31af043R341.
Agreed. The "happy case" doesn't seem like it's buying us all that much since it only applies on the leaseholder.
…ading SSTs Fixes cockroachdb#36861. This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals` were immutable over the lifetime of a Raft proposal. This invariant was violated by cockroachdb#35261, which allowed a lease index error to trigger an immediate reproposal. This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`. Combined with aliasing between multiple Raft proposals due to reproposals due to ticks, this resulted in cases where a leaseholder's Raft logs could diverge from its followers and cause Raft groups to become inconsistent. Release note: None
I'd also rather fix the bug.
36939: storage: remove optimization to use in-memory RaftCommand when sideloading SSTs r=nvanbenschoten a=nvanbenschoten Fixes #36861. This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals` were immutable over the lifetime of a Raft proposal. This invariant was violated by #35261, which allowed a lease index error to trigger an immediate reproposal. This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`. Combined with aliasing between multiple Raft proposals due to reproposals from ticks, this resulted in cases where a leaseholder's Raft logs could diverge from its followers and cause Raft groups to become inconsistent. We can justify the new cost of unmarshalling `RaftCommands` in `maybeInlineSideloadedRaftCommand` on leaseholders because https://github.com/cockroachdb/cockroach/pull/36670/files#diff-8a33b5571aa9ab154d4f3c2a16724697R230 just landed. That change removes an equally sized allocation and memory copy from the function on both leaseholders and followers. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
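A stripped-down illustration of the hazard described in that fix (not the actual CockroachDB types or code): a proposal kept in an in-memory map is assumed immutable once handed to Raft, but the reproposal path mutates it in place, so a later read of the in-memory copy (e.g. to re-inline a sideloaded SST) no longer matches the bytes that were appended to the log and replicated to followers.

```go
package main

import "fmt"

// raftCommand is a stand-in for the proposed command; in the real system this
// also carries (or references) the sideloaded SST payload.
type raftCommand struct {
	maxLeaseIndex int64
	payload       string
}

func main() {
	// In-memory proposals map, keyed by command ID.
	proposals := map[string]*raftCommand{}

	// Propose: the command is marshaled and handed to Raft. Followers will
	// only ever see this marshaled snapshot.
	cmd := &raftCommand{maxLeaseIndex: 7, payload: "sst-bytes"}
	proposals["cmd-1"] = cmd
	appendedToLog := *cmd // what actually sits in the Raft log

	// A lease-index error triggers an immediate reproposal, which mutates the
	// shared in-memory command instead of working on a copy.
	proposals["cmd-1"].maxLeaseIndex = 9

	// The leaseholder's "optimization" of reading the in-memory command back
	// (instead of unmarshalling the log entry) now yields different contents
	// than what its followers will apply.
	fmt.Println(appendedToLog.maxLeaseIndex, proposals["cmd-1"].maxLeaseIndex) // 7 9
}
```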
Reopening until we've added the geo import test to our suite. What cadence should we be running that at? It seems to tickle new behavior. From what I saw during the repros, the geo distribution causes costly reproposals. I assume we'll get value out of observing this test.
This should avoid losing the diff printed to the logs, as observed in: cockroachdb#36861 We don't necessarily care about the large diffs (though they can't hurt to have around) but about whatever else is in the deleted logs. Without logs, debugging is typically much more difficult. Release note: None
36580: roachprod: rationalize refreshing of DNS entries r=andreimatei a=andreimatei Before this patch, the DNS entries were refreshed even if we hadn't properly listed the AWS clusters, which would cause their DNS entries to be removed inadvertently. This patch lifts the decision about whether to refresh or not out of the GCE code and into the top level. Release note: None 36950: storage: disable log file size limit before consistency fatal r=andreimatei a=tbg This should avoid losing the diff printed to the logs, as observed in: #36861 We don't necessarily care about the large diffs (though they can't hurt to have around) but about whatever else is in the deleted logs. Without logs, debugging is typically much more difficult. Release note: None Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: Tobias Schottdorf <[email protected]>
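A minimal sketch of the mitigation in #36950, with a hypothetical variable standing in for the real pkg/util/log size limit: before emitting the enormous fatal message, lift the size cap so that writing the diff cannot cause older log files (including the one holding the diff) to be GC'ed away.

```go
package main

import (
	"log"
	"math"
)

// maxLogGroupSize caps the total size of a node's log files; the GC loop
// deletes old files to stay under it (see the earlier GC sketch).
var maxLogGroupSize int64 = 100 << 20

// reportConsistencyFatal writes the (potentially enormous) diff and terminates,
// first effectively disabling size-based log GC so the evidence survives.
func reportConsistencyFatal(diff string) {
	maxLogGroupSize = math.MaxInt64
	log.Fatalf("consistency check failed with diff:\n%s", diff)
}

func main() {
	reportConsistencyFatal("<replica diff would go here>")
}
```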
Closes cockroachdb#36861. This test runs a geo-distributed import in the same configuration as the one we saw cause issues with cockroachdb#36861. I'm testing now to see if this will actually trigger the consistency failure without the fix from cockroachdb#36939. Release note: None
36997: roachtest: create new import/tpcc/warehouses=4000/geo test r=nvanbenschoten a=nvanbenschoten Closes #36861. This test runs a geo-distributed import in the same configuration as the one we saw cause issues with #36861. I'm testing now to see if this will actually trigger the consistency failure without the fix from #36939. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
To do / understand

- How did nathan-tpcc-geo:7 end up rotating into a new log file to log the fatal error without retaining any prior ones? (answer: log rotation; see storage: consistency check failure during import #36861 (comment))

This looks very similar to #35424, so it's possible that that issue wasn't fully resolved. I was most of the way through a TPC-C 4k import when a node died due to a consistency check failure.

Cockroach SHA: 3ebed10

Notes:

- Cluster: nathan-tpcc-geo (stopped, extended for 48h)
- Cockroach nodes: 1,2,4,5,7,8,10,11
- Inconsistent range: r590
- Replicas: nathan-tpcc-geo:2/n2/r3, nathan-tpcc-geo:5/n4/r4, and nathan-tpcc-geo:7/n5/r1
- Inconsistent replica: nathan-tpcc-geo:7/n5/r1
- Replicas in zones: europe-west2-b, europe-west4-b, and asia-northeast1-b, respectively

Initial Investigation

Unlike in the later reproductions of #35424, replica 1's Raft log is an exact prefix of replica 3 and 4's, so this doesn't look like the same issue we saw later in that issue.

I haven't looked at much else yet.

r590 Range _ Debug _ Cockroach Console.pdf