-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: consistency check failed on cyan #29252
Comments
|
so it looks like this time one of the followers has a pending merge transaction, whereas the other follower and the leader don't have it. |
wait a minute, it's the other way around. Presumably leader and one follower have this pending txn, but not another follower? |
There's a second, identical, consistency failure in the logs later (which makes sense - once inconsistent, twice inconsistent). The transaction timestamp boils down to 180827 at 11:30:15, at which point we find this (note that the start key of r5911 is the anchor of the inconsistent txn record):
and that seems to go through, for we see all replicas disappear:
shortly thereafter we see r5911 itself get eaten:
then r982 gets eaten (note r7900s extended bounds in the last line)...
we can chase this down all the way, but I think it's fair to say that the damage has already been done at this point (in fact, the damage likely gets done pretty much at 180827 11:30:16). |
With some newly developed options to
I verified that node i has nodeID i, so our mystery transaction appears to around on nodeIDs 1-3. So my confusion above is confirmed: r32 is on 1,3,5 at the time of the crash, and it seems that 1 and 3 actually have this txn, it's just that it's "missing" on 5. Obviously it's weird and a bit concerning to have that transaction open in the first place, after all, it's been more than a day and the corresponding merge doesn't seem to be active any more? Time to look for a merge that would've been anchored on a replica on nodes 1-3. This is confusing, because as far as I can tell none of the merges I looked at so far live on these nodes. Here are a few that do, but none of them really match up with the timestamp (but there could've been retries; they're not logged). I don't really think that there's a big point in looking into these transactions. The merge above has the matching start key and time, so it's more than likely that it wrote the txn record, but not on n1-3.
|
I'm seeing this
shortly before 11:30:16 which tells me that mostly likely during the problematic merge transaction at 11:30:14 r5911 had descriptor But isn't something backwards here:
Check out the end keys, the expected one is larger than the actual one, but the expected one has a smaller generation than the actual one:
I'm not finding anything about any splits in these logs. |
Some of my analysis above might be muddled. I'm just realizing that on 180827, we didn't have #29079 yet (the SHA was e7d570e), though we did have #29067. Either way, the problem here can't be pinned on the consistency checker. We have these keys in three places, but these three places aren't the three places where it should (or should not) be. |
@benesch I've got to run out for a while (and I've also stared at this for too long). Time for you to hopefully figure it out :-) |
This was used in cockroachdb#29252 and I imagine I'll want to use it again whenever we see the consistency checker fail in the future. Release note: None
29328: cli: add hex option to debug keys r=a-robinson a=tschottdorf This was used in #29252 and I imagine I'll want to use it again whenever we see the consistency checker fail in the future. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
This was used in cockroachdb#29252 and I imagine I'll want to use it again whenever we see the consistency checker fail in the future. Release note: None
29390: backport-2.1: storage, roachtest, cli: assorted backports r=petermattis a=tschottdorf storage: remove test-only data race The test was stopping the engine before the stopper, so the compactor was sometimes able to use the engine while it was being closed. Fixes #29302. roachtest: improve TestMonitor Add two more test cases about the exit status of the `monitor` invocation itself. roachtest: fix flake in TestMonitor Accidentally changed this in #29178. cli: add hex option to debug keys This was used in #29252 and I imagine I'll want to use it again whenever we see the consistency checker fail in the future. storage: skip TestClosedTimestampCanServe It won't be the first one I'm looking into. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
Hmm, I assume this is some bug in our mostly untested debug tools. Looking. |
Oh, great. Something is broken unmarshaling keys of the following form. I doubt those are actual range descriptors, they probably just look enough like them.
|
Working around this error, we can see the history of range descriptors on r32 on node 1's dump. First of all, here's the offending txn again:
The following range descriptors are newer than that transaction:
The above basically says that r32 moved from Now the interesting range descriptor which must have been written by the offending transactions (since it's a merge and the timestamps match):
So while located on The following descriptors are older. Looking at them in chronological order (i.e. reverse from what's here) the range first lived on
|
I bet these are just the range descriptors for ranges that have been merged away. |
It turns out that a nice property of merging two ranges is that the resulting range contains the history range descriptors for both. So in dumping r32, I get to look at all of the range descriptors that were ever merged into this range. We have the inconsistent transaction record belonging to a merge at 1535369415.873258836,0 that I think has something to do with a version of a range descriptor written at the same timestamp (which is also a merge). Let's look up the range descriptor of the RHS, for which there should be a deletion at the corresponding timestamp. The merge widened the range descriptor
into
and so it should've deleted the range descriptor at
So unfortunately (or fortunately) this is all as it should be. There was definitely a merge at this timestamp that seems to have just breezed through. It's unclear why there would be a PENDING entry for it and additionally, why that entry would've gone inconsistent. There is other data in that range, so it's not just that this record happens to be the only one around. The puzzler here is that this timestamp is GMT: Monday, July 16, 2018 6:04:07 PM, aka "way too long ago":
We wouldn't expect a pending txn to stick around for that long, and having it PENDING implies that it wrote an intent (since it wouldn't otherwise write its txn record) at |
The timing of these snapshots is instructive.
We see a first preemptive snapshot for r10033 right when r10033 is merging its neighbor, which is probably a first attempt at |
Well well well. n9 is getting a preemptive snapshot for the right start key... but it applies the snapshot with a totally wrong start key:
|
Specifically it is somehow using its old end key (from the first snapshot) as the start key. o_O |
So somehow r10033 transformed itself into the right neighbor it subsumed via a snapshot. |
I'm honestly a little surprised that a consistency check is where this fails. Shouldn't some assertion fire? |
|
|
So is it likely that this snapshot
which seems to reflect the correct post-first-merge r10033, somehow addressed only the subsumed portion? Or is the error on the receiving end? |
Noticed during cockroachdb#29252. Release note: None
Ok, my mind is blown:
The outgoing snapshot looks correct and yet the incoming snapshot has the busted range descriptor! o_O |
Are you trusting this code? In particular, there's |
HA! @nvanbenschoten managed to spot the bug in like 30s of looking at this code. It's here: cockroach/pkg/storage/store_snapshot.go Line 412 in 6ff2995
We're just mutating the header range descriptor directly instead of making a copy of it. 🤦♂️ |
Hmm, |
❤️ ❤️ ❤️ Glad this got done so quickly. |
…hots Merges can cause preemptive snapshots that widen existing replicas. For example, consider the following sequence of events: 1. A replica of range A is removed from store S, but is not garbage collected. 2. Range A subsumes its right neighbor B. 3. Range A is re-added to store S. In step 3, S will receive a preemptive snapshot for A that requires widening its existing replica, thanks to the intervening merge. Problematically, the code to check whether this widening was possible, in Store.canApplySnapshotLocked, was incorrectly mutating the range descriptor in the snapshot header! Applying the snapshot would then fail to clear all of the data from the old incarnation of the replica, since the bounds on the range deletion tombstone were wrong. This often resulted in replica inconsistency. Plus, the in-memory copy of the range descriptor would be incorrect until the next descriptor update--though this usually happened quickly, as the replica would apply the change replicas command, which updates the descriptor, soon after applying the preemptive snapshot. To fix the problem, teach Store.canApplySnapshotLocked to make a copy of the range descriptor before it mutates it. To prevent regressions, add an assertion that a range's start key is never changed to the descriptor update path. With this assertion in place, but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower reliably fails. Fixes cockroachdb#29252. Release note: None
29646: roachtest: better merge testing in clearrange r=benesch a=tschottdorf Unfortunately, the method to determine the range count is quite slow since crdb_internal.ranges internally sends an RPC for each range to determine the leaseholder. Anecdotally, I've seen ~25% of the merges completed after less than 15 minutes. I know that it's slowing down over time, but @benesch will fix that. Also throws in aggressive consistency checks so that when something goes out of sync, we find out right there. Release note: None 29677: storage: preserve consistency when applying widening preemptive snapshots r=benesch a=benesch Merges can cause preemptive snapshots that widen existing replicas. For example, consider the following sequence of events: 1. A replica of range A is removed from store S, but is not garbage collected. 2. Range A subsumes its right neighbor B. 3. Range A is re-added to store S. In step 3, S will receive a preemptive snapshot for A that requires widening its existing replica, thanks to the intervening merge. Problematically, the code to check whether this widening was possible, in Store.canApplySnapshotLocked, was incorrectly mutating the range descriptor in the snapshot header! Applying the snapshot would then fail to clear all of the data from the old incarnation of the replica, since the bounds on the range deletion tombstone were wrong. This often resulted in replica inconsistency. Plus, the in-memory copy of the range descriptor would be incorrect until the next descriptor update--though this usually happened quickly, as the replica would apply the change replicas command, which updates the descriptor, soon after applying the preemptive snapshot. To fix the problem, teach Store.canApplySnapshotLocked to make a copy of the range descriptor before it mutates it. To prevent regressions, add an assertion that a range's start key is never changed to the descriptor update path. With this assertion in place, but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower reliably fails. Fixes #29252. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]> Co-authored-by: Nikhil Benesch <[email protected]>
…hots Merges can cause preemptive snapshots that widen existing replicas. For example, consider the following sequence of events: 1. A replica of range A is removed from store S, but is not garbage collected. 2. Range A subsumes its right neighbor B. 3. Range A is re-added to store S. In step 3, S will receive a preemptive snapshot for A that requires widening its existing replica, thanks to the intervening merge. Problematically, the code to check whether this widening was possible, in Store.canApplySnapshotLocked, was incorrectly mutating the range descriptor in the snapshot header! Applying the snapshot would then fail to clear all of the data from the old incarnation of the replica, since the bounds on the range deletion tombstone were wrong. This often resulted in replica inconsistency. Plus, the in-memory copy of the range descriptor would be incorrect until the next descriptor update--though this usually happened quickly, as the replica would apply the change replicas command, which updates the descriptor, soon after applying the preemptive snapshot. To fix the problem, teach Store.canApplySnapshotLocked to make a copy of the range descriptor before it mutates it. To prevent regressions, add an assertion that a range's start key is never changed to the descriptor update path. With this assertion in place, but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower reliably fails. Fixes cockroachdb#29252. Release note: None
We saw a consistency failure in cockroachdb#29252 that would've been much more useful had it occurred close to the time around which the inconsistency must have been introduced. Instead of leaving it to chance, add a switch that runs aggressive checks in (roach) tests that want them such as the clearrange test. Release note: None
…hots Merges can cause preemptive snapshots that widen existing replicas. For example, consider the following sequence of events: 1. A replica of range A is removed from store S, but is not garbage collected. 2. Range A subsumes its right neighbor B. 3. Range A is re-added to store S. In step 3, S will receive a preemptive snapshot for A that requires widening its existing replica, thanks to the intervening merge. Problematically, the code to check whether this widening was possible, in Store.canApplySnapshotLocked, was incorrectly mutating the range descriptor in the snapshot header! Applying the snapshot would then fail to clear all of the data from the old incarnation of the replica, since the bounds on the range deletion tombstone were wrong. This often resulted in replica inconsistency. Plus, the in-memory copy of the range descriptor would be incorrect until the next descriptor update--though this usually happened quickly, as the replica would apply the change replicas command, which updates the descriptor, soon after applying the preemptive snapshot. To fix the problem, teach Store.canApplySnapshotLocked to make a copy of the range descriptor before it mutates it. To prevent regressions, add an assertion that a range's start key is never changed to the descriptor update path. With this assertion in place, but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower reliably fails. Fixes cockroachdb#29252. Release note: None
We saw a consistency failure in cockroachdb#29252 that would've been much more useful had it occurred close to the time around which the inconsistency must have been introduced. Instead of leaving it to chance, add a switch that runs aggressive checks in (roach) tests that want them such as the clearrange test. Release note: None
https://sentry.io/cockroach-labs/cockroachdb/issues/666877375/
The text was updated successfully, but these errors were encountered: