storage: chaos during merges leads to overlapping range descriptors #33120
@benesch any idea?

The intents seem to be OK:
toblerone> I181213 12:11:16.684105 89378 storage/batcheval/cmd_end_transaction.go:466 [n1,s1,r35/1:/Table/53/1/1{64-72}] resolving /Local/Range/Table/53/1/164/RangeDescriptor
That leaves me without a hypothesis on how this would've happened.
I suppose the next step would be a test that aggressively kills nodes during merges. I realized that n2 is being killed after the first merge, not n1. In particular, it seems that it only fully applies the merge trigger for the first merge after its first restart. Its second restart's timing seems irrelevant (except that it highlights the overlap), as it's 2s after the second merge only, so everything has synced by then.

The setup to reproduce, then, is a 3-node cluster with a failpoint "during" the application of the merge trigger on one of the nodes. The failpoint would allow one merge trigger and crash on the second. By auto-restarting this node, we can thus stress this scenario.
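As a concrete illustration of the kind of failpoint meant here, a minimal Go sketch (not CockroachDB code; the package, counter, and hook point are all invented) that lets the first merge trigger through and crashes the process on the second, so an auto-restarting node keeps exercising the window between the two merges:

```go
// Hypothetical failpoint sketch (not actual CockroachDB code): allow the
// first merge-trigger application to proceed and crash the process on the
// second, so an auto-restarting node repeatedly hits the window between
// the two merges described above.
package failpoint

import (
	"log"
	"os"
	"sync/atomic"
)

// mergeTriggerApplications counts merge triggers applied by this process.
var mergeTriggerApplications int64

// maybeCrashOnSecondMergeTrigger would be called from wherever a replica
// applies a merge trigger (an invented hook point for this sketch).
func maybeCrashOnSecondMergeTrigger(rangeID int64) {
	if n := atomic.AddInt64(&mergeTriggerApplications, 1); n >= 2 {
		log.Printf("failpoint: crashing while applying merge trigger #%d on r%d", n, rangeID)
		// Exit with 137 so a restart loop keyed on that code (as in the
		// scripts later in this thread) keeps bringing the node back up.
		os.Exit(137)
	}
}
```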
Hmm, this sounds like a possible outcome of #33312 actually.

Are you sure? That had a very specific presentation in which a transaction anchored on the rangelog would have a split trigger for a different range attached. I suppose the same thing could instead happen with the merges, but in that case I would expect to see a transaction anchored on the rangelog with a merge trigger attached. I don't see a way in which it would cause a bad merge to actually get committed?

You're right, my pattern matching wasn't thorough enough. Thanks for the heads up!

For future reference, the parent SHA this was seen on is 1cd7dcf (Dec 11). I think one explanation of this bug could be at the RocksDB level, i.e. something like a5410eb. However, I'm not aware of anything like that having been fixed after Dec 11.

(cc @petermattis for potential RocksDB things I've missed)

I perused the changes between 1cd7dcf and current master and didn't see anything that could be related to RocksDB corruption.
I have a branch that runs a little experiment -- merging 1000 ranges single-node with a failpoint. Two merges are running at the time (I'm allowing up to 8), though not anywhere "close" to each other. The error is very weird:

Of course r160 is on s1, though it has been deleted (confirmed by 1547740774.750011000,0 /Meta2/Table/52/1/139: [/Table/52/1/138, /Table/52/1/139)).

There's an old intent lying around, supposedly from a time in which r160 was still around and merging other replicas. The range lookups seem to get confused by this intent and take it as gospel, but never try to resolve it (or read "under it"). cc @nvanbenschoten (here's the diff).

Here's the script:

```bash
#!/usr/bin/env bash
killall -9 cockroach
make build
rm -rf cockroach-data
./cockroach start --insecure --background
echo "
SET CLUSTER SETTING kv.range_merge.queue_interval = '1ns';
SET CLUSTER SETTING kv.range_merge.queue_enabled = false;
CREATE TABLE IF NOT EXISTS data (id INT PRIMARY KEY);
ALTER TABLE data SPLIT AT SELECT i FROM generate_series(1, 1000) AS g(i);
SET CLUSTER SETTING kv.range_merge.queue_enabled = true;
" | ./cockroach sql --insecure
killall -9 cockroach
retval=137
while [ $retval -eq 137 ];
do
./cockroach start --insecure --logtostderr --host 127.0.0.1
retval=$?
done
```
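To make "take it as gospel" vs. "read under it" concrete, here is a toy, self-contained Go model (invented types and range IDs, not the real meta2/range-lookup code) of the two lookup behaviors described above the script:

```go
// Toy model (not CockroachDB code) of the lookup behavior described above:
// a meta2 entry can carry both a committed range descriptor and a stale,
// unresolved intent left over from an old merge transaction. Trusting the
// intent returns the long-deleted range; reading "under" it (the committed
// value) returns the live one.
package main

import "fmt"

type rangeDesc struct {
	rangeID  int
	startKey string
	endKey   string
}

type meta2Entry struct {
	committed *rangeDesc // latest committed descriptor
	intent    *rangeDesc // provisional value of an unresolved transaction, if any
}

// lookupTrustingIntent models "take the intent as gospel".
func lookupTrustingIntent(e meta2Entry) *rangeDesc {
	if e.intent != nil {
		return e.intent
	}
	return e.committed
}

// lookupUnderIntent models "read under the intent": ignore (or first
// resolve) the provisional value and use the committed descriptor.
func lookupUnderIntent(e meta2Entry) *rangeDesc {
	return e.committed
}

func main() {
	e := meta2Entry{
		// The committed descriptor's range ID is made up for the example;
		// the intent's r160 is the deleted range from the error above.
		committed: &rangeDesc{rangeID: 161, startKey: "/Table/52/1/138", endKey: "/Table/52/1/139"},
		intent:    &rangeDesc{rangeID: 160, startKey: "/Table/52/1/138", endKey: "/Table/52/1/139"},
	}
	fmt.Printf("trusting the intent: r%d\n", lookupTrustingIntent(e).rangeID)
	fmt.Printf("reading under it:    r%d\n", lookupUnderIntent(e).rangeID)
}
```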
Similar story if I use a 4-node cluster (where only n1 has the failpoint):

```bash
#!/usr/bin/env bash
set -euo pipefail
killall -9 cockroach || true
make build
rm -rf cockroach-data* || true
for i in 0 1 2 3; do
./cockroach start --max-offset 10ms --insecure --host 127.0.0.1 --port $((26257+i)) --http-port $((8080+i)) --background --store "cockroach-data${i}" --join 127.0.0.1:26257
if [ $i -eq 0 ]; then ./cockroach init --insecure; fi
done
echo "
SET CLUSTER SETTING kv.range_merge.queue_interval = '1ns';
SET CLUSTER SETTING kv.range_merge.queue_enabled = false;
CREATE TABLE IF NOT EXISTS data (id INT PRIMARY KEY);
ALTER TABLE data SPLIT AT SELECT i FROM generate_series(1, 1000) AS g(i);
SET CLUSTER SETTING kv.range_merge.queue_enabled = true;
" | ./cockroach sql --insecure
./cockroach quit || true
sleep 1
retval=137
while [ $retval -eq 137 ];
do
set +e
./cockroach start --max-offset 10ms --insecure --logtostderr --host 127.0.0.1 --store cockroach-data0
retval=$?
set -e
echo "exit code: $retval"
done
```
We are running into this issue on a production cluster as well. Is there any quick workaround?
Should this be upgraded from an S-2 in light of #35559?
Yes
Migrating discussion from #33010 (comment)
When the merge of r31 and r32 was processed, r31's EndKey was extended to Max at the same time that r32 was deleted. So if this was done non-atomically before the first restart, we'd have seen a crash then too. Either the merge was processed atomically and then r32's descriptor was somehow recreated, or it was processed non-atomically after the restart. Without a snapshot (ruled out in the other thread), I don't see how the descriptor could be recreated, so I'll focus on the second possibility.

There are subtleties around node restarts during a merge. The subsume command tells the leaseholder to disallow any new commands on the RHS until the merge commits or aborts. This is an in-memory operation, so it is lost on restart. Each new leaseholder checks for this intent when they acquire the lease, which is how we ensure the subsuming state is not lost on restart. There are quite a few layers to this logic that have the potential to go subtly wrong, but I don't see any obvious problems or connections to the symptoms in this case.

On restart, we scan the store and recreate Replica objects one at a time. Is it possible that something happened when we created the Replica for r31 before r32's Replica existed? I don't think so, because we don't start up the raft scheduler goroutines until after all the replicas are created. There are a couple of years-old TODOs with my name on them here:

cockroach/pkg/storage/store.go, lines 1328 to 1330 at a21dde6
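To pin down the atomicity argument, here is a toy, self-contained Go model (invented types and keys, nothing from the actual merge-trigger code) of the property being relied on: the widened LHS descriptor and the deletion of the RHS descriptor go into one batch that is applied atomically, so no replica should ever observe the widened r31 alongside a live r32.

```go
// Toy model (not CockroachDB code): the merge trigger writes the widened
// LHS descriptor and deletes the RHS descriptor in a single batch that is
// applied atomically, so a crash or replay cannot surface one without the
// other.
package main

import "fmt"

type kv map[string]string

// batch collects puts and deletes that must become visible together.
type batch struct {
	puts map[string]string
	dels []string
}

// commit applies the whole batch to the store at once.
func (b batch) commit(store kv) {
	for k, v := range b.puts {
		store[k] = v
	}
	for _, k := range b.dels {
		delete(store, k)
	}
}

func main() {
	store := kv{
		"r31/RangeDescriptor": "[a, b)",
		"r32/RangeDescriptor": "[b, c)",
	}
	mergeTrigger := batch{
		puts: map[string]string{"r31/RangeDescriptor": "[a, c)"}, // widen the LHS
		dels: []string{"r32/RangeDescriptor"},                    // delete the RHS
	}
	mergeTrigger.commit(store)
	fmt.Println(store) // only the widened r31 descriptor remains
}
```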
I think the first one is obsolete, but the second might be relevant here. n2 will start back up with an unapplied merge in its raft log for r31, but won't automatically pick it up and apply it until something else wakes up the raft group. Similarly, r32 will stick around in memory and on disk until r31 wakes up.

If r32 wakes up first, it will try to start a raft election. It will hear from the other nodes that r32 no longer exists, and send it to the replica GC queue. What happens if we try to replica GC r32 while r31 still has a pending merge? It's tricky:

cockroach/pkg/storage/replica_gc_queue.go, lines 297 to 307 at a21dde6
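As a rough illustration of why this is tricky, here is a toy check-then-act model (invented names, not the real replica_gc_queue.go logic): the GC decision is based on whether the LHS has applied the merge, but the LHS can catch up between that check and the destruction of the RHS.

```go
// Toy illustration (not the real replica GC code) of the check-then-act
// hazard: we decide to GC the RHS while the LHS has not yet applied the
// merge, but by the time we act, the LHS has caught up and now owns the
// keyspace we are about to destroy.
package main

import (
	"fmt"
	"sync"
	"time"
)

type lhsReplica struct {
	mu           sync.Mutex
	mergeApplied bool // true once the LHS has absorbed the RHS's keyspace
}

func (r *lhsReplica) hasAppliedMerge() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.mergeApplied
}

func main() {
	lhs := &lhsReplica{}

	// GC queue: check first...
	appliedAtCheck := lhs.hasAppliedMerge()

	// ...while the LHS wakes up and applies the merge trigger concurrently.
	go func() {
		lhs.mu.Lock()
		lhs.mergeApplied = true
		lhs.mu.Unlock()
	}()
	time.Sleep(10 * time.Millisecond) // stand-in for queue processing delay

	// ...then act on the stale answer.
	if !appliedAtCheck && lhs.hasAppliedMerge() {
		fmt.Println("race: chose to GC the RHS before the merge applied, but the LHS now owns that keyspace")
	}
}
```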
That check looks racy: if the LHS applied the merge while the RHS was in the replica GC queue, I think we could find that the LHS caught up while we were waiting and we're about to delete data that isn't ours. I'm not sure if there's other locking here that prevents this. However, this doesn't match the symptoms we're seeing here, so if it's a problem it's a separate one.

How about the second merge? Is there any way that the second merge could happen before we've applied the first one? (Recall that the ranges in order are r33, r31, r32. First we merged r31 and r32, then we merged r33 with the post-merge r31.) We know from the logs that n2 held the lease on r33 when the decision to merge was made, so this merge was not concurrent with either restart. It is, however, possible that r31 wasn't awake yet.

The merge process issues a subsume command to the r31 leaseholder, blocking future writes and obtaining the lease index of the last applied write. This lease index must be after the first merge committed. Then we verify that all replicas of r31 have applied that lease index before the merge commits. Lease indexes are almost, but not quite, monotonic. When we update the lease index, we don't fsync the write, so it could regress on restart until we catch up on our raft log again. That could almost explain what we see here: n2 applied the first merge, then "un-applied" it by resetting and losing unsynced writes, which then allowed the second merge to apply before the first one, stranding r32's range descriptor. The problem with this theory is that the second merge didn't happen until after the first restart, so it never had a chance to observe the state that was lost by that restart.
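Here is a minimal, self-contained model (all names invented; not the actual Subsume/WaitForApplication code) of the mechanism just described: a lease applied index that is not fsynced can be observed by the second merge's pre-commit check and then regress when the node restarts.

```go
// Toy model (not CockroachDB code) of an unsynced lease applied index
// regressing across a crash after WaitForApplication has already vouched
// for it, which is the kind of window that lets a later merge commit
// before the earlier one is durably applied everywhere.
package main

import "fmt"

type replicaState struct {
	leaseAppliedIndex int64 // written without fsync as commands apply
	durableIndex      int64 // what actually survives a crash
}

// waitForApplication models asking every replica whether it has applied
// the given lease index; it reads the possibly-unsynced value.
func waitForApplication(replicas []*replicaState, idx int64) bool {
	for _, r := range replicas {
		if r.leaseAppliedIndex < idx {
			return false
		}
	}
	return true
}

// crash models a restart: unsynced progress is lost.
func crash(r *replicaState) { r.leaseAppliedIndex = r.durableIndex }

func main() {
	n2 := &replicaState{}

	// n2 applies the first merge (say, the command at lease index 10)
	// but never syncs it.
	n2.leaseAppliedIndex = 10

	// The second merge verifies all replicas reached index 10 and is
	// therefore allowed to commit.
	fmt.Println("second merge may commit:", waitForApplication([]*replicaState{n2}, 10))

	// n2 crashes before syncing: its lease applied index regresses.
	crash(n2)
	fmt.Println("n2's lease applied index after restart:", n2.leaseAppliedIndex)
}
```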
A couple other random observations that I haven't been able to fit into any theory:
For future reference, here is the experiment I mentioned above: |
I think I have a timeline that fits the facts.
I think the simplest solution here is for WaitForApplication to sync RocksDB before returning. For the user cluster that managed to get into this state, it should be safe to delete the offending range descriptor, but to avoid consistency checker violations we need to copy over the MVCC deletion timestamp from the surviving replica.
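In model form (self-contained, invented names; just to locate where the proposed sync sits), the fix is to make the state durable before WaitForApplication vouches for it:

```go
// Toy model (not CockroachDB code) of the shape of the fix: sync before
// reporting an index as applied, so a crash cannot roll back what the
// merge commit check already relied on.
package main

import "fmt"

type replica struct {
	appliedIndex int64 // in-memory / unsynced
	durableIndex int64 // survives a crash
}

// sync models flushing unsynced state to disk (e.g. via a synced no-op write).
func (r *replica) sync() { r.durableIndex = r.appliedIndex }

// waitForApplication reports whether the replica has durably reached idx,
// syncing first so the answer stays true across a restart.
func waitForApplication(r *replica, idx int64) bool {
	if r.appliedIndex < idx {
		return false
	}
	r.sync()
	return true
}

func main() {
	r := &replica{appliedIndex: 10}
	fmt.Println(waitForApplication(r, 10), r.durableIndex) // true 10
}
```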
Fantastic sleuthing @bdarnell! This makes sense to me. Do you have cycles to follow through with the resolution and a test, or do you want that taken off your plate?
(regardless, I plan to dust off the failpoint binary and put the crash in the right place) |
I have time to throw an untested fix over the wall but I'm not sure if I'll be able to test it. And I definitely won't have time for a recovery tool.

Logs have arrived: https://drive.google.com/drive/u/1/folders/1o_o3WQO8bYnigWfyMqrTjGjnQmKgQKAO
This prevents on-disk inconsistencies when a node crashes in the middle of a merge and the lease applied index temporarily regresses.

Fixes cockroachdb#33120

Release note (bug fix): Fixed an on-disk inconsistency that could result from a crash during a range merge.
35626: storage: Sync to disk before returning in WaitForApplication r=tbg a=bdarnell

This prevents on-disk inconsistencies when a node crashes in the middle of a merge and the lease applied index temporarily regresses.

Fixes #33120

Release note (bug fix): Fixed an on-disk inconsistency that could result from a crash during a range merge.

The second commit is a manual test (adapted from @tbg) that demonstrates the "overlapping range" failure if the WriteSyncNoop call is removed, and also shows that this call fixes the problem. It's obviously not mergeable in its current state, but I'm not sure how to test this without some gross hacks.

Co-authored-by: Ben Darnell <[email protected]>
See #33010 (comment).
A failure was observed in the acceptance/bank/node-restart roachtest (it does not reproduce readily).
Essentially two merges are carried out with interleaved reboots of nodes. After the second merge, a node comes up and sees a replica that should've been wiped off its disk after the first merge.
Full logs attached. See the issue above for initial analysis (which will continue here).
Unit_Tests_roachtest_15561_artifacts.zip