storage: chaos during merges leads to overlapping range descriptors #33120

Closed

tbg opened this issue Dec 13, 2018 · 24 comments · Fixed by #35626

Labels
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-2-temp-unavailability: Temp crashes or other availability problems. Can be worked around or resolved by restarting.

Comments

tbg commented Dec 13, 2018

See #33010 (comment).

A failure was observed in the acceptance/bank/node-restart test (it does not reproduce readily).

Essentially, two merges are carried out with interleaved reboots of nodes. After the second merge, a node comes up and sees a replica that should've been wiped off its disk after the first merge.

Full logs attached. See the issue above for initial analysis (which will continue here).

Unit_Tests_roachtest_15561_artifacts.zip

tbg added the S-2-temp-unavailability label Dec 13, 2018

tbg commented Dec 13, 2018

@benesch any idea?

tbg commented Dec 13, 2018

The intents seem to be OK:

toblerone> I181213 12:11:16.684105 89378 storage/batcheval/cmd_end_transaction.go:466 [n1,s1,r35/1:/Table/53/1/1{64-72}] resolving /Local/Range/Table/53/1/164/RangeDescriptor
toblerone> I181213 12:11:16.684125 89378 storage/batcheval/cmd_end_transaction.go:466 [n1,s1,r35/1:/Table/53/1/1{64-72}] resolving /Local/Range/Table/53/1/172/RangeDescriptor
toblerone> I181213 12:11:16.684192 89378 storage/batcheval/cmd_end_transaction.go:506 [n1,s1,r35/1:/Table/53/1/1{64-72}] external intents on merge [/Meta2/Table/53/1/172 /Meta2/Table/53/1/232 /Table/13/1/2018-12-13T12:11:16.673152Z/408396354948464641/0 /Table/13/1/2018-12-13T12:11:16.673152Z/408396354948464641/2/1 /Table/13/1/2018-12-13T12:11:16.673152Z/408396354948464641/3/1 /Table/13/1/2018-12-13T12:11:16.673152Z/408396354948464641/4/1 /Table/13/1/2018-12-13T12:11:16.673152Z/408396354948464641/5/1 /Table/13/1/2018-12-13T12:11:16.673152Z/408396354948464641/6/1]
left r35:/Table/53/1/{164-232} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=1]
right r31:/Table/53/1/{172-232} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=0]

That leaves me without a hypothesis on how this would've happened.

tbg commented Dec 13, 2018

I suppose the next step would be a test that aggressively kills nodes during merges. I realized that n2 is being killed after the first merge, not n1. In particular, it seems that it only fully applies the merge trigger for the first merge after its first restart. The timing of its second restart seems irrelevant (except that it highlights the overlap), as it occurs only 2s after the second merge, so everything has synced by then.

6ce48ec89eae> I181211 13:53:37.160789 196 storage/store.go:2561  [n1,s1,r31/1:/Table/53/1/{525-710}] removing replica r32/1
6ce48ec89eae> I181211 13:53:37.161239 171 storage/store.go:2561  [n3,s3,r31/2:/Table/53/1/{525-710}] removing replica r32/2
6ce48ec89eae> I181211 13:53:38.055117 40 gossip/gossip.go:391  [n2] NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:26263" > attrs:<> locality:<tiers:<key:"region" value:"local" > tiers:<key:"zone" value:"local" > > ServerVersion:<major_val:2 minor_val:1 patch:0 unstable:3 > build_tag:"v2.2.0-alpha.20181119-665-g4aa84a5" started_at:1544536418054903781
6ce48ec89eae> I181211 13:53:38.969250 157 storage/store.go:2561  [n2,s2,r31/3:/Table/53/1/{525-710}] removing replica r32/3
6ce48ec89eae> I181211 13:53:43.491190 100 gossip/gossip.go:391  [n2] NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:26263" > attrs:<> locality:<tiers:<key:"region" value:"local" > tiers:<key:"zone" value:"local" > > ServerVersion:<major_val:2 minor_val:1 patch:0 unstable:3 > build_tag:"v2.2.0-alpha.20181119-665-g4aa84a5" started_at:1544536423490993013
6ce48ec89eae> E181211 13:53:43.503144 1 cli/error.go:229  cockroach server exited with error: failed to start store: [n2,s2]: cannot addReplicaInternalLocked; range [n2,s2,r32/3:/{Table/53/1/7…-Max}] has overlapping range r33:/{Table/53/1/266-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=1]
Error: cockroach server exited with error: failed to start store: [n2,s2]: cannot addReplicaInternalLocked; range [n2,s2,r32/3:/{Table/53/1/7…-Max}] has overlapping range r33:/{Table/53/1/266-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=1]

The setup to reproduce this, then, is a three-node cluster with a failpoint "during" the application of the merge trigger on one of the nodes. The failpoint would allow one merge trigger to apply and crash on the second. By auto-restarting this node, we can stress this scenario.
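
Roughly, the failpoint could look like the sketch below -- a hypothetical hook called from wherever a replica applies a merge trigger. The package, env var, and counter are all illustrative scaffolding, not the real CockroachDB plumbing:

package failpoint

import (
    "os"
    "strconv"
    "sync/atomic"
)

var mergeTriggerCount int64

// MaybeCrashOnMergeTrigger is meant to be called from the merge-trigger
// application path. With COCKROACH_CRASH_ON_MERGE_N=2, the first merge
// trigger applies cleanly and the process dies while applying the second.
func MaybeCrashOnMergeTrigger() {
    n, err := strconv.ParseInt(os.Getenv("COCKROACH_CRASH_ON_MERGE_N"), 10, 64)
    if err != nil || n <= 0 {
        return // failpoint disabled
    }
    if atomic.AddInt64(&mergeTriggerCount, 1) >= n {
        // Exit with 137 (128+SIGKILL) so a wrapper script can keep
        // restarting the node for as long as the failpoint keeps firing.
        os.Exit(137)
    }
}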

tbg added the C-bug and O-qa labels and removed the O-qa label Dec 13, 2018

tbg commented Jan 8, 2019

Hmm, this sounds like a possible outcome of #33312 actually.

tbg closed this as completed Jan 8, 2019

benesch commented Jan 8, 2019

Are you sure? That had a very specific presentation in which a transaction anchored on the rangelog would have a split trigger for a different range attached. I suppose the same thing could instead happen with the merges, but in that case I would expect to see a transaction anchored on the rangelog with a merge trigger attached. I don't see a way in which it would cause a bad merge to actually get committed?

tbg commented Jan 8, 2019

You're right, my pattern matching wasn't thorough enough. Thanks for the heads up!

tbg reopened this Jan 8, 2019

tbg commented Jan 17, 2019

For future reference, the parent SHA this was seen on is 1cd7dcf (Dec 11). I think one explanation of this bug could be at the RocksDB level, i.e. something like a5410eb. However, I'm not aware of anything like that having been fixed after Dec 11.
Either way, I'm going to set up a small experiment that kills a node repeatedly while it's merging ranges.

tbg commented Jan 17, 2019

(cc @petermattis for potential RocksDB things I've missed)

petermattis (Collaborator) commented:

I perused the changes between 1cd7dcf and current master and didn't see anything that could be related to RocksDB corruption.

tbg commented Jan 17, 2019

I have a branch that runs a little experiment -- merging 1000 ranges single-node with a failpoint (kill -9) added in the merge trigger. So far it hasn't turned up this particular bug, but it did highlight what seems like a deadlock (not necessarily in merges):

I190117 16:02:26.056195 243 server/status/runtime.go:464  [n1] runtime stats: 95 MiB RSS, 179 goroutines, 90 MiB/31 MiB/136 MiB GO alloc/idle/total, 30 MiB/38 MiB CGO alloc/total, 0.0 CGO/sec, 0.0/0.0 %(u/s)time, 0.0 %gc (9x), 17 KiB/17 KiB (r/w)net
I190117 16:02:36.065239 243 server/status/runtime.go:464  [n1] runtime stats: 120 MiB RSS, 179 goroutines, 100 MiB/21 MiB/136 MiB GO alloc/idle/total, 41 MiB/50 MiB CGO alloc/total, 738.3 CGO/sec, 3.1/1.7 %(u/s)time, 0.0 %gc (0x), 8.6 KiB/9.7 KiB (r/w)net
I190117 16:02:46.078499 243 server/status/runtime.go:464  [n1] runtime stats: 132 MiB RSS, 180 goroutines, 110 MiB/11 MiB/136 MiB GO alloc/idle/total, 42 MiB/52 MiB CGO alloc/total, 322.3 CGO/sec, 3.1/1.7 %(u/s)time, 0.0 %gc (0x), 2.3 KiB/2.9 KiB (r/w)net
I190117 16:02:56.089083 243 server/status/runtime.go:464  [n1] runtime stats: 139 MiB RSS, 180 goroutines, 77 MiB/39 MiB/136 MiB GO alloc/idle/total, 42 MiB/52 MiB CGO alloc/total, 283.1 CGO/sec, 2.9/1.6 %(u/s)time, 0.0 %gc (1x), 5.1 KiB/6.3 KiB (r/w)net
I190117 16:03:06.098099 243 server/status/runtime.go:464  [n1] runtime stats: 140 MiB RSS, 180 goroutines, 87 MiB/32 MiB/136 MiB GO alloc/idle/total, 42 MiB/52 MiB CGO alloc/total, 292.1 CGO/sec, 2.9/1.6 %(u/s)time, 0.0 %gc (0x), 4.0 KiB/5.6 KiB (r/w)net
I190117 16:03:16.037327 239 gossip/gossip.go:557  [n1] gossip status (ok, 1 node)
gossip client (0/3 cur/max conns)
gossip server (0/3 cur/max conns, infos 0/0 sent/received, bytes 0B/0B sent/received)
gossip connectivity
  n1 [sentinel];
E190117 16:03:16.095094 148 storage/queue.go:844  [n1,merge,s1,r526/1:/Table/52/1/{504-767}] failed to send RPC: sending to all 1 replicas failed; last error: (err: r789 was not found on s1) <nil>
E190117 16:03:16.096225 258 storage/queue.go:844  [n1,merge,s1,r163/1:/Table/52/1/{141-767}] failed to send RPC: sending to all 1 replicas failed; last error: (err: r789 was not found on s1) <nil>
E190117 16:03:16.096297 136 storage/queue.go:844  [n1,merge,s1,r24/1:/Table/52{-/1/139}] failed to send RPC: sending to all 1 replicas failed; last error: (err: r160 was not found on s1) <nil>
I190117 16:03:16.096538 2545 storage/replica_command.go:394  [n1,merge,s1,r161/1:/Table/52/1/1{39-41}] initiating a merge of r163:/Table/52/1/{141-767} [(n1,s1):1, next=2, gen=7] into this range (lhs+rhs has (size=0 B+0 B qps=0.00+0.05 --> 0.00qps) below threshold (size=0 B, qps=0.00))
I190117 16:03:16.107239 243 server/status/runtime.go:464  [n1] runtime stats: 140 MiB RSS, 182 goroutines, 98 MiB/22 MiB/136 MiB GO alloc/idle/total, 41 MiB/52 MiB CGO alloc/total, 290.2 CGO/sec, 2.6/1.4 %(u/s)time, 0.0 %gc (0x), 2.8 KiB/3.4 KiB (r/w)net
I190117 16:03:19.097088 1685 internal/client/txn.go:632  [n1] async rollback failed: failed to send RPC: sending to all 1 replicas failed; last error: (err: r789 was not found on s1) <nil>
I190117 16:03:19.097140 2544 internal/client/txn.go:632  [n1] async rollback failed: failed to send RPC: sending to all 1 replicas failed; last error: (err: r160 was not found on s1) <nil>
I190117 16:03:26.117306 243 server/status/runtime.go:464  [n1] runtime stats: 141 MiB RSS, 179 goroutines, 108 MiB/13 MiB/136 MiB GO alloc/idle/total, 41 MiB/52 MiB CGO alloc/total, 335.2 CGO/sec, 3.0/1.6 %(u/s)time, 0.0 %gc (0x), 2.7 KiB/3.2 KiB (r/w)net
I190117 16:03:36.127573 243 server/status/runtime.go:464  [n1] runtime stats: 144 MiB RSS, 179 goroutines, 116 MiB/5.1 MiB/136 MiB GO alloc/idle/total, 40 MiB/52 MiB CGO alloc/total, 289.0 CGO/sec, 2.7/1.5 %(u/s)time, 0.0 %gc (0x), 3.7 KiB/5.0 KiB (r/w)net
E190117 16:03:37.086367 1070 storage/consistency_queue.go:128  [n1,consistencyChecker,s1,r161/1:/Table/52/1/1{39-41}] aborted in distSender: context deadline exceeded
E190117 16:03:37.086477 1070 storage/queue.go:844  [n1,consistencyChecker,s1,r161/1:/Table/52/1/1{39-41}] aborted in distSender: context deadline exceeded
I190117 16:03:46.138752 243 server/status/runtime.go:464  [n1] runtime stats: 150 MiB RSS, 178 goroutines, 124 MiB/60 MiB/202 MiB GO alloc/idle/total, 40 MiB/53 MiB CGO alloc/total, 488.3 CGO/sec, 2.5/1.5 %(u/s)time, 0.0 %gc (0x), 2.1 KiB/2.7 KiB (r/w)net
I190117 16:03:56.149750 243 server/status/runtime.go:464  [n1] runtime stats: 159 MiB RSS, 178 goroutines, 132 MiB/53 MiB/202 MiB GO alloc/idle/total, 39 MiB/53 MiB CGO alloc/total, 253.9 CGO/sec, 2.6/1.4 %(u/s)time, 0.0 %gc (1x), 4.9 KiB/10 KiB (r/w)net
I190117 16:04:06.160678 243 server/status/runtime.go:464  [n1] runtime stats: 160 MiB RSS, 178 goroutines, 83 MiB/98 MiB/202 MiB GO alloc/idle/total, 39 MiB/53 MiB CGO alloc/total, 261.4 CGO/sec, 2.3/1.4 %(u/s)time, 0.0 %gc (0x), 33 KiB/23 KiB (r/w)net

Two merges are running at the time (I'm allowing up to 8), though not anywhere "close" to each other.
I've also disabled "proactive" intent resolution on EndTxn (i.e. concurrent reads are expected to clean up the intents).

The error is very weird:

I190117 16:08:16.103639 2545 storage/replica_command.go:394 [n1,merge,s1,r161/1:/Table/52/1/1{39-41}] initiating a merge of r163:/Table/52/1/{141-767} [(n1,s1):1, next=2, gen=7] into this range (lhs+rhs has (size=0 B+0 B qps=0.00+0.03 --> 0.00qps) below threshold (size=0 B, qps=0.00))
E190117 16:08:16.104266 11176 storage/queue.go:844 [n1,merge,s1,r24/1:/Table/52{-/1/139}] failed to send RPC: sending to all 1 replicas failed; last error: (err: r160 was not found on s1)

Of course r160 was on s1, but it has since been deleted (confirmed by debug range-descriptors). In the meta ranges, I see this:

1547740774.750011000,0 /Meta2/Table/52/1/139: [/Table/52/1/138, /Table/52/1/139)
Raw:r160:/Table/52/1/13{8-9} [(n1,s1):1, next=2, gen=1]

0.000000000,0 /Meta2/Table/52/1/140: 1547740804.529613000,0 {Txn:id:648be93a-d9f3-4b05-b054-fba17333fd23 key:"\001k\022\274\211\366\212\000\001rdsc" timestamp:<wall_time:1547740804529613000 > priority:1411087 sequence:9  Timestamp:1547740804.530158000,1 Deleted:false KeyBytes:12 ValBytes:32 RawBytes:[] IntentHistory:[] MergeTimestamp:<nil> XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}

1547740804.530158000,1 /Meta2/Table/52/1/140: [/Table/52/1/138, /Table/52/1/140)
        Raw:r160:/Table/52/1/1{38-40} [(n1,s1):1, next=2, gen=2]

There's an old intent lying around, supposedly from a time in which r160 was still around and merging other replicas. The range lookups seem to get confused by this intent and take it as gospel, but never try to resolve it (or read "under it"). cc @nvanbenschoten (here's the diff).
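
For reference, here is a toy sketch (made-up types, not the real RangeLookup code) of the behavior the lookup needs here: an unresolved intent on a meta2 key should at most contribute an additional candidate descriptor, never replace the committed one.

package sketch

// rangeDesc is a stand-in for a range descriptor as stored in meta2.
type rangeDesc struct {
    rangeID          int
    startKey, endKey string
}

// metaEntry models a meta2 key that carries both a committed descriptor and a
// provisional one written by an in-flight transaction (the intent).
type metaEntry struct {
    committed   *rangeDesc // last committed descriptor, if any
    provisional *rangeDesc // descriptor from an unresolved intent, if any
}

// candidateDescriptors returns the descriptors a lookup should consider,
// committed value first. Taking only the provisional descriptor "as gospel"
// is what produces "r160 was not found on s1" above: the intent still names a
// range that has since been merged away.
func candidateDescriptors(e metaEntry) []*rangeDesc {
    var out []*rangeDesc
    if e.committed != nil {
        out = append(out, e.committed)
    }
    if e.provisional != nil {
        out = append(out, e.provisional)
    }
    return out
}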

Here's the script:

#!/usr/bin/env bash

killall -9 cockroach
make build
rm -rf cockroach-data
./cockroach start --insecure --background
echo "
SET CLUSTER SETTING kv.range_merge.queue_interval = '1ns';
SET CLUSTER SETTING kv.range_merge.queue_enabled = false;
CREATE TABLE IF NOT EXISTS data (id INT PRIMARY KEY);
ALTER TABLE data SPLIT AT SELECT i FROM generate_series(1, 1000) AS g(i);
SET CLUSTER SETTING kv.range_merge.queue_enabled = true;
" | ./cockroach sql --insecure
killall -9 cockroach

retval=137
while [ $retval -eq 137 ];
do
  ./cockroach start --insecure --logtostderr --host 127.0.0.1
  retval=$?
done

tbg commented Jan 17, 2019

Disabling that intent resolution hack (and putting the queue on steroids, etc.), this just breezes through:

(screenshot omitted)

No sign of any on-disk corruption. I'm going to go back to the logs in the original failure to try to extract anything else about the crash.

tbg commented Jan 17, 2019

Similar story if I use a 4-node cluster (where only n1 has the failpoint):

(screenshot omitted)

#!/usr/bin/env bash
set -euo pipefail

killall -9 cockroach || true
make build
rm -rf cockroach-data* || true
for i in 0 1 2 3; do
  ./cockroach start --max-offset 10ms --insecure --host 127.0.0.1 --port $((26257+i)) --http-port $((8080+i)) --background --store "cockroach-data${i}" --join 127.0.0.1:26257
  if [ $i -eq 0 ]; then ./cockroach init --insecure; fi
done
echo "
SET CLUSTER SETTING kv.range_merge.queue_interval = '1ns';
SET CLUSTER SETTING kv.range_merge.queue_enabled = false;
CREATE TABLE IF NOT EXISTS data (id INT PRIMARY KEY);
ALTER TABLE data SPLIT AT SELECT i FROM generate_series(1, 1000) AS g(i);
SET CLUSTER SETTING kv.range_merge.queue_enabled = true;
" | ./cockroach sql --insecure

./cockroach quit || true
sleep 1


retval=137
while [ $retval -eq 137 ];
do
  set +e
  ./cockroach start --max-offset 10ms --insecure --logtostderr --host 127.0.0.1 --store cockroach-data0
  retval=$?
  set -e
  echo "exit code: $retval"
done

lamb84 commented Mar 8, 2019

We are running into this issue on a production cluster as well.

E190308 17:48:33.285462 1 cli/error.go:230  cockroach server exited with error: failed to start store: [n4,s4]: cannot addReplicaInternalLocked; range [n4,s4,r88340/2:/Table/16572/1/"\x8{c3\…-d:7…}] has overlapping range r136958:/Table/16572/1/"\x8{b\xb1\xfcB\x8d'J\xe2\xba&@\xa8\xc44\xadA"-eA6\xdf\xe2\xc4N;\x86\x88\x82\x14V\xa7O\x06"} [(n11,s11):1, (n4,s4):2, (n10,s10):3, next=4, gen=0]
Error: cockroach server exited with error: failed to start store: [n4,s4]: cannot addReplicaInternalLocked; range [n4,s4,r88340/2:/Table/16572/1/"\x8{c3\…-d:7…}] has overlapping range r136958:/Table/16572/1/"\x8{b\xb1\xfcB\x8d'J\xe2\xba&@\xa8\xc44\xadA"-eA6\xdf\xe2\xc4N;\x86\x88\x82\x14V\xa7O\x06"} [(n11,s11):1, (n4,s4):2, (n10,s10):3, next=4, gen=0]
Failed running "start"

Is there any quick workaround?

tbg commented Mar 9, 2019

This happened in the wild in #35559.
@lamb84, that issue is your incident, right?

awoods187 (Contributor) commented:

Should this be upgraded from an S-2 in light of #35559?

tbg commented Mar 9, 2019 via email

bdarnell (Contributor) commented:

Migrating discussion from #33010 (comment)

  1. cluster merges r31 [525-710) and r32 [710-Max)
  2. n2 processes it, removing replica r32/3
  3. n2 gets killed and restarts
  4. cluster merges r33 [266-525) and r31 [525-Max)
  5. n2 processes it, removing replica r31/3
  6. n2 restarts and finds r32/3 [710-Max), which should have been GC'ed, as well as r33 [266-Max) (which is supposed to be there).

My reading of this is that the first time n2 got killed, somehow data for r32 stuck around. This didn't become obvious during
the first restart, but then the second merge created a replica overlapping that keyspace which triggered the fatal error after
another restart.

When the merge of r31 and r32 was processed, r31's EndKey was extended to Max at the same time that r32 was deleted. So if this was done non-atomically before the first restart, we'd have seen a crash then too. Either the merge was processed atomically and then r32's descriptor was somehow recreated, or it was processed non-atomically after the restart. Without a snapshot (ruled out in the other thread), I don't see how the descriptor could be recreated, so I'll focus on the second possibility.

There are subtleties around node restarts during a merge. The subsume command tells the leaseholder to disallow any new commands on the RHS until the merge commits or aborts. This is an in-memory operation, so it is lost on restart. Each new leaseholder checks for the merge transaction's intent on the local range descriptor when it acquires the lease, which is how we ensure the subsuming state is not lost on restart. There are quite a few layers to this logic that have the potential to go subtly wrong, but I don't see any obvious problems or connections to the symptoms in this case.
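
As a toy illustration of that recovery path (made-up types and names, not the actual lease acquisition code):

package sketch

// localDescriptor is a stand-in for what a replica can read about itself.
type localDescriptor struct {
    hasDeletionIntent bool // a merge txn has provisionally deleted this range
}

// onLeaseAcquired re-derives the "subsumed" state, which otherwise lives only
// in the old leaseholder's memory, by checking for an unresolved deletion
// intent on the range's own descriptor. If one is found, the range stays
// frozen until the merge transaction is observed to commit or abort.
func onLeaseAcquired(desc localDescriptor, freezeUntilMergeResolves func()) {
    if desc.hasDeletionIntent {
        freezeUntilMergeResolves()
    }
}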

On restart, we scan the store and recreate Replica objects one at a time. Is it possible that something happened when we created the Replica for r31 before r32's Replica existed? I don't think so, because we don't start up the raft scheduler goroutines until after all the replicas are created. There are a couple of years-old TODOs with my name on them here:

cockroach/pkg/storage/store.go

Lines 1328 to 1330 in a21dde6

// TODO(bdarnell): Also initialize raft groups when read leases are needed.
// TODO(bdarnell): Scan all ranges at startup for unapplied log entries
// and initialize those groups.

I think the first one is obsolete, but the second might be relevant here. n2 will start back up with an unapplied merge in its raft log for r31, but won't automatically pick it up and apply it until something else wakes up the raft group. Similarly, r32 will stick around in memory and on disk until r31 wakes up.
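
The second TODO would amount to something like the following at startup -- a toy sketch under the assumption that "waking up" a group just means scheduling it; the names are illustrative:

package sketch

// replicaState is a stand-in for the per-replica raft state on disk.
type replicaState struct {
    rangeID      int64
    lastIndex    uint64 // highest raft log index persisted on disk
    appliedIndex uint64 // highest index applied to the state machine
}

// wakeUnappliedReplicas nudges any group whose raft log extends beyond its
// applied state, so an unapplied merge trigger (like r31's here) gets applied
// promptly instead of waiting for outside traffic to wake the group.
func wakeUnappliedReplicas(replicas []replicaState, scheduleRaftReady func(rangeID int64)) {
    for _, r := range replicas {
        if r.lastIndex > r.appliedIndex {
            scheduleRaftReady(r.rangeID)
        }
    }
}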

If r32 wakes up first, it will try to start a raft election. It will hear from the other nodes that r32 no longer exists, and send it to the replica GC queue. What happens if we try to replica GC r32 while r31 still has a pending merge? It's tricky:

// This case is tricky. This range has been merged away, so it is likely
// that we can GC this replica, but we need to be careful. If this store has
// a replica of the subsuming range that has not yet applied the merge
// trigger, we must not GC this replica.
//
// We can't just ask our local left neighbor whether it has an unapplied
// merge, as if it's a slow follower it might not have learned about the
// merge yet! What we can do, though, is check whether the generation of our
// local left neighbor matches the generation of its meta2 descriptor. If it
// is generationally up-to-date, it has applied all splits and merges, and
// it is thus safe to remove this replica.

This looks racy: if the LHS applied the merge while the RHS was in the replica GC queue, I think we could find that the LHS caught up while we were waiting and we're about to delete data that isn't ours. I'm not sure if there's other locking here that prevents this. However, this doesn't match the symptoms we're seeing here, so if it's a problem it's a separate one.

How about the second merge? Is there any way that the second merge could happen before we've applied the first one? (recall that the ranges in order are r33, r31, r32. First we merged r31 and r32, then we merged r33 with the post-merge r31).

We know from the logs that n2 held the lease on r33 when the decision to merge was made, so this merge was not concurrent with either restart. It is, however, possible that r31 wasn't awake yet. The merge process issues a subsume command to the r31 leaseholder, blocking future writes and obtaining the lease index of the last applied write. This lease index must be after the first merge committed. Then we verify that all replicas of r31 have applied that lease index before the merge commits.

Lease indexes are almost, but not quite, monotonic. When we update the lease index, we don't fsync the write, so it could regress on restart until we catch up on our raft log again. That could almost explain what we see here: n2 applied the first merge, then "un-applied" it by resetting and losing unsynced writes, which then allowed the second merge to apply before the first one, stranding r32's range descriptor. The problem with this theory is that the second merge didn't happen until after the first restart, so it never had a chance to observe the state that was lost by that restart.
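
To make the "almost, but not quite, monotonic" part concrete, here is a toy model of the hazard (none of this is CockroachDB code):

package sketch

// replicaDisk models a replica's applied state: the current value may live
// only in the OS cache, while the synced value is what survives a crash.
type replicaDisk struct {
    appliedIndex       uint64
    syncedAppliedIndex uint64
}

// apply bumps the lease applied index; unless the write was batched with a
// MustSync raft.Ready, the bump is not made durable.
func (d *replicaDisk) apply(index uint64, mustSync bool) {
    d.appliedIndex = index
    if mustSync {
        d.syncedAppliedIndex = index
    }
}

// crashAndRestart loses unsynced writes: the applied index can regress below
// a value that was already reported to WaitForApplication.
func (d *replicaDisk) crashAndRestart() {
    d.appliedIndex = d.syncedAppliedIndex
}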

bdarnell (Contributor) commented:

A couple other random observations that I haven't been able to fit into any theory:

  • n1 attempted a different merge involving r33 as RHS, which apparently failed or blocked but started just before the one with r33 as LHS did.
  • While there were no snapshots applied on n2, there was a raft-initiated snapshot sent from n1 to n3 in the middle of the merge.
$ grep -hE '(merge|starting grpc|removing repl|r32|r33|r31)' */*.log|cut -c2-|sort
...
181211 13:53:35.765480 185 storage/replica.go:5041  [n3,s3,r32/2:{-}] allowing MsgAppResp for uninitialized replica (25 > 20 ticks)
181211 13:53:35.774729 338 storage/queue.go:836  [n1,raftsnapshot,s1,r32/1:/{Table/53/1/7…-Max}] snapshot failed: (n3,s3):2: remote couldn't accept Raft snapshot ae7cffc5 at applied index 14 with error: [n3,s3],r32: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s3,r31/2:/{Table/53/1/5…-Max}] (incoming /{Table/53/1/710-Max})
181211 13:53:35.833198 357 storage/replica_command.go:492  [n1,merge,s1,r31/1:/Table/53/1/{525-710}] initiating a merge of r32:/{Table/53/1/710-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=0] into this range
181211 13:53:36.743253 434 storage/store_snapshot.go:762  [n1,raftsnapshot,s1,r32/1:/{Table/53/1/7…-Max}] sending Raft snapshot dfb0e3bf at applied index 17
181211 13:53:36.743650 434 storage/store_snapshot.go:805  [n1,raftsnapshot,s1,r32/1:/{Table/53/1/7…-Max}] streamed snapshot to (n3,s3):2: kv pairs: 304, log entries: 7, rate-limit: 8.0 MiB/sec, 0.00s
181211 13:53:36.744004 488 storage/replica_raftstorage.go:809  [n3,s3,r32/2:/{Table/53/1/7…-Max}] applying Raft snapshot at index 17 (id=dfb0e3bf, encoded size=8986, 1 rocksdb batches, 7 log entries)
181211 13:53:36.748331 488 storage/replica_raftstorage.go:815  [n3,s3,r32/2:/{Table/53/1/7…-Max}] applied Raft snapshot in 4ms [clear=0ms batch=0ms entries=0ms commit=4ms]
181211 13:53:37.160789 196 storage/store.go:2561  [n1,s1,r31/1:/Table/53/1/{525-710}] removing replica r32/1
181211 13:53:37.161239 171 storage/store.go:2561  [n3,s3,r31/2:/Table/53/1/{525-710}] removing replica r32/2
181211 13:53:38.066412 40 server/server.go:1528  [n2] starting grpc/postgres server at 127.0.0.1:26263
181211 13:53:38.969250 157 storage/store.go:2561  [n2,s2,r31/3:/Table/53/1/{525-710}] removing replica r32/3
181211 13:53:41.313794 301 storage/replica_command.go:492  [n1,merge,s1,r22/1:/Table/53{-/1/266}] initiating a merge of r33:/Table/53/1/{266-525} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=0] into this range
181211 13:53:41.319876 94 storage/replica_command.go:492  [n2,merge,s2,r33/3:/Table/53/1/{266-525}] initiating a merge of r31:/{Table/53/1/525-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=2] into this range
181211 13:53:41.364031 206 storage/store.go:2561  [n3,s3,r33/2:/Table/53/1/{266-525}] removing replica r31/2
181211 13:53:41.364059 176 storage/store.go:2561  [n2,s2,r33/3:/Table/53/1/{266-525}] removing replica r31/3
181211 13:53:41.364146 180 storage/store.go:2561  [n1,s1,r33/1:/Table/53/1/{266-525}] removing replica r31/1
181211 13:53:43.503144 1 cli/error.go:229  cockroach server exited with error: failed to start store: [n2,s2]: cannot addReplicaInternalLocked; range [n2,s2,r32/3:/{Table/53/1/7…-Max}] has overlapping range r33:/{Table/53/1/266-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=1]
rror: cockroach server exited with error: failed to start store: [n2,s2]: cannot addReplicaInternalLocked; range [n2,s2,r32/3:/{Table/53/1/7…-Max}] has overlapping range r33:/{Table/53/1/266-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=1]

tbg commented Mar 11, 2019

For future reference, here is the experiment I mentioned above:

https://github.com/cockroachdb/cockroach/compare/master...tbg:roachtest/merge-chaos?expand=1

bdarnell (Contributor) commented:

The problem with this theory is that the second merge didn't happen until after the first restart, so it never had a chance to observe the state that was lost by that restart.

I think I have a timeline that fits the facts.

  1. n1 initiates the first merge (r31+r32)
  2. During the merge txn, a delete intent is written to the r32 descriptor. The lease index LI of this write is the one returned by Subsume (on the leaseholder n1).
  3. n2 applies the delete intent, without syncing because it was not batched with a MustSync raft.Ready.
  4. n1 executes WaitForApplication(LI). All followers, including n2, return success.
  5. n2 crashes and loses unsynced writes. Its applied lease index regresses and the delete intent is no longer present after the restart.
  6. The merge commits (via an EndTransaction with MergeTrigger evaluated on r31). The EndTransaction resolves its local intents, so its write batch includes a delete of the MVCCMetadata for the intent.
  7. n2 processes the raft log for r31 and applies the write batch. It has not yet caught up on the r32 raft log, so the intent MVCCMetadata is not present. This part of the write batch is a no-op, but the merge trigger applies as normal. Our in-memory state now correctly represents the result of the merge, but there is a range descriptor left behind on disk that is inconsistent with what we have in memory!
  8. n2 restarts a second time and discovers the conflicts between the on-disk range descriptors.

I think the simplest solution here is for WaitForApplication to sync RocksDB before returning. For the user cluster that managed to get into this state, it should be safe to delete the offending range descriptor, but to avoid consistency checker violations we need to copy over the MVCC deletion timestamp from the surviving replica.
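
A sketch of the shape of that fix, with a stand-in Engine interface rather than the real storage API (the actual change in #35626 achieves the sync with a synced no-op write):

package sketch

import (
    "context"
    "time"
)

// Engine is a stand-in for the storage engine, not CockroachDB's real API.
type Engine interface {
    AppliedLeaseIndex() uint64
    Sync() error // flush unsynced writes to stable storage
}

// waitForApplication reports success only once the requested lease applied
// index has been reached AND that state is durable. Without the final sync, a
// follower could acknowledge application and then crash, letting its applied
// index regress below the acknowledged value (step 5 in the timeline above).
func waitForApplication(ctx context.Context, eng Engine, leaseIndex uint64) error {
    for eng.AppliedLeaseIndex() < leaseIndex {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(5 * time.Millisecond):
        }
    }
    return eng.Sync()
}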

tbg commented Mar 11, 2019

Fantastic sleuthing, @bdarnell! This makes sense to me. Do you have cycles to follow through with a resolution and test, or do you want that to be taken off your plate?

tbg commented Mar 11, 2019

(regardless, I plan to dust off the failpoint binary and put the crash in the right place)

bdarnell (Contributor) commented:

I have time to throw an untested fix over the wall but I'm not sure if I'll be able to test it. And I definitely won't have time for a recovery tool.

tim-o commented Mar 11, 2019

bdarnell added a commit to bdarnell/cockroach that referenced this issue Mar 12, 2019
This prevents on-disk inconsistencies when a node crashes in the
middle of a merge and the lease applied index temporarily regresses.

Fixes cockroachdb#33120

Release note (bug fix): Fixed an on-disk inconsistency that could
result from a crash during a range merge.
craig bot pushed a commit that referenced this issue Mar 12, 2019
35626: storage: Sync to disk before returning in WaitForApplication r=tbg a=bdarnell

This prevents on-disk inconsistencies when a node crashes in the
middle of a merge and the lease applied index temporarily regresses.

Fixes #33120

Release note (bug fix): Fixed an on-disk inconsistency that could
result from a crash during a range merge.


The second commit is a manual test (adapted from @tbg) that demonstrates the "overlapping range" failure if the WriteSyncNoop call is removed, and also shows that this call fixes the problem. It's obviously not mergeable in its current state, but I'm not sure how to test this without some gross hacks. 

Co-authored-by: Ben Darnell <[email protected]>
craig bot closed this as completed in #35626 Mar 12, 2019
bdarnell added a commit to bdarnell/cockroach that referenced this issue Mar 14, 2019
This prevents on-disk inconsistencies when a node crashes in the
middle of a merge and the lease applied index temporarily regresses.

Fixes cockroachdb#33120

Release note (bug fix): Fixed an on-disk inconsistency that could
result from a crash during a range merge.