storage: replicated into unavailability #36025

Closed
tbg opened this issue Mar 21, 2019 · 15 comments

@tbg
Member

tbg commented Mar 21, 2019

Apparently there are situations in which ranges become unavailable with only one dead node. Assuming a replication factor of three and a range that is initially fully replicated, no rebalancing decision that results in a group size of two should ever be carried out (since we always up-replicate before down-replicating).

For example, below we can see a range that was at one point 5x replicated and then got shrunk to 2x, at which point it became unavailable because one of its replicas resides on a dead node. This kind of replication decision simply amounts to a bug.

Access to replication changes is granted relatively freely via RelocateRange. The algorithm in RelocateRange isn't aware that it might be doing harm, and it is likely what did the harm here, perhaps triggered through merges, though I wasn't able to figure out the specifics without really digging in.

My initial analysis was this:

I just looked at your cluster and it looks the same - we're in a two-replica config and one of the peers is dead. The range log shows that the range split, then grew to five replicas, and then went down to two, all through "admin requests". This is storagepb.ReasonAdminRequest, which is used when a ChangeReplicas is sent to the replica in a Batch. In production code, the only code path that does this is Store.AdminRelocateRange, which - nasty code, but I'll save that rant - is only called by the merge queue, relocateNode (I assume you're not manually relocating things, so that one is out), and the StoreRebalancer. So look at whether the merge queue or the store rebalancer did something to that range.

Hrm, the merge queue might try to relocate the range without printing anything to the logs (there's nothing interesting there):

if !replicaSetsEqual(lhsDesc.Replicas, rhsDesc.Replicas) {
    var targets []roachpb.ReplicationTarget
    for _, lhsReplDesc := range lhsDesc.Replicas {
        targets = append(targets, roachpb.ReplicationTarget{
            NodeID: lhsReplDesc.NodeID, StoreID: lhsReplDesc.StoreID,
        })
    }
    lease, _ := lhsRepl.GetLease()
    for i := range targets {
        if targets[i].NodeID == lease.Replica.NodeID && targets[i].StoreID == lease.Replica.StoreID {
            if i > 0 {
                targets[0], targets[i] = targets[i], targets[0]
            }
            break
        }
    }
    // TODO(benesch): RelocateRange can sometimes fail if it needs to move a replica
    // from one store to another store on the same node.
    if err := mq.store.DB().AdminRelocateRange(ctx, rhsDesc.StartKey, targets); err != nil {
        return err
    }
}

RelocateRange doesn't look like it's very smart about staying out of dangerous replication configs. It compiles a list of adds and removes it needs to carry out. In this case, we initially have five replicas (why that is is a separate question). Let's assume the merge queue wants to colocate with another range which has three replicas, and say only the first two overlap. To colocate, it thus has to remove three replicas and add one. Unfortunately, the algorithm first performs whichever kind of operation has more steps left (on a tie, it prefers to add). So it will do a removal, and another removal, but then it should do an addition (yet we see a third removal). If it tried to colocate with another range that only had two replicas, we'd get the third removal. (The cluster only has 7 nodes, so there's no way that the other range overlapped less with three replicas.)

Then the question becomes: why does that other range have two replicas? The left neighboring range that r2948 originally split from doesn't exist any more, so it must've been merged away at some point, making it reasonable to assume that it might've tried to merge its right neighbor (r2948), too.
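
To make that ordering concrete, here's a toy sketch of the "do whichever kind of change has more steps left, prefer adds on a tie" loop as described above (hypothetical helper, not the actual RelocateRange code):

// relocateOrder is a toy model of the ordering described above: it drains
// whichever list has more entries remaining, preferring an add on ties.
// With one add and three removes, the first two steps are removals, which
// is exactly the shrink-before-grow behavior observed here.
func relocateOrder(adds, removes []string) []string {
    var ops []string
    for len(adds) > 0 || len(removes) > 0 {
        if len(adds) >= len(removes) {
            ops = append(ops, "add "+adds[0])
            adds = adds[1:]
        } else {
            ops = append(ops, "remove "+removes[0])
            removes = removes[1:]
        }
    }
    return ops
}

// relocateOrder([]string{"n4"}, []string{"n1", "n2", "n3"})
// => [remove n1, remove n2, add n4, remove n3]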

Either way, there's something to file here. You should try disabling the merge queue in what you're testing (assuming it isn't relevant; did we see merges during the inconsistency you're chasing?) to see if the problem goes away. Taking a bit of a step back, this is just yet another instance of how brittle our replication code is: lots of places get to change the replication config, but only some of them actually consider the impact on availability.

[image: range log screenshot showing the range growing to five replicas and then shrinking to two via admin requests]

We need to make sure that no replication change that would result in straightforward unavailability is permitted. All of these checks currently live in the allocator, which tries not to suggest such changes. However, RelocateRange intentionally does not use the allocator, because it does not want to be bound by constraints. This area of the code needs a general reworking.
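
As a sketch of the kind of guard I mean (isLive is a hypothetical liveness callback; the real check would have to live wherever replication changes are actually applied, not just in the allocator):

// wouldLoseQuorum reports whether the proposed post-change replica set would
// lack a live raft majority. Sketch only; isLive stands in for whatever
// liveness information the real check would consult.
func wouldLoseQuorum(proposed []roachpb.ReplicaDescriptor, isLive func(roachpb.NodeID) bool) bool {
    live := 0
    for _, r := range proposed {
        if isLive(r.NodeID) {
            live++
        }
    }
    return live < len(proposed)/2+1 // need a strict majority of live replicas
}

Any code path about to commit a replication change could compute the post-change replica set and refuse to proceed when this returns true.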

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting labels Mar 21, 2019
@tbg tbg added this to the 19.2 milestone Mar 21, 2019
@ajwerner
Contributor

I'm going to add logging around RelocateRange and work on reproducing.

@ajwerner ajwerner self-assigned this Mar 25, 2019
@ajwerner
Contributor

The good news is that this is readily reproducible. We start with three replicas and we want to add one and remove one. For reasons I don't yet understand, we quickly add and then remove the new replica (on n4) and then proceed to remove n6 as planned, leaving us with 2 replicas.

ip-172-31-39-197> I190325 20:29:07.272294 16158 storage/replica_command.go:1119  [n6,s6,r43/5:/Table/55/1/65{08/6/…-45/5/…}] AdminRelocateRange: desc: r43:/Table/55/1/65{08/6/-1062-45/5/-461} [(n6,s6):5, (n9,s9):6, (n7,s7):4, next=7, gen=1], targets: [n9,s9 n7,s7 n4,s4], addTargets: [(n4,s4):?], removeTargets: [(n6,s6):?] 
ip-172-31-39-197> I190325 20:29:07.272743 16158 storage/store_snapshot.go:775  [n6,s6,r43/5:/Table/55/1/65{08/6/…-45/5/…}] sending preemptive snapshot e59cc9fe at applied index 36                               
ip-172-31-39-197> I190325 20:29:07.272852 16158 storage/store_snapshot.go:818  [n6,s6,r43/5:/Table/55/1/65{08/6/…-45/5/…}] streamed snapshot to (n4,s4):?: kv pairs: 11, log entries: 0, rate-limit: 8.0 MiB/sec, 0.00s
ip-172-31-37-199> I190325 20:29:07.273155 36722 storage/replica_raftstorage.go:814  [n4,s4,r43/?:{-}] applying preemptive snapshot at index 36 (id=e59cc9fe, encoded size=846, 1 rocksdb batches, 0 log entries)  
ip-172-31-37-199> I190325 20:29:07.274950 36722 storage/replica_raftstorage.go:820  [n4,s4,r43/?:/Table/55/1/65{08/6/…-45/5/…}] applied preemptive snapshot in 2ms [clear=0ms batch=0ms entries=0ms commit=2ms]   
ip-172-31-39-197> I190325 20:29:07.275399 16158 storage/replica_command.go:788  [n6,s6,r43/5:/Table/55/1/65{08/6/…-45/5/…}] change replicas (ADD_REPLICA (n4,s4):7): read existing descriptor r43:/Table/55/1/65{08/6/-1062-45/5/-461} [(n6,s6):5, (n9,s9):6, (n7,s7):4, next=7, gen=1]
ip-172-31-39-197> I190325 20:29:07.287446 16158 storage/replica_raft.go:394  [n6,s6,r43/5:/Table/55/1/65{08/6/…-45/5/…}] proposing ADD_REPLICA((n4,s4):7): updated=[(n6,s6):5 (n9,s9):6 (n7,s7):4 (n4,s4):7] next=8
ip-172-31-34-186> I190325 20:29:07.301543 14436 storage/replica_command.go:788  [n9,replicate,s9,r43/6:/Table/55/1/65{08/6/…-45/5/…}] change replicas (REMOVE_REPLICA (n4,s4):7): read existing descriptor r43:/Table/55/1/65{08/6/-1062-45/5/-461} [(n6,s6):5, (n9,s9):6, (n7,s7):4, (n4,s4):7, next=8, gen=1]
ip-172-31-34-186> I190325 20:29:07.307470 14436 storage/replica_raft.go:394  [n9,s9,r43/6:/Table/55/1/65{08/6/…-45/5/…}] proposing REMOVE_REPLICA((n4,s4):7): updated=[(n6,s6):5 (n9,s9):6 (n7,s7):4] next=8      
ip-172-31-37-199> I190325 20:29:07.317582 36748 storage/store.go:2467  [n4,replicaGC,s4,r43/7:/Table/55/1/65{08/6/…-45/5/…}] removing replica r43/7                                                               
ip-172-31-37-199> I190325 20:29:07.320198 36748 storage/replica_destroy.go:150  [n4,replicaGC,s4,r43/7:/Table/55/1/65{08/6/…-45/5/…}] removed 6 (0+6) keys in 3ms [clear=0ms commit=2ms]                                                                                                      
ip-172-31-34-186> I190325 20:29:11.585460 9160 storage/replica_command.go:1119  [n9,s9,r43/6:/Table/55/1/65{08/6/…-45/5/…}] AdminRelocateRange: desc: r43:/Table/55/1/65{08/6/-1062-45/5/-461} [(n7,s7):4, (n9,s9):6, next=8, gen=1], targets: [n9,s9 n7,s7 n4,s4], addTargets: [(n4,s4):?], removeTargets: []

@bdarnell
Contributor

The removal comes from the replicate queue (you can see the ,replicate, in the log tags). It looks like you're racing against it and the queue noticed the over-replication. But when resolving over-replication, it's supposed to exempt the most recently added replica. Maybe there's something wrong around lastReplicaAdded?

In any case, this is why Replica.ChangeReplicas takes a range descriptor for the expected value, instead of blindly applying the given delta to whatever is there. This is lost when going through AdminChangeReplicasRequest. That request needs to gain a RangeDescriptor field so that when AdminRelocateRange encounters an unexpected RangeDescriptor, it can reset and recompute what changes need to be made.
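
Conceptually the expected-value check is just a compare-and-set on the descriptor, something like the following (simplified sketch; the real ChangeReplicas enforces it via a conditional put on the range descriptor key inside a transaction, and descsEqual is a hypothetical comparison helper):

// Refuse to apply the requested delta if the descriptor changed out from
// under the caller. Sketch only.
func checkExpectedDescriptor(current, expected *roachpb.RangeDescriptor) error {
    if expected == nil {
        return nil // no expectation supplied (legacy behavior)
    }
    if !descsEqual(current, expected) {
        return fmt.Errorf("change replicas of r%d aborted: descriptor changed from %s to %s",
            current.RangeID, expected, current)
    }
    return nil
}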

@ajwerner
Contributor

Indeed the other ranges also seem to have raced with the replicate queue. I'm going to attack this by first figuring out what's going on with the lastReplicaAdded thing and then look at adding a RangeDescriptor to AdminRelocateRange.

@bdarnell
Contributor

Another option that might be simpler than adding a descriptor to AdminChangeReplicas (not AdminRelocateRange): instead of computing all the deltas at once and then applying them, do it one at a time and refresh the range descriptor from the source each time.

Neither fixing lastReplicaAdded nor making AdminRelocateRange refresh the range descriptor is a complete solution; they just reduce the window of vulnerability. Only plumbing an expected value through AdminChangeReplicas is a complete solution.
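
A rough sketch of that one-change-at-a-time shape (readRangeDescriptor, nextChange, and applyChange are hypothetical helpers; as noted, this only narrows the window unless each step also carries the expected descriptor):

// Relocate by applying one replication change at a time, re-reading the
// descriptor before each step so concurrent changes are observed. Sketch only.
for {
    desc, err := readRangeDescriptor(ctx, startKey)
    if err != nil {
        return err
    }
    chg, done := nextChange(desc, targets) // compute a single add or remove
    if done {
        return nil // descriptor already matches the requested targets
    }
    // Passing desc as the expected value makes a racing change (e.g. from the
    // replicate queue) fail this step instead of silently compounding.
    if err := applyChange(ctx, desc, chg); err != nil {
        return err
    }
}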

@ajwerner
Contributor

Plumbing an expected RangeDescriptor into AdminChangeReplicas doesn't seem particularly involved.

That being said, refreshing the range descriptor will have to be part of this fix too (though maybe it can be provided by the error after a CPut failure). What's the idiomatic way to look up the range descriptor? Nothing screamed at me going through client and the api proto.

@tbg
Member Author

tbg commented Mar 26, 2019

if err := txn.GetProto(ctx, descKey, oldDesc); err != nil {
return err
}

is the code that reads it from the KV store. (Not sure if that's really your question.)

BTW, this whole AdminChangeReplicas sitting on the client.DB is nonsense; you can just use a client.Batch, call AddRawRequest, and then get the response from b.RawResponse (and see b.MustPErr() for getting the structured *roachpb.Error). I don't know if you want to pull on removing that stuff now (probably not), but whichever usage of it you come across, you probably want to replace it by peeling back the abstraction (raw requests).
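
For reference, the raw-request pattern looks roughly like this (a sketch from memory; the request fields are elided and the exact shapes may differ):

// Issue AdminChangeReplicas as a raw request instead of via the client.DB
// wrapper. Sketch only; fields on the request are elided.
b := &client.Batch{}
b.AddRawRequest(&roachpb.AdminChangeReplicasRequest{
    // RequestHeader, change type, targets, and (with this fix) the expected
    // RangeDescriptor would be filled in here.
})
if err := db.Run(ctx, b); err != nil {
    return err // see b.MustPErr() for the structured *roachpb.Error
}
resp := b.RawResponse().Responses[0].GetInner().(*roachpb.AdminChangeReplicasResponse)
_ = resp // e.g. the updated RangeDescriptor once the response gains one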

@ajwerner
Contributor

Cool, I just needed GetProto and keys.RangeDescriptorKey. I'm going to start with a minimal change and we can see how far to take it in review.

@ajwerner
Contributor

Before I looked too carefully, I thought this might be some issue with the propagation of the key that prevents merges for ranges being imported, but upon deeper analysis it's clear that the import for the table in question had completed.

An interesting observation here is that we replicated these ranges into unavailability over two minutes after node 7 was killed (190325 20:26:59). The fix discussed above will protect us from this issue causing unavailability in the future and is worthwhile, but I wonder if it's also worth adding logic to the merge queue to ensure that all of the current replicas of both sides are alive. I'll add logic to check liveness and avoid attempting to merge ranges that have replicas on dead nodes in a separate PR.
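
Something along these lines, as a sketch (livenessMap is a hypothetical NodeID-to-liveness lookup; the real check would hook into the merge queue's processing path):

// replicasAllLive reports whether every replica of the given descriptors sits
// on a live node. The merge queue could skip merges when this returns false.
func replicasAllLive(livenessMap map[roachpb.NodeID]bool, descs ...*roachpb.RangeDescriptor) bool {
    for _, desc := range descs {
        for _, repl := range desc.Replicas {
            if !livenessMap[repl.NodeID] {
                return false
            }
        }
    }
    return true
}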

@ajwerner
Contributor

The first of the PRs is up. The plan is to add two more.

  1. Add some low-level sanity checks regarding liveness to Replica.changeReplicas (a rough sketch follows below)
    • refuse to add a replica on a down node
    • refuse to remove a live replica if there already exists a replica on a down node
    • any others?
  2. Add some liveness checking to the merge queue
    • less important given 1) but still probably worth adding
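
A rough sketch of what the checks in 1) could look like (isLive is a hypothetical liveness lookup; not the final implementation):

// checkReplicationChange applies the liveness sanity checks listed above
// before a replication change is carried out. Sketch only.
func checkReplicationChange(
    changeType roachpb.ReplicaChangeType,
    target roachpb.ReplicationTarget,
    desc *roachpb.RangeDescriptor,
    isLive func(roachpb.NodeID) bool,
) error {
    switch changeType {
    case roachpb.ADD_REPLICA:
        if !isLive(target.NodeID) {
            return fmt.Errorf("refusing to add replica on non-live node n%d", target.NodeID)
        }
    case roachpb.REMOVE_REPLICA:
        if !isLive(target.NodeID) {
            return nil // removing a replica on a down node is not refused by these checks
        }
        // Removing a live replica while another replica sits on a dead node
        // shrinks the live quorum; refuse it.
        for _, repl := range desc.Replicas {
            if repl.NodeID != target.NodeID && !isLive(repl.NodeID) {
                return fmt.Errorf("refusing to remove live replica on n%d while n%d is non-live",
                    target.NodeID, repl.NodeID)
            }
        }
    }
    return nil
}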

ajwerner added a commit to ajwerner/cockroach that referenced this issue Mar 28, 2019
…ateRange

This commit adds a new ExpDesc RangeDescriptor field to the
AdminChangeReplicasRequest, which is provided as the expected current
RangeDescriptor to the ChangeReplicas call. Providing this argument prevents
hazards caused by concurrent calls to ChangeReplicas. Given that the
AdminChangeReplicas RPC allows clients to specify multiple changes, the
provided expectation only directly applies to the first change. Subsequent
calls to ChangeReplicas use the value of the RangeDescriptor returned from
the previous call, so any concurrent modification of the RangeDescriptor
will lead to a failure. The change also adds a response field to the
AdminChangeReplicasResponse, which is the updated value of the
RangeDescriptor upon successful completion of the request.

The client.DB interface is updated to adopt these new fields.

Lastly this change is adopted inside of AdminRelocateRange to prevent dangerous
replication scenarios as observed in cockroachdb#36025.

Release note: None
craig bot pushed a commit that referenced this issue Mar 30, 2019
36244: Add expected RangeDescriptor to AdminChangeReplicas and adopt r=ajwerner a=ajwerner

This PR comes in two commits. The first updates the roachpb API and the second adopts the update. This is the first of the PRs that will address #36025. See the commit messages for more details.

Co-authored-by: Andrew Werner <[email protected]>
@a-robinson
Contributor

The removal comes from the replicate queue (you can see the ,replicate, in the log tags). It looks like you're racing against it and the queue noticed the over-replication

I'm not totally sure how I ended up here late on a Friday night (I blame @justinj), but that looks a lot like #31287, which as of a couple of months ago I would have been very scared about pushing a new release without fixing. I haven't looked through your fix for this, but hopefully it also covers that.

@ajwerner
Contributor

ajwerner commented Apr 6, 2019

I haven't looked through your fix for this, but hopefully it also covers that.

I’m not sure that it totally fixes the test flake but it does fix the potential for under-replication due to the race. Thanks for bringing the issue to my attention.

craig bot pushed a commit that referenced this issue Jul 12, 2019
38843: roachpb: make AdminChangeReplicasRequest.ExpDesc non-nullable r=ajwerner a=ajwerner

Prior to 19.1, AdminChangeReplicas did not take an expected range descriptor
value, which allowed for races with other changes to the range descriptor. For
backwards compatibility, the expectation was left as an optional field. This
commit completes the migration and makes ExpDesc a required, non-nullable
field. Clients that supply a zero-value ExpDesc will now receive an error.

Relates to #36025.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@bdarnell bdarnell removed their assignment Jul 31, 2019
@github-actions

github-actions bot commented Jun 4, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@andreimatei
Contributor

@tbg do you know anything about this issue?

@tbg
Member Author

tbg commented Jun 8, 2021

We have since added low-level checks that make sure (to the degree it can reasonably be done) that the "new majority" is live.

@tbg tbg closed this as completed Jun 8, 2021