storage: replicated into unavailability #36025

Closed
tbg opened this issue Mar 21, 2019 · 15 comments

@tbg
Member

tbg commented Mar 21, 2019

Apparently there are situations in which ranges become unavailable with only one dead node. Assuming a replication factor of three and a range that is initially fully replicated, no rebalancing decision that results in a group size of two should ever be carried out (since we always up-replicate before down-replicating).

For example, below we can see a range that was at one point 5x replicated and then got shrunk to 2x, at which point it became unavailable because one of its replicas resides on a dead node. This kind of replication decision simply amounts to a bug.

Access to replication changes is granted relatively freely via RelocateRange. The algorithm in RelocateRange isn't aware that it might be doing harm, and it is likely what did the harm here, perhaps triggered through merges, though I wasn't able to figure out the specifics without really digging in.

My initial analysis was this:

I just looked at your cluster and it looks the same - we're in a two-replica config and one of the peers is dead. The range log shows that the range split, then grew to five replicas, and then went down to two, all through "admin requests". This is storagepb.ReasonAdminRequest, which is used when a ChangeReplicas is sent to the replica in a Batch. In production code, the only code path that does this is Store.AdminRelocateRange, which - nasty code, but I'll save that rant - is only called by the merge queue, relocateNode (I assume you're not manually relocating things, so that one is out), and the StoreRebalancer. So look at whether the merge queue or the store rebalancer did something to that range.

Hrm, the merge queue might try to relocate the range without printing anything to the logs (there's nothing interesting there):

if !replicaSetsEqual(lhsDesc.Replicas, rhsDesc.Replicas) {
    var targets []roachpb.ReplicationTarget
    for _, lhsReplDesc := range lhsDesc.Replicas {
        targets = append(targets, roachpb.ReplicationTarget{
            NodeID: lhsReplDesc.NodeID, StoreID: lhsReplDesc.StoreID,
        })
    }
    lease, _ := lhsRepl.GetLease()
    for i := range targets {
        if targets[i].NodeID == lease.Replica.NodeID && targets[i].StoreID == lease.Replica.StoreID {
            if i > 0 {
                targets[0], targets[i] = targets[i], targets[0]
            }
            break
        }
    }
    // TODO(benesch): RelocateRange can sometimes fail if it needs to move a replica
    // from one store to another store on the same node.
    if err := mq.store.DB().AdminRelocateRange(ctx, rhsDesc.StartKey, targets); err != nil {
        return err
    }
}

RelocateRange doesn't look like it's very smart about staying out of dangerous replication configs. It compiles a list of adds and removes it needs to carry out. In this case, we initially have five replicas (why that is is a separate question). Let's assume the merge queue wants to colocate with another range which has three replicas, and say only the first two overlap. To colocate, it thus has to remove three replicas and add one. Unfortunately, the algorithm first performs whichever kind of operation has more steps left (on a tie, it prefers to add). So it will do a removal, and another removal, but then it should do an addition (yet we see a third removal). If it tried to colocate with another range that only had two replicas, we'd get the third removal. (The cluster only has 7 nodes, so there's no way that the other range overlapped less with three replicas.)

Then the question becomes: why does that other range have two replicas? The left neighboring range that r2948 originally split from doesn't exist any more, so it must've been merged away at some point, making it reasonable to assume that it might've tried to merge its right neighbor (r2948), too.
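
To make that ordering concrete, here's a toy sketch of the "do whichever kind of change has more steps left, prefer adds on a tie" loop as described above (hypothetical helper, not the actual RelocateRange code):

// relocateOrder is a toy model of the ordering described above: it drains
// whichever list has more entries remaining, preferring an add on ties.
// With one add and three removes, the first two steps are removals, which
// is exactly the shrink-before-grow behavior observed here.
func relocateOrder(adds, removes []string) []string {
    var ops []string
    for len(adds) > 0 || len(removes) > 0 {
        if len(adds) >= len(removes) {
            ops = append(ops, "add "+adds[0])
            adds = adds[1:]
        } else {
            ops = append(ops, "remove "+removes[0])
            removes = removes[1:]
        }
    }
    return ops
}

// relocateOrder([]string{"n4"}, []string{"n1", "n2", "n3"})
// => [remove n1, remove n2, add n4, remove n3]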

Either way, there's something to file here. You should try disabling the merge queue in what you're testing (assuming it isn't relevant; did we see merges during the inconsistency you're chasing?) to see if the problem goes away. Taking a bit of a step back, this is just yet another instance of how brittle our replication code is: lots of places get to change the replication config, but only some of them actually consider the impact on availability.

[image: range log screenshot showing the range growing to five replicas and then shrinking to two via admin requests]

We need to make sure that no replication change that would result in straightforward unavailability is permitted. All of these checks currently live in the allocator, which tries not to suggest such changes. However, RelocateRange intentionally does not use the allocator, because it does not want to be bound by constraints. This area of the code needs a general reworking.
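
As a sketch of the kind of guard I mean (isLive is a hypothetical liveness callback; the real check would have to live wherever replication changes are actually applied, not just in the allocator):

// wouldLoseQuorum reports whether the proposed post-change replica set would
// lack a live raft majority. Sketch only; isLive stands in for whatever
// liveness information the real check would consult.
func wouldLoseQuorum(proposed []roachpb.ReplicaDescriptor, isLive func(roachpb.NodeID) bool) bool {
    live := 0
    for _, r := range proposed {
        if isLive(r.NodeID) {
            live++
        }
    }
    return live < len(proposed)/2+1 // need a strict majority of live replicas
}

Any code path about to commit a replication change could compute the post-change replica set and refuse to proceed when this returns true.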

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting labels Mar 21, 2019
@tbg tbg added this to the 19.2 milestone Mar 21, 2019
@ajwerner
Contributor

I'm going to add logging around RelocateRange and work on reproducing.

@ajwerner ajwerner self-assigned this Mar 25, 2019
@ajwerner
Contributor

The good news is that this is readily reproducible. We start with three replicas and we want to add one and remove one. For reasons I don't yet understand, we quickly add and then remove the new replica (on n4) and then proceed to remove n6 as planned, leaving us with 2 replicas.

ip-172-31-39-197> I190325 20:29:07.272294 16158 storage/replica_command.go:1119  [n6,s6,r43/5:/Table/55/1/65{08/6/…-45/5/…}] AdminRelocateRange: desc: r43:/Table/55/1/65{08/6/-1062-45/5/-461} [(n6,s6):5, (n9,s9):6, (n7,s7):4, next=7, gen=1], targets: [n9,s9 n7,s7 n4,s4], addTargets: [(n4,s4):?], removeTargets: [(n6,s6):?] 
ip-172-31-39-197> I190325 20:29:07.272743 16158 storage/store_snapshot.go:775  [n6,s6,r43/5:/Table/55/1/65{08/6/…-45/5/…}] sending preemptive snapshot e59cc9fe at applied index 36                               
ip-172-31-39-197> I190325 20:29:07.272852 16158 storage/store_snapshot.go:818  [n6,s6,r43/5:/Table/55/1/65{08/6/…-45/5/…}] streamed snapshot to (n4,s4):?: kv pairs: 11, log entries: 0, rate-limit: 8.0 MiB/sec, 0.00s
ip-172-31-37-199> I190325 20:29:07.273155 36722 storage/replica_raftstorage.go:814  [n4,s4,r43/?:{-}] applying preemptive snapshot at index 36 (id=e59cc9fe, encoded size=846, 1 rocksdb batches, 0 log entries)  
ip-172-31-37-199> I190325 20:29:07.274950 36722 storage/replica_raftstorage.go:820  [n4,s4,r43/?:/Table/55/1/65{08/6/…-45/5/…}] applied preemptive snapshot in 2ms [clear=0ms batch=0ms entries=0ms commit=2ms]   
ip-172-31-39-197> I190325 20:29:07.275399 16158 storage/replica_command.go:788  [n6,s6,r43/5:/Table/55/1/65{08/6/…-45/5/…}] change replicas (ADD_REPLICA (n4,s4):7): read existing descriptor r43:/Table/55/1/65{08/6/-1062-45/5/-461} [(n6,s6):5, (n9,s9):6, (n7,s7):4, next=7, gen=1]
ip-172-31-39-197> I190325 20:29:07.287446 16158 storage/replica_raft.go:394  [n6,s6,r43/5:/Table/55/1/65{08/6/…-45/5/…}] proposing ADD_REPLICA((n4,s4):7): updated=[(n6,s6):5 (n9,s9):6 (n7,s7):4 (n4,s4):7] next=8
ip-172-31-34-186> I190325 20:29:07.301543 14436 storage/replica_command.go:788  [n9,replicate,s9,r43/6:/Table/55/1/65{08/6/…-45/5/…}] change replicas (REMOVE_REPLICA (n4,s4):7): read existing descriptor r43:/Table/55/1/65{08/6/-1062-45/5/-461} [(n6,s6):5, (n9,s9):6, (n7,s7):4, (n4,s4):7, next=8, gen=1]
ip-172-31-34-186> I190325 20:29:07.307470 14436 storage/replica_raft.go:394  [n9,s9,r43/6:/Table/55/1/65{08/6/…-45/5/…}] proposing REMOVE_REPLICA((n4,s4):7): updated=[(n6,s6):5 (n9,s9):6 (n7,s7):4] next=8      
ip-172-31-37-199> I190325 20:29:07.317582 36748 storage/store.go:2467  [n4,replicaGC,s4,r43/7:/Table/55/1/65{08/6/…-45/5/…}] removing replica r43/7                                                               
ip-172-31-37-199> I190325 20:29:07.320198 36748 storage/replica_destroy.go:150  [n4,replicaGC,s4,r43/7:/Table/55/1/65{08/6/…-45/5/…}] removed 6 (0+6) keys in 3ms [clear=0ms commit=2ms]                                                                                                      
ip-172-31-34-186> I190325 20:29:11.585460 9160 storage/replica_command.go:1119  [n9,s9,r43/6:/Table/55/1/65{08/6/…-45/5/…}] AdminRelocateRange: desc: r43:/Table/55/1/65{08/6/-1062-45/5/-461} [(n7,s7):4, (n9,s9):6, next=8, gen=1], targets: [n9,s9 n7,s7 n4,s4], addTargets: [(n4,s4):?], removeTargets: []

@bdarnell
Contributor

The removal comes from the replicate queue (you can see the ,replicate, in the log tags). It looks like you're racing against it and the queue noticed the over-replication. But when resolving over-replication, it's supposed to exempt the most recently added replica. Maybe there's something wrong around lastReplicaAdded?

In any case, this is why Replica.ChangeReplicas takes a range descriptor for the expected value, instead of blindly applying the given delta to whatever is there. This is lost when going through AdminChangeReplicasRequest. That request needs to gain a RangeDescriptor field so that when AdminRelocateRange encounters an unexpected RangeDescriptor, it can reset and recompute what changes need to be made.
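
Conceptually the expected-value check is just a compare-and-set on the descriptor, something like the following (simplified sketch; the real ChangeReplicas enforces it via a conditional put on the range descriptor key inside a transaction, and descsEqual is a hypothetical comparison helper):

// Refuse to apply the requested delta if the descriptor changed out from
// under the caller. Sketch only.
func checkExpectedDescriptor(current, expected *roachpb.RangeDescriptor) error {
    if expected == nil {
        return nil // no expectation supplied (legacy behavior)
    }
    if !descsEqual(current, expected) {
        return fmt.Errorf("change replicas of r%d aborted: descriptor changed from %s to %s",
            current.RangeID, expected, current)
    }
    return nil
}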

@ajwerner
Contributor

Indeed the other ranges also seem to have raced with the replicate queue. I'm going to attack this by first figuring out what's going on with the lastReplicaAdded thing and then look at adding a RangeDescriptor to AdminRelocateRange.

@bdarnell
Contributor

Another option that might be simpler than adding a descriptor to AdminChangeReplicas (not AdminRelocateRange): instead of computing all the deltas at once and then applying them, do it one at a time and refresh the range descriptor from the source each time.

Neither fixing lastReplicaAdded nor making AdminRelocateRange refresh the range descriptor is a complete solution; they just reduce the window of vulnerability. Only plumbing an expected value through AdminChangeReplicas is a complete solution.
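
A rough sketch of that one-change-at-a-time shape (readRangeDescriptor, nextChange, and applyChange are hypothetical helpers; as noted, this only narrows the window unless each step also carries the expected descriptor):

// Relocate by applying one replication change at a time, re-reading the
// descriptor before each step so concurrent changes are observed. Sketch only.
for {
    desc, err := readRangeDescriptor(ctx, startKey)
    if err != nil {
        return err
    }
    chg, done := nextChange(desc, targets) // compute a single add or remove
    if done {
        return nil // descriptor already matches the requested targets
    }
    // Passing desc as the expected value makes a racing change (e.g. from the
    // replicate queue) fail this step instead of silently compounding.
    if err := applyChange(ctx, desc, chg); err != nil {
        return err
    }
}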

@ajwerner
Contributor

Plumbing an expected RangeDescriptor into AdminChangeReplicas doesn't seem particularly involved.

That being said, refreshing the range descriptor will have to be part of this fix too (though maybe it can be provided by the error after a CPut failure). What's the idiomatic way to look up the range descriptor? Nothing screamed at me going through client and the api proto.

@tbg
Member Author

tbg commented Mar 26, 2019

if err := txn.GetProto(ctx, descKey, oldDesc); err != nil {
return err
}

is the code that reads it from the KV store. (Not sure if that's really your question.)

BTW, this whole AdminChangeReplicas sitting on the client.DB is nonsense; you can just use a client.Batch, call AddRawRequest, and then get the response from b.RawResponse (and see b.MustPErr() for getting the structured *roachpb.Error). I don't know if you want to pull on removing that stuff now (probably not), but whichever usage of it you come across, you probably want to replace it by peeling back the abstraction (raw requests).
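
For reference, the raw-request pattern looks roughly like this (a sketch from memory; the request fields are elided and the exact shapes may differ):

// Issue AdminChangeReplicas as a raw request instead of via the client.DB
// wrapper. Sketch only; fields on the request are elided.
b := &client.Batch{}
b.AddRawRequest(&roachpb.AdminChangeReplicasRequest{
    // RequestHeader, change type, targets, and (with this fix) the expected
    // RangeDescriptor would be filled in here.
})
if err := db.Run(ctx, b); err != nil {
    return err // see b.MustPErr() for the structured *roachpb.Error
}
resp := b.RawResponse().Responses[0].GetInner().(*roachpb.AdminChangeReplicasResponse)
_ = resp // e.g. the updated RangeDescriptor once the response gains one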

@ajwerner
Contributor

Cool, I just needed GetProto and keys.RangeDescriptorKey. I'm going to start with a minimal change and we can see how far to take it in review.

@ajwerner
Contributor

Before I looked too carefully, I thought this might be some issue with the propagation of the key that prevents merges for ranges being imported, but upon deeper analysis it's clear that the import for the table in question had completed.

An interesting observation here is that we replicated these ranges into unavailability over two minutes after node 7 was killed (190325 20:26:59). The fix discussed above will protect us from this issue causing unavailability in the future and is worthwhile, but I wonder if it's also worth adding logic to the merge queue to ensure that all of the current replicas of both sides are alive. I'll add logic to check liveness and avoid attempting to merge ranges that have replicas on dead nodes in a separate PR.
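
Something along these lines, as a sketch (livenessMap is a hypothetical NodeID-to-liveness lookup; the real check would hook into the merge queue's processing path):

// replicasAllLive reports whether every replica of the given descriptors sits
// on a live node. The merge queue could skip merges when this returns false.
func replicasAllLive(livenessMap map[roachpb.NodeID]bool, descs ...*roachpb.RangeDescriptor) bool {
    for _, desc := range descs {
        for _, repl := range desc.Replicas {
            if !livenessMap[repl.NodeID] {
                return false
            }
        }
    }
    return true
}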

@ajwerner
Contributor

The first of the PRs is up. The plan is to add two more.

  1. Add some low-level sanity checks regarding liveness to Replica.changeReplicas (a rough sketch follows below)
    • refuse to add a replica on a down node
    • refuse to remove a live replica if there already exists a replica on a down node
    • any others?
  2. Add some liveness checking to the merge queue
    • less important given 1) but still probably worth adding
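
A rough sketch of what the checks in 1) could look like (isLive is a hypothetical liveness lookup; not the final implementation):

// checkReplicationChange applies the liveness sanity checks listed above
// before a replication change is carried out. Sketch only.
func checkReplicationChange(
    changeType roachpb.ReplicaChangeType,
    target roachpb.ReplicationTarget,
    desc *roachpb.RangeDescriptor,
    isLive func(roachpb.NodeID) bool,
) error {
    switch changeType {
    case roachpb.ADD_REPLICA:
        if !isLive(target.NodeID) {
            return fmt.Errorf("refusing to add replica on non-live node n%d", target.NodeID)
        }
    case roachpb.REMOVE_REPLICA:
        if !isLive(target.NodeID) {
            return nil // removing a replica on a down node is not refused by these checks
        }
        // Removing a live replica while another replica sits on a dead node
        // shrinks the live quorum; refuse it.
        for _, repl := range desc.Replicas {
            if repl.NodeID != target.NodeID && !isLive(repl.NodeID) {
                return fmt.Errorf("refusing to remove live replica on n%d while n%d is non-live",
                    target.NodeID, repl.NodeID)
            }
        }
    }
    return nil
}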

ajwerner added a commit to ajwerner/cockroach that referenced this issue Mar 28, 2019
…ateRange

This commit adds a new ExpDesc RangeDescriptor field to the
AdminChangeReplicasRequest, which is provided as the expected current
RangeDescriptor to the ChangeReplicas call. Providing this argument prevents
hazards caused by concurrent calls to ChangeReplicas. Given that the
AdminChangeReplicas RPC allows clients to specify multiple changes, the
provided expectation only directly applies to the first change. Subsequent
calls to ChangeReplicas use the value of the RangeDescriptor returned from
the previous call, so any concurrent modification of the RangeDescriptor
will lead to a failure. The change also adds a response field to the
AdminChangeReplicasResponse, which is the updated value of the
RangeDescriptor upon successful completion of the request.

The client.DB interface is updated to adopt these new fields.

Lastly this change is adopted inside of AdminRelocateRange to prevent dangerous
replication scenarios as observed in cockroachdb#36025.

Release note: None
craig bot pushed a commit that referenced this issue Mar 30, 2019
36244: Add expected RangeDescriptor to AdminChangeReplicas and adopt r=ajwerner a=ajwerner

This PR comes in two commits. The first updates the roachpb API and the second adopts the update. This is the first of the PRs that will address #36025. See the commit messages for more details.

Co-authored-by: Andrew Werner <[email protected]>
@a-robinson
Contributor

The removal comes from the replicate queue (you can see the ,replicate, in the log tags). It looks like you're racing against it and the queue noticed the over-replication

I'm not totally sure how I ended up here late on a Friday night (I blame @justinj), but that looks a lot like #31287, which as of a couple of months ago I would have been very scared about pushing a new release without fixing. I haven't looked through your fix for this, but hopefully it also covers that.

@ajwerner
Contributor

ajwerner commented Apr 6, 2019

I haven't looked through your fix for this, but hopefully it also covers that.

I’m not sure that it totally fixes the test flake but it does fix the potential for under-replication due to the race. Thanks for bringing the issue to my attention.

craig bot pushed a commit that referenced this issue Jul 12, 2019
38843: roachpb: make AdminChangeReplicasRequest.ExpDesc non-nullable r=ajwerner a=ajwerner

Prior to 19.1, AdminChangeReplicas did not take an expected range descriptor
value, which allowed for races with other changes to the range descriptor. For
backwards compatibility, the expectation was left as an optional field. This
commit completes the migration and makes ExpDesc a required, non-nullable
field. Clients that supply a zero-value ExpDesc will now receive an error.

Relates to #36025.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@bdarnell bdarnell removed their assignment Jul 31, 2019
@github-actions

github-actions bot commented Jun 4, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@andreimatei
Contributor

@tbg do you know anything about this issue?

@tbg
Member Author

tbg commented Jun 8, 2021

We have since added low-level checks that make sure (to the degree it can reasonably be done) that the "new majority" is live.

@tbg tbg closed this as completed Jun 8, 2021