storage: replicated into unavailability #36025
Comments
I'm going to add logging around RelocateRange and work on reproducing.
The good news is that this is readily reproducible. We start with three replicas and we want to add one and remove one. For reasons I don't yet understand, we quickly add and then remove the new replica (on …).
The removal comes from the replicate queue (you can see the …). In any case, this is why Replica.ChangeReplicas takes a range descriptor for the expected value, instead of blindly applying the given delta to what's there. This is lost when going through AdminChangeReplicasRequest. That request needs to gain a RangeDescriptor field so that when AdminRelocateRange encounters an unexpected RangeDescriptor it can reset and recompute what changes need to be made.
Indeed, the other ranges also seem to have raced with the replicate queue. I'm going to attack this by first figuring out what's going on with the lastReplicaAdded thing, and then look at adding a RangeDescriptor to AdminRelocateRange.
Another option that might be simpler than adding a descriptor to AdminChangeReplicas (not AdminRelocateRange): instead of computing all the deltas at once and then applying them, do it one at a time and refresh the range descriptor from the source each time. Neither fixing lastReplicaAdded nor making AdminRelocateRange refresh the range descriptor is a complete solution; they just reduce the window of vulnerability. Only plumbing an expected value through AdminChangeReplicas is a complete solution.
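For illustration, here is a minimal Go sketch of that "one change at a time, refresh each time" approach. The types and names (`readDescriptor`, `changeReplicas`, `errStaleDescriptor`, `ReplicationChange`) are hypothetical stand-ins, not the actual CockroachDB API; the real ChangeReplicas plumbing and error types differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the range descriptor and change-replicas
// plumbing; the real CockroachDB types and signatures differ.
type RangeDescriptor struct {
	Generation int
	Replicas   []string
}

type ReplicationChange struct {
	Add    bool
	Target string
}

// errStaleDescriptor models the CPut failure ChangeReplicas returns when the
// descriptor changed underneath us (e.g. the replicate queue raced with us).
var errStaleDescriptor = errors.New("descriptor changed since it was read")

// readDescriptor and changeReplicas are placeholders for "look up the current
// range descriptor" and "ChangeReplicas with an expected descriptor".
func readDescriptor(rangeID int) (RangeDescriptor, error) {
	return RangeDescriptor{}, nil
}

func changeReplicas(rangeID int, exp RangeDescriptor, c ReplicationChange) (RangeDescriptor, error) {
	return exp, nil
}

// relocateOneAtATime applies each change individually, re-reading the
// descriptor before every attempt so that a concurrent change shows up as a
// mismatch instead of being silently overwritten.
func relocateOneAtATime(rangeID int, changes []ReplicationChange) error {
	for _, c := range changes {
		for {
			exp, err := readDescriptor(rangeID) // refresh from the source of truth
			if err != nil {
				return err
			}
			if _, err := changeReplicas(rangeID, exp, c); err == nil {
				break
			} else if !errors.Is(err, errStaleDescriptor) {
				return err
			}
			// Stale descriptor: someone else changed the range. Loop and retry
			// against the new state; a real implementation would also re-derive
			// which changes are still needed at this point.
		}
	}
	return nil
}

func main() {
	changes := []ReplicationChange{{Add: true, Target: "n4"}, {Add: false, Target: "n7"}}
	fmt.Println(relocateOneAtATime(1, changes))
}
```

As noted above, this only narrows the window: without an expected value plumbed all the way through, a change can still slip in between the read and the write.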
Plumbing an expected … That being said, refreshing the range descriptor will have to be a part of this fix too (though maybe it can be provided by the error after the CPut failure). What's the idiomatic way to go look up the range descriptor? Nothing screamed at me going through client and the api proto.
cockroach/pkg/storage/replica_command.go, lines 785 to 787 in 56b79f1, is the code that reads it from the … BTW, this whole …
Cool, I just needed …
Before I looked too carefully, I thought this might be some issue with the propagation of the key which prevents merges for ranges being imported, but upon deeper analysis it's clear that the import for the table in question had completed. An interesting observation here is that we replicated these ranges into unavailability over 2 minutes after node 7 was killed (190325 20:26:59). The fix discussed above will protect us from this issue making us unavailable in the future and is worthwhile, but I wonder if it's worth adding logic to the merge queue to ensure that all of the current replicas of both sides are alive. I'll add logic to check liveness and avoid attempting to merge ranges which have replicas on dead nodes in a separate PR.
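A rough sketch of what such a merge-queue pre-check could look like, assuming a simplified descriptor and liveness map (`rangeDesc`, `livenessMap`, and the node IDs are made up for illustration, not the real merge-queue code):

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the real types: a descriptor lists
// the node IDs holding replicas, and a liveness map says which nodes are live.
type rangeDesc struct{ replicaNodes []int }
type livenessMap map[int]bool

// allReplicasLive reports whether every replica of the range sits on a live
// node: the extra check discussed above would skip the merge otherwise.
func allReplicasLive(d rangeDesc, live livenessMap) bool {
	for _, n := range d.replicaNodes {
		if !live[n] {
			return false
		}
	}
	return true
}

func shouldAttemptMerge(lhs, rhs rangeDesc, live livenessMap) bool {
	return allReplicasLive(lhs, live) && allReplicasLive(rhs, live)
}

func main() {
	live := livenessMap{1: true, 2: true, 3: true, 7: false} // node 7 is dead
	lhs := rangeDesc{replicaNodes: []int{1, 2, 3}}
	rhs := rangeDesc{replicaNodes: []int{1, 2, 7}} // has a replica on the dead node
	fmt.Println(shouldAttemptMerge(lhs, rhs, live)) // false: don't merge yet
}
```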
The first of the PRs is up. The plan is to add two more.
…ateRange This commit adds a new ExpDesc RangeDescriptor field to the AdminChangeReplicasRequest, which is provided as the expected current RangeDescriptor to the ChangeReplicas call. Providing this argument prevents hazards caused by concurrent calls to ChangeReplicas. Given that the AdminChangeReplicas RPC allows clients to specify multiple changes, the provided expectation only directly applies to the first change. Subsequent calls to ChangeReplicas use the value of the RangeDescriptor as returned from the previous call to ChangeReplicas, and thus any concurrent modification of the RangeDescriptor will lead to a failure. The change also adds a response field to the AdminChangeReplicasResponse which is the updated value of the RangeDescriptor upon successful completion of the request. The client.DB interface is updated to adopt these new fields. Lastly, this change is adopted inside of AdminRelocateRange to prevent dangerous replication scenarios as observed in cockroachdb#36025. Release note: None
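To illustrate the chaining described in the commit message, here is a toy Go model of the request/response flow. The struct shapes (`ExpDesc`, `Desc`) loosely mirror the fields described above but are not the real proto-generated roachpb types, and `changeReplicas` is a stand-in for the actual call:

```go
package main

import (
	"errors"
	"fmt"
)

// Toy shapes loosely mirroring the fields described in the commit message;
// the real proto-generated roachpb types differ.
type RangeDescriptor struct{ Generation int }

type AdminChangeReplicasRequest struct {
	ExpDesc RangeDescriptor // expected current descriptor, used as a CAS expectation
	// ... the requested changes are elided here
}

type AdminChangeReplicasResponse struct {
	Desc RangeDescriptor // updated descriptor after the change was applied
}

var errMismatch = errors.New("expected descriptor does not match the current one")

// changeReplicas models a single ChangeReplicas call: it fails if the
// expectation is stale and otherwise returns the bumped descriptor.
func changeReplicas(current *RangeDescriptor, req AdminChangeReplicasRequest) (AdminChangeReplicasResponse, error) {
	if req.ExpDesc != *current {
		return AdminChangeReplicasResponse{}, errMismatch
	}
	current.Generation++
	return AdminChangeReplicasResponse{Desc: *current}, nil
}

func main() {
	state := RangeDescriptor{Generation: 1}

	// The caller's expectation only guards the first change; each subsequent
	// change uses the descriptor returned by the previous one, so any
	// concurrent modification surfaces as a mismatch error.
	exp := RangeDescriptor{Generation: 1}
	for i := 0; i < 3; i++ {
		resp, err := changeReplicas(&state, AdminChangeReplicasRequest{ExpDesc: exp})
		if err != nil {
			fmt.Println("aborting:", err)
			return
		}
		exp = resp.Desc // chain the returned descriptor into the next request
	}
	fmt.Println("applied all changes, final generation:", state.Generation)
}
```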
36244: Add expected RangeDescriptor to AdminChangeReplicas and adopt r=ajwerner a=ajwerner This PR comes in two commits. The first updates the roachpb api and the second adopts the update. This is the first of the PRs which will address #36025. See the commit messages for more details. Co-authored-by: Andrew Werner <[email protected]>
I'm not totally sure how I ended up here late on a Friday night (I blame @justinj), but that looks a lot like #31287, which as of a couple of months ago I would have been very scared to push a new release without fixing. I haven't looked through your fix for this, but hopefully it also covers that.
I’m not sure that it totally fixes the test flake but it does fix the potential for under-replication due to the race. Thanks for bringing the issue to my attention.
38843: roachpb: make AdminChangeReplicasRequest.ExpDesc non-nullable r=ajwerner a=ajwerner Prior to 19.1, AdminChangeReplicas did not take an expected range descriptor value, which allowed for races with other changes to the range descriptor. For backwards compatibility, the expectation was left as an optional field. This commit completes the migration and makes ExpDesc a required, non-nullable field. Clients which supply a zero-value ExpDesc will now receive an error. Relates to #36025. Release note: None Co-authored-by: Andrew Werner <[email protected]>
We have marked this issue as stale because it has been inactive for …
@tbg do you know anything about this issue?
We have since added low-level checks that make sure (to the degree it can reasonably be done) that the "new majority" is live.
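As a rough illustration of such a check (not the actual code), a change could be rejected whenever the voters remaining after it no longer include a live quorum:

```go
package main

import "fmt"

// newMajorityLive is a toy model of the "new majority is live" check: given
// the voters after a proposed change and a liveness map, require that the
// live voters alone still form a quorum. Names and shapes are illustrative,
// not CockroachDB's.
func newMajorityLive(votersAfterChange []int, live map[int]bool) bool {
	liveCount := 0
	for _, n := range votersAfterChange {
		if live[n] {
			liveCount++
		}
	}
	quorum := len(votersAfterChange)/2 + 1
	return liveCount >= quorum
}

func main() {
	live := map[int]bool{1: true, 2: true, 7: false}
	// Shrinking to {2, 7} with node 7 dead leaves one live voter out of two:
	// no quorum, so this change should be rejected.
	fmt.Println(newMajorityLive([]int{2, 7}, live)) // false
	// Keeping {1, 2, 7} still has a live quorum of two out of three.
	fmt.Println(newMajorityLive([]int{1, 2, 7}, live)) // true
}
```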
Apparently there are situations in which ranges become unavailable with only one dead node. Assuming a replication factor of three and a range that is initially fully replicated, no rebalancing decision that would result in a group size of two should ever be carried out (since we always up-replicate before down-replicating).
For example, below we can see a range that was at one point 5x replicated and then got shrunk to 2x, at which point it became unavailable since one of its replicas resides on a dead node. This kind of replication decision simply amounts to a bug.
Access to replication is granted relatively freely via RelocateRanges. The algorithm in RelocateRanges isn't aware that it might be doing harm. It is likely what did harm here, perhaps triggered through merges, though I wasn't able to figure out the specifics without really digging in. My initial analysis was this:
Hrm, the merge queue might try to relocate the range without printing anything to the logs (there's nothing interesting there):
cockroach/pkg/storage/merge_queue.go, lines 271 to 292 in 7eed200
RelocateRange doesn't look like it's very smart about staying out of dangerous replication configs. It compiles a list of adds and removes it needs to carry out. In this case, we initially have five replicas (understanding why that is, who knows). Let's assume the merge queue wants to colocate with another range which has three replicas, and that only the first two overlap. To colocate, it thus has to remove three replicas and add one. Unfortunately, the algorithm first performs whichever kind of operation has more steps remaining (on a tie, it prefers to add). So it will do a removal, and another removal, and then it should do an addition (but we see a third removal). If it tried to colocate with another range that only had two replicas, we'd get the third removal. (The cluster only has 7 nodes, so there's no way that the other range overlapped less with three replicas.) Then the question becomes why that other range has two replicas. The left neighboring range that r2948 originally split from doesn't exist any more, so it must've been merged away at some point, making it reasonable to assume that it might've tried to merge its right neighbor (r2948), too.
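A small sketch of the interleaving rule described above (the function and node names are illustrative, not the actual RelocateRange code) shows how a fully-overlapping two-replica target yields three back-to-back removals, shrinking the range from five replicas to two:

```go
package main

import "fmt"

// interleave mimics the described ordering: at each step, do whichever kind
// of operation has more remaining, preferring an addition on a tie.
func interleave(adds, removes []string) []string {
	var ops []string
	for len(adds) > 0 || len(removes) > 0 {
		if len(removes) > len(adds) {
			ops = append(ops, "remove "+removes[0])
			removes = removes[1:]
		} else {
			ops = append(ops, "add "+adds[0])
			adds = adds[1:]
		}
	}
	return ops
}

func main() {
	// Five current replicas, two-replica target that fully overlaps: three
	// removals and no additions, so the range shrinks straight to two
	// replicas with nothing forcing an up-replication in between.
	fmt.Println(interleave(nil, []string{"n3", "n4", "n5"}))
	// Three-replica target, two overlapping: remove, remove, add, remove,
	// which shows how removals front-run the single addition.
	fmt.Println(interleave([]string{"n6"}, []string{"n3", "n4", "n5"}))
}
```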
Either way, there's something to file here. You should try disabling the merge queue in what you're testing (assuming it isn't relevant; did we see merges during the inconsistency you're chasing?) to see if the problem goes away. Taking a bit of a step back, this is just yet another instance of how brittle our replication code is. Lots of places get to change the replication config, but only some of them actually consider the impact on availability.
We need to make sure that no replication change that would result in straightforward unavailability is permitted. All of these checks currently live in the allocator, which tries not to suggest such changes. However, RelocateRange intentionally does not use the allocator, because it does not want to be bound by constraints. This area of the code needs a general reworking.
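One possible shape for such a guard, sketched under the invariant stated at the top of this issue (never shrink below the target replication factor while additions are still pending); the names and the targetSize parameter are illustrative, not CockroachDB code:

```go
package main

import "fmt"

// guardedOrder interleaves additions and removals but never lets a removal
// shrink the group below targetSize while an addition is still pending
// (up-replicate before down-replicating). Purely illustrative.
func guardedOrder(adds, removes []string, currentSize, targetSize int) []string {
	var ops []string
	for len(adds) > 0 || len(removes) > 0 {
		removalUnsafe := currentSize-1 < targetSize
		if len(removes) == 0 || (len(adds) > 0 && removalUnsafe) {
			ops = append(ops, "add "+adds[0])
			adds = adds[1:]
			currentSize++
		} else {
			ops = append(ops, "remove "+removes[0])
			removes = removes[1:]
			currentSize--
		}
	}
	return ops
}

func main() {
	// Three replicas, one of them on a dead node: swap it for a new one. The
	// guard forces the addition first, so the range never drops to two replicas.
	fmt.Println(guardedOrder([]string{"n6"}, []string{"n7"}, 3, 3))
	// Prints: [add n6 remove n7]
}
```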