allocator: voter constraint never satisfied when there are a correct number of replicas and all existing replicas are necessary #106559
Comments
I wonder if this would work with a normalized span config, where num_replicas=sum(num_replicas) in the constraints.
I can't see how this issue is easily fixed. When there are the correct number of voters and replicas, this case does not work because the voter we should demote is satisfying an all-replica constraint conjunction. The rebalance code assumes the rebalance-from store will be removed entirely, so an all-replica constraint would no longer be satisfied. We could hack in a check for this specific case when rebalancing voters, allowing the voter to be removed, which then triggers up-replication to add a non-voter. This is wasteful: we could instead demote the existing voter and promote the existing non-voter, incurring 0 snapshots. Ideally, something like the cockroach/pkg/kv/kvserver/allocator/allocator2/constraint.go Lines 1239 to 1252 in 5e6cc9e
It would also be preferable to create an |
Add a simulator reproduction for the voter constraint violation bug tracked in cockroachdb#106559. The reproduction uses 10 ranges, copying the locality setup seen in the linked issue. Initially, the span config specifies the final replica position seen in the issue, then after 15 minutes switches the span config to be identical to the issue. As expected, the conformance assertion fails due to voter constraint violations.

Part of: cockroachdb#106559
Epic: None
Release note: None
Previously, it was possible for a satisfiable voter constraint to never be satisfied when:

1. There were a correct number of `VOTER` and `NON_VOTER` replicas.
2. All existing replicas were necessary to satisfy a replica constraint, or voter constraint.

The allocator relies on the `RebalanceVoter` path to resolve voter constraint violations when there are a correct number of each replica type.

Candidates which are `necessary` to satisfy a constraint are ranked higher as rebalance targets than those which are not. Under most circumstances this leads to constraint conformance. However, when every existing replica is necessary to satisfy a replica constraint, and a voter constraint is unsatisfied, `RebalanceVoter` would not consider swapping a `VOTER` and `NON_VOTER` to satisfy the constraint.

For example, consider a setup where there are two stores, one in locality `a` and the other `b`, where some range has the following config and initial placement:

```
replicas          = a(non-voter) b(voter)
constraints       = a:1 b:1
voter_constraints = a:1
```

In this example, the only satisfiable placement is `a(voter)` `b(non-voter)`, which would require promoting `a(non-voter) -> a(voter)`, and demoting `b(voter) -> b(non-voter)`. However, both are necessary to satisfy `constraints`, leading to no rebalance occurring.

Add an additional field to the allocator candidate struct, which is used to sort rebalance candidates. The new field, `voterNecessary`, is sorted strictly after `necessary`, but before `diversityScore`.

The `voterNecessary` field can be true only when rebalancing voters, and when the rebalance candidate is necessary to satisfy a voter constraint, the rebalance candidate already has a non-voter, and the existing voter is not necessary to satisfy *any* voter constraint.

Note these rebalances are turned into swaps (promotion and demotion) in `plan.ReplicationChangesForRebalance`, so incur no snapshots.

Fixes: cockroachdb#98020
Fixes: cockroachdb#106559
Fixes: cockroachdb#108127

Release note (bug fix): Voter constraints which were never satisfied due to all existing replicas being considered necessary to satisfy a replica constraint, will now be satisfied by promoting existing non-voters.
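The ordering described in this commit message can be sketched as a three-level comparison. This is a minimal illustration only: the `candidate` struct, `boolRank`, and `compare` below are hypothetical stand-ins and not the allocator's actual types or functions; only the field names `necessary`, `voterNecessary`, and `diversityScore` come from the text above.

```go
package main

import "fmt"

// candidate is a hypothetical, reduced stand-in for the allocator's
// candidate struct. Only the fields named in the commit message are modeled.
type candidate struct {
	store          string
	necessary      bool    // needed to satisfy a replica constraint
	voterNecessary bool    // needed to satisfy a voter constraint
	diversityScore float64 // tie-breaker after the two flags
}

// boolRank maps false/true to 0/1 so flags can be compared in order.
func boolRank(b bool) int {
	if b {
		return 1
	}
	return 0
}

// compare returns a positive value if a ranks above b, a negative value if
// it ranks below, and 0 on a tie. Fields are compared in the order
// necessary, then voterNecessary, then diversityScore.
func compare(a, b candidate) int {
	if d := boolRank(a.necessary) - boolRank(b.necessary); d != 0 {
		return d
	}
	if d := boolRank(a.voterNecessary) - boolRank(b.voterNecessary); d != 0 {
		return d
	}
	switch {
	case a.diversityScore > b.diversityScore:
		return 1
	case a.diversityScore < b.diversityScore:
		return -1
	}
	return 0
}

func main() {
	// The two-store example: b holds the voter, a holds the non-voter,
	// and voter_constraints = a:1.
	existing := candidate{store: "b", necessary: true, diversityScore: 1}
	target := candidate{store: "a", necessary: true, voterNecessary: true, diversityScore: 1}
	fmt.Println(compare(target, existing) > 0)
}
```

With `voterNecessary` unset on both sides, the same inputs tie, which corresponds to the "no rebalance" outcome the commit describes.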
Add a simulator reproduction for the voter constraint violation bug tracked in cockroachdb#106559. The reproduction uses 10 ranges and a simplified version of the locality setup seen in the linked issue. Initially, the span config specifies the final replica position seen in the issue, then after 5 minutes switches the span config to require a voter <-> non-voter swap between the localities. As expected, the conformance assertion fails due to voter constraint violations.

A datadriven simulator command `SetNodeLocality` is added to simplify the reproduction.

Part of: cockroachdb#106559
Epic: None
Release note: None
111609: allocator: prioritize non-voter promotion to satisfy voter constraints r=sumeerbhola a=kvoli

Fixes: #98020
Fixes: #106559
Fixes: #108127

Co-authored-by: Austen McClernon
We now also do some normalization of constraints (in addition to voter constraints). This helps with a better choice of where to add a non-voter if the voter constraints are temporarily unsatisfiable. This was motivated by looking at the config in cockroachdb#106559 (though this is unrelated to the bug there). Informs cockroachdb#103320 Epic: CRDB-25222 Release note: None
111918: mma: more normalization of constraints r=kvoli a=sumeerbhola

Informs #103320
Epic: CRDB-25222
Release note: None

Co-authored-by: sumeerbhola
Describe the problem
When there are the correct number of voters/non-voters and every existing replica is necessary to satisfy some constraint, the allocator may not satisfy any remaining constraints, even though it should.
Consider the example from #98020
Node Localities
Span config:
Existing replicas:
There are the correct number of voters/replicas. All existing replicas are necessary (from the allocator's perspective) to satisfy the existing constraints, including the non-voter on n3.
n3 should be promoted to a voter, then either n5 (us-central-1) or n8 (eu-west-1) should be demoted to a non-voter. i.e., a rebalance with a promotion and demotion.
However this will never occur, as we get stuck on:
cockroach/pkg/kv/kvserver/allocator/allocatorimpl/allocator_scorer.go
Lines 1644 to 1648 in 5e6cc9e
This checks whether the existing replica (necessary=true) ranks lower than the replacement (necessary=true). Because both are necessary, the comparison falls through to the remaining candidate inputs (diversity score). If the replacement's diversity score is not greater than the existing replica's, we will never rebalance to satisfy the constraint.
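A minimal sketch of why this comparison gets stuck, under stated assumptions: the `scored` type and the `less` and `shouldRebalance` functions are hypothetical and only model the two inputs discussed here (`necessary` and the diversity score), not the real scorer code in `allocator_scorer.go`.

```go
package main

import "fmt"

// scored is a hypothetical, reduced view of a rebalance candidate.
type scored struct {
	necessary      bool
	diversityScore float64
}

// less reports whether a ranks strictly below b, comparing necessary first
// and diversityScore second.
func less(a, b scored) bool {
	if a.necessary != b.necessary {
		// The candidate that is not necessary ranks lower.
		return !a.necessary
	}
	return a.diversityScore < b.diversityScore
}

// shouldRebalance models the check described above: a move is only
// considered when the existing replica ranks strictly below the replacement.
func shouldRebalance(existing, replacement scored) bool {
	return less(existing, replacement)
}

func main() {
	// Both sides are necessary and have equal diversity, so the
	// comparison ties and no rebalance is proposed.
	existing := scored{necessary: true, diversityScore: 0.5}
	replacement := scored{necessary: true, diversityScore: 0.5}
	fmt.Println(shouldRebalance(existing, replacement))
}
```

Nothing in these two inputs distinguishes a replacement that would fix a voter constraint from one that would not, which is the gap the issue describes.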
To Reproduce
#106548 reproduces the no-rebalance-target behavior, both using the simulator and more directly when calling `Allocator.Rebalance(Non)Voter`.
Expected behavior
Rebalance action returned with a promotion and demotion.
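As a rough illustration of the expected action, a voter rebalance whose target already holds a non-voter can be expressed as a promotion plus a demotion instead of an add plus a remove. The types and the `changesForVoterRebalance` function below are hypothetical and written for this sketch; they are not the allocator's or planner's real API.

```go
package main

import "fmt"

// replicaType distinguishes voting from non-voting replicas.
type replicaType int

const (
	voter replicaType = iota
	nonVoter
)

// change is a hypothetical description of one replication change.
type change struct {
	op    string // "add-voter", "remove-voter", "promote", or "demote"
	store string
}

// changesForVoterRebalance plans moving a voter from store `from` to store
// `to`. If `to` already holds a non-voter, the move becomes a swap of
// replica types, so no new replica (and no snapshot) is needed.
func changesForVoterRebalance(replicas map[string]replicaType, from, to string) []change {
	if t, ok := replicas[to]; ok && t == nonVoter {
		return []change{{"promote", to}, {"demote", from}}
	}
	return []change{{"add-voter", to}, {"remove-voter", from}}
}

func main() {
	// Two-store placement: a(non-voter), b(voter).
	replicas := map[string]replicaType{"a": nonVoter, "b": voter}
	fmt.Println(changesForVoterRebalance(replicas, "b", "a"))
}
```

After the swap the replica count per store is unchanged, so any all-replica constraint that was satisfied before remains satisfied.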
Environment:
Reproduces on master. Assume all versions affected.
Jira issue: CRDB-29615