Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allocator: voter constraint never satisfied when there are a correct number of replicas and all existing replicas are necessary #106559

Closed
kvoli opened this issue Jul 10, 2023 · 3 comments · Fixed by #111609
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. db-cy-23 T-kv KV Team

Comments

@kvoli
Copy link
Collaborator

kvoli commented Jul 10, 2023

Describe the problem

When there are the correct number of voters/non-voters and all existing replicas are necessarily satisfying some constraint, the allocator may not satisfy any remaining constraints - even though it should.

Consider the example from #98020

Node Localities

# region -> node_id
eu-west-1
  └── [8 9 10]
us-central-1
  └── [5 6 7]
us-east-1
  └── [1 2]
us-west-1
  └── [3 4]

Span config:

num_replicas=6 
num_voters=5 
constraints={'+region=eu-west-1':1,'+region=us-central-1':1,'+region=us-east-1':1,'+region=us-west-1':1} 
voter_constraints={'+region=us-west-1':2,'+region=us-east-1':2}

Existing replicas:

voters=[n1,n2,n4,n5,n8]
non_voters=[n3]

There are the correct number of voters/replicas. All existing replicas are necessary (from allocators perspective) to satisfy the existing constraints—including the non-voter on n3.

n3 should be promoted to a voter, then either n5 (us-central-1) or n8 (eu-west-1) should be demoted to a non-voter. i.e., a rebalance with a promotion and demotion.

However this will never occur, as we get stuck on:

if !cand.less(existing) {
// If `cand` is not worse than `existing`, add it to the list.
comparableCands = append(comparableCands, cand)
if !needRebalanceFrom && !needRebalanceTo && existing.less(cand) {

Which checks if the existing replica (necessary=true) is less than the replacement (necessary=true). If the replacement candidate inputs (diversity score) is not greater than the existing, we will never rebalance to satisfy the constraint.

To Reproduce

#106548 reproduces the no rebalance target behavior. Reproduction using the simulator and more directly when calling Allocator.Rebalance(Non)Voter.

Expected behavior

Rebalance action returned with a promotion and demotion.

Environment:
Reproduces on master. Assume all versions affected.

Jira issue: CRDB-29615

@kvoli kvoli added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-distribution Relating to rebalancing and leasing. T-kv KV Team labels Jul 10, 2023
@kvoli kvoli changed the title allocator: constraint never satisfied when there is correct number of non-voters/voters and all existing replicas are necessary allocator: constraint never satisfied when there are a correct number of non-voters/voters and all existing replicas are necessary Jul 10, 2023
@kvoli
Copy link
Collaborator Author

kvoli commented Jul 10, 2023

I wonder if this would work with a normalized span config, where num_replicas=sum(num_replicas) in the constraints.

@kvoli kvoli changed the title allocator: constraint never satisfied when there are a correct number of non-voters/voters and all existing replicas are necessary allocator: constraint never satisfied when there are a correct number of replicas and all existing replicas are necessary Jul 11, 2023
@kvoli
Copy link
Collaborator Author

kvoli commented Jul 11, 2023

I can't see how this issue is easily fixed. When there are the correct voters and replicas, considerRebalance is used to fix constraint unsatisfaction.

This case does not work because the voter we should demote is satisfying an all replica constraint conjunction. The rebalance code assumes the rebalance-from store will be removed entirely, so an all-replica constraint will no longer be satisfied.

We could hack in a check for this specific case when rebalancing voters, allowing the voter to be removed - which then triggers up-replication to add a non-voter. This is wasteful, we could instead demote the existing voter and promote the existing non-voter, incurring 0 snapshots.

Ideally, something like the roleSwap in allocator2 would be preferable:

// Constraint unsatisfaction can happen if constraints or store attributes
// changed. The presence of two sets of constraints-conjunctions (CCs),
// complicates matters. Consider CC's A, B, C, that constrain all replicas and
// require 2 replicas each. And consider CC's A', B', C' that constrain voters
// and require 1 replica each. And assume that A' permits the same set as A,
// ... (not that we know that in the code). The current state is 2 non-voters
// in A, 1 voter and 1 non-voter in B, and 2 voters in C. We can't find
// anything unsatisfied in the regular constraints. But have an unsatisfied
// (and oversatisfied) voter constraint. We can decide to satisfy A' and
// remove a voter from C'. But removing a replica from C' will also unsatisfy
// C. What this requires is switching a voter in C to non-voter and switching
// a non-voter in A' to voter.
// REQUIRES: !notEnoughVoters() && !notEnoughNonVoters()
func (rac *rangeAnalyzedConstraints) candidatesForRoleSwapForConstraints() (

It would also be preferable to create an AllocatorAction for this situation, so that rebalancing isn't depended on for constraint satisfaction #90110.

@kvoli kvoli removed their assignment Aug 3, 2023
@kvoli kvoli changed the title allocator: constraint never satisfied when there are a correct number of replicas and all existing replicas are necessary allocator: voter constraint never satisfied when there are a correct number of replicas and all existing replicas are necessary Aug 7, 2023
@kvoli kvoli self-assigned this Aug 15, 2023
@kvoli kvoli added the branch-master Failures and bugs on the master branch. label Sep 12, 2023
kvoli added a commit to kvoli/cockroach that referenced this issue Oct 3, 2023
Add a simulator reproduction for the voter constraint violation bug
tracked in cockroachdb#106559.

The reproduction uses 10 ranges, copying the locality setup seen in the
linked issue. Initially, the span config specifies the final replica
position seen in the issue, then after 15 minutes switches the span
config to be identical to the issue.

As expected, the conformance assertion fails due to voter constraint
violations.

Part of: cockroachdb#106559
Epic: None
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Oct 3, 2023
Previously, it was possible for a satisfiable voter constraint to never
be satisfied when:

1. There were a correct number of `VOTER` and `NON_VOTER` replicas.
2. All existing replicas were necessary to satisfy a replica constraint,
   or voter constraint.

The allocator relies on the `RebalanceVoter` path to resolve voter
constraint violations when there are a correct number of each replica
type.

Candidates which are `necessary` to satisfy a constraint are
ranked higher as rebalance targets than those which are not. Under most
circumstances this leads to constraint conformance. However, when every
existing replica is necessary to satisfy a replica constraint, and a
voter constraint is unsatisfied -- `RebalanceVoter` would not consider
swapping a `VOTER` and `NON_VOTER` to satisfy the constraint.

For example, consider a setup where there are two stores, one in
locality `a` and the other `b`, where some range has the following
config and initial placement:

```
replicas          = a(non-voter) b(voter)
constraints       = a:1 b:1
voter_constraints = a:1
```

In this example, the only satisfiable placement is `a(voter)`
`b(non-voter)`, which would require promoting `a(non-voter) ->
a(voter)`, and demoting `b(voter)->b(non-voter)`. However, both are
necessary to satisfy `constraints` leading to no rebalance occurring.

Add an additional field to the allocator candidate struct, which is used
to sort rebalance candidates. The new field, `voterNecessary` is sorted
strictly after `necessary`, but before `diversityScore`.

The `voterNecessary` field can be true only when rebalancing voters, and
when the rebalance candidate is necessary to satisfy a voter constraint,
the rebalance candidate already has a non-voter, and the existing voter
is not necessary to satisfy *any* voter constraint.

Note these rebalances are turned into swaps (promotion and demotion) in
`plan.ReplicationChangesForRebalance`, so incur no snapshots.

Fixes: cockroachdb#98020
Fixes: cockroachdb#106559
Fixes: cockroachdb#108127

Release note (bug fix): Voter constraints which were never satisfied due
to all existing replicas being considered necessary to satisfy a replica
constraint, will now be satisfied by promoting existing non-voters.
kvoli added a commit to kvoli/cockroach that referenced this issue Oct 6, 2023
Add a simulator reproduction for the voter constraint violation bug
tracked in cockroachdb#106559.

The reproduction uses 10 ranges, using a simplified locality setup to
that seen in the linked issue. Initially, the span config specifies the
final replica position seen in the issue, then after 5 minutes switches
the span config to require a voter <-> non-voter swap between the
localities.

As expected, the conformance assertion fails due to voter constraint
violations.

A datadriven simulator command `SetNodeLocality` is added to simplify
the reproduction.

Part of: cockroachdb#106559
Epic: None
Release note: None
craig bot pushed a commit that referenced this issue Oct 6, 2023
111609: allocator: prioritize non-voter promotion to satisfy voter constraints r=sumeerbhola a=kvoli

Previously, it was possible for a satisfiable voter constraint to never
be satisfied when:

1. There were a correct number of `VOTER` and `NON_VOTER` replicas.
2. All existing replicas were necessary to satisfy a replica constraint,
   or voter constraint.

The allocator relies on the `RebalanceVoter` path to resolve voter
constraint violations when there are a correct number of each replica
type.

Candidates which are `necessary` to satisfy a constraint are
ranked higher as rebalance targets than those which are not. Under most
circumstances this leads to constraint conformance. However, when every
existing replica is necessary to satisfy a replica constraint, and a
voter constraint is unsatisfied -- `RebalanceVoter` would not consider
swapping a `VOTER` and `NON_VOTER` to satisfy the constraint.

For example, consider a setup where there are two stores, one in
locality `a` and the other `b`, where some range has the following
config and initial placement:

```
replicas          = a(non-voter) b(voter)
constraints       = a:1 b:1
voter_constraints = a:1
```

In this example, the only satisfiable placement is `a(voter)`
`b(non-voter)`, which would require promoting `a(non-voter) ->
a(voter)`, and demoting `b(voter)->b(non-voter)`. However, both are
necessary to satisfy `constraints` leading to no rebalance occurring.

Add an additional field to the allocator candidate struct, which is used
to sort rebalance candidates. The new field, `voterNecessary` is sorted
strictly after `necessary`, but before `diversityScore`.

The `voterNecessary` field can be true only when rebalancing voters, and
when the rebalance candidate is necessary to satisfy a voter constraint,
the rebalance candidate already has a non-voter, and the existing voter
is not necessary to satisfy *any* voter constraint.

Note these rebalances are turned into swaps (promotion and demotion) in
`plan.ReplicationChangesForRebalance`, so incur no snapshots.

Fixes: #98020
Fixes: #106559
Fixes: #108127

Release note (bug fix): Voter constraints which were never satisfied due
to all existing replicas being considered necessary to satisfy a replica
constraint, will now be satisfied by promoting existing non-voters.

Co-authored-by: Austen McClernon <[email protected]>
@craig craig bot closed this as completed in ed51752 Oct 6, 2023
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Oct 6, 2023
We now also do some normalization of constraints (in addition to
voter constraints). This helps with a better choice of where to add a
non-voter if the voter constraints are temporarily unsatisfiable.
This was motivated by looking at the config in cockroachdb#106559 (though this
is unrelated to the bug there).

Informs cockroachdb#103320

Epic: CRDB-25222

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Oct 18, 2023
We now also do some normalization of constraints (in addition to
voter constraints). This helps with a better choice of where to add a
non-voter if the voter constraints are temporarily unsatisfiable.
This was motivated by looking at the config in cockroachdb#106559 (though this
is unrelated to the bug there).

Informs cockroachdb#103320

Epic: CRDB-25222

Release note: None
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 21, 2025
We now also do some normalization of constraints (in addition to
voter constraints). This helps with a better choice of where to add a
non-voter if the voter constraints are temporarily unsatisfiable.
This was motivated by looking at the config in cockroachdb#106559 (though this
is unrelated to the bug there).

Informs cockroachdb#103320

Epic: CRDB-25222

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 21, 2025
We now also do some normalization of constraints (in addition to
voter constraints). This helps with a better choice of where to add a
non-voter if the voter constraints are temporarily unsatisfiable.
This was motivated by looking at the config in cockroachdb#106559 (though this
is unrelated to the bug there).

Informs cockroachdb#103320

Epic: CRDB-25222

Release note: None
craig bot pushed a commit that referenced this issue Feb 22, 2025
111918: mma: more normalization of constraints r=kvoli a=sumeerbhola

We now also do some normalization of constraints (in addition to voter constraints). This helps with a better choice of where to add a non-voter if the voter constraints are temporarily unsatisfiable. This was motivated by looking at the config in #106559 (though this is unrelated to the bug there).

Informs #103320

Epic: CRDB-25222

Release note: None

Co-authored-by: sumeerbhola <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. db-cy-23 T-kv KV Team
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants