
Demote single failed server first during reconciliation with an odd number of voters #55

Conversation

kubawi
Contributor

@kubawi kubawi commented Sep 11, 2024

This PR updates the autopilot reconciliation logic to demote a single unhealthy voter before carrying out promotions in an (Autopilot).reconcile call, when there is an odd number of voters and at least one of them is unhealthy.

The purpose of this change is to avoid increasing the quorum size when reconciling in the presence of unhealthy voters while the overall number of voters is odd. That temporary increase in quorum size makes the cluster vulnerable to failure in the scenario described below.

In Vault, we've encountered an issue where losing a voter, then a non-voter in the same redundancy zone in short succession, would lead to the cluster becoming unhealthy (losing quorum).

In the scenario above, the following would happen before the change in this PR (write-up stolen from @banks; the quorum arithmetic is sketched in code after the list):

  • Given a cluster with 3 redundancy zones and 6 nodes.
  • When a voter fails it is “demoted”, but what this actually means is that Autopilot will first promote a follower (with a raft reconfiguration), which increases the quorum from 2 to 3. We now have 4 voters, but only 3 of them are healthy.
  • At some time later—once the new voter is healthy—AP should remove the old voter and actually “demote” them to non-voter.
  • But if any of the healthy voters fails before the failed one is demoted, we end up with 4 voters in the configuration (quorum 3) but only 2 of them available, and so the leader is forced to step down, as it no longer has a majority of voters.
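To make the arithmetic above concrete, here is a minimal standalone sketch of Raft's majority rule (quorumSize is a hypothetical helper for illustration only, not part of this library's API):

```go
package main

import "fmt"

// quorumSize returns the Raft majority for a given number of voters.
func quorumSize(voters int) int {
	return voters/2 + 1
}

func main() {
	fmt.Println(quorumSize(3)) // 2: the healthy 3-voter starting point
	fmt.Println(quorumSize(4)) // 3: after promoting a replacement while the failed voter is still in the config
	// If one of the 3 remaining healthy voters now fails, only 2 of the 4
	// voters are reachable, which is below the quorum of 3, so the leader
	// steps down.
}
```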

@kubawi kubawi self-assigned this Sep 11, 2024
@kubawi kubawi changed the title from “Demote failed servers first during reconciliation.” to “Demote failed servers first during reconciliation” Sep 12, 2024
Member

@banks banks left a comment


Great job @kubawi!

This looks like a clean implementation of the approach discussed where we ensure that we demote unhealthy servers first.

This implementation also implicitly ensures that we only change by one node in a given raft config change (actually our Raft library ensures that by only supporting a single Add/Remove node call at once).

One thing I'm not clear on: can this code change add a new way to violate min_quorum, i.e. demote more than f unhealthy servers before promoting their replacements? I think that is calculated elsewhere and this code just applies the result, but it's worth checking that those calculations don't implicitly rely on promotions always happening first (e.g. they might allow a demotion that would violate min_quorum because they also include a promotion and assume it would happen first). That would be bad!
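For context, the kind of guard I have in mind would look roughly like the sketch below. This is an assumption about how such a check could be written, not the library's actual min_quorum calculation, and canDemote is a made-up name:

```go
// canDemote is a hypothetical guard: refuse a demotion if it would leave
// fewer voters than min_quorum. The real calculation lives elsewhere in the
// library and may also account for promotions planned in the same round.
func canDemote(votersAtStartOfRound, alreadyDemotedThisRound, minQuorum int) bool {
	return votersAtStartOfRound-(alreadyDemotedThisRound+1) >= minQuorum
}
```

For example, starting with 5 voters and min_quorum of 3, the first two demotions in a round would pass (leaving 4 and then 3 voters) but a third would be refused.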

I'm wondering if we need to do something more sophisticated even when min_quorum isn't set 🤔.

For example, take a hypothetical case where there are 10 nodes across 5 RZs and so 5 voters:

RZ1 RZ2 RZ3 RZ4 RZ5
A* B* C* D* E*
F G H I J

If two zones went down at the same time (say 4 and 5) we could have D* and E* being demoted at the same time.

If we just demote them both first and then promote two others (say F and G), there is a point after the demotions where there are only 3 voters, which changes the quorum from 3 of 5 to 2 of 3, and that is probably not what we want! I'm not sure if min_quorum being set to 3 would prevent that in this case, but it should! But even assuming that min_quorum isn't set, it would be better in this case if we instead did:

  1. Demote E
  2. Promote F
  3. Demote D
  4. Promote G

That way there would always be at least 4 voters in the configuration so we wouldn't change the quorum size.
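A rough sketch of that interleaving, purely to illustrate the ordering (the function and the applyChange callback are assumptions for this example, not the library's real code):

```go
package sketch

import "github.com/hashicorp/raft"

// interleave applies at most one demotion followed by at most one promotion
// per iteration, so the voter count never drops more than one below where it
// started before the matching promotion lands.
func interleave(demotions, promotions []raft.ServerID,
	applyChange func(op string, id raft.ServerID) error) error {
	for len(demotions) > 0 || len(promotions) > 0 {
		if len(demotions) > 0 {
			if err := applyChange("demote", demotions[0]); err != nil {
				return err
			}
			demotions = demotions[1:]
		}
		if len(promotions) > 0 {
			if err := applyChange("promote", promotions[0]); err != nil {
				return err
			}
			promotions = promotions[1:]
		}
	}
	return nil
}
```

With demotions of {E, D} and promotions of {F, G}, this produces exactly the Demote E, Promote F, Demote D, Promote G order listed above.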

The question is... would that re-introduce the same problem this is trying to fix if the health status of nodes is stale? In other words, what if the leader still sees I and J as healthy even though those zones have died?

Well then it would try:

  1. Demote E : OK, committed by {A, B, C}, voters now {A, B, C, D}
  2. Promote J: OK, committed by {A, B, C}, voters now {A, B, C, D, J} (but D and J are dead)
  3. Demote D: OK, committed by {A, B, C}, voters now {A, B, C, J}
  4. Promote I: OK, committed by {A, B, C}, voters now {A, B, C, I, J}

The cluster is still OK at this point since we still have 3 out of 5 voters healthy even though we just accidentally promoted two unhealthy servers.

Pretty soon we'd notice they are unhealthy and on a future round (say in 10 seconds) we'd demote I and J and promote two healthy standbys from other zones instead.

So I think it is safe to interleave the demotions and promotions, and it has the benefit that it will not reduce the actual quorum size at any point.

... although that's assuming we started off with an odd number of voters, which should generally be the case because it doesn't make sense to deploy an even number of voters with a majority quorum scheme. But it's possible we are in an interim state with an even number of voters, so we should consider that case. This could happen because of misconfiguration, or because we started with 5 zones and one died earlier and isn't back up yet, etc.

Let's say our cluster was meant to span 5 zones, but one failed and we've been operating for a while in the following state (* means voter). Let's also assume we have a bunch of other servers unhealthy (struck through below).

RZ1 RZ2 RZ3 RZ4 RZ5
A* B* C* D* ~~E~~
F* ~~G~~ ~~H~~ ~~I~~ ~~J~~

Now the same issue that took out G, H and I strikes F, which becomes unhealthy. On the next autopilot run we demote the unhealthy F but have no healthy candidates to promote (this is true both before and after this PR's change).

The cluster is still up, with 4 healthy voters (down from 5)! We could even tolerate one more failing. Current state:

RZ1 RZ2 RZ3 RZ4 RZ5
A* B* C* D* ~~E~~
~~F~~ ~~G~~ ~~H~~ ~~I~~ ~~J~~

Now one of our standby nodes (G) restarts, recovers, and sends a heartbeat to the leader just before the next AP reconcile. Sadly, we also lose RZ4 at exactly the same moment, so now D is unhealthy. AP now sees:

RZ1 RZ2 RZ3 RZ4 RZ5
A* B* C* ~~D*~~ ~~E~~
~~F~~ G ~~H~~ ~~I~~ ~~J~~

Before this PR, it would promote G first, keeping at least 4 voters throughout, before demoting D.

Now (even if we did the alternating thing I suggested above) we would demote D, taking the cluster down to only 3 voters and so shrinking the quorum size to 2, before promoting G, which brings it back up. 🤔

I think we need to ensure that we can't violate min_quorum with this change - that seems to be the most important point. It's not obvious to me yet if we can or not.

The second part is: even if min_quorum isn't set, can we still avoid reducing the quorum size when demoting unhealthy servers, by only demoting one unhealthy server at a time and only if there is an odd number of voters to start with?

I think that would be the most conservative approach:

  • Rather than demoting all unhealthy servers first, demote at most one unhealthy voter each round before promotions (not sure if it's OK to promote more than one here, though; need to think)
  • Don't demote before promote if there is an even number of voters.

That would avoid introducing any new case where quorum size is changed while still solving the known issue of an RZ failure causing outage.

reconcile.go (outdated inline comment, resolved)
@kubawi kubawi requested a review from banks November 18, 2024 11:46
@kubawi kubawi marked this pull request as ready for review November 18, 2024 12:49
@kubawi kubawi requested a review from a team as a code owner November 18, 2024 12:49
Member

@banks banks left a comment


Looking great Kuba. A couple of thinkers inline. Not sure if they need changes yet but curious to hear what you and others think.

reconcile.go (outdated inline comment, resolved)
reconcile.go (outdated inline comment, resolved)
reconcile_test.go (outdated inline comment, resolved)
reconcile_test.go (outdated inline comment, resolved)
@kubawi kubawi requested review from a team as code owners November 28, 2024 20:25
… go on to carry out promotions in the same round
Member

@mkeeler mkeeler left a comment


I am pretty sure this change will break a few things:

  • enterprise auto-upgrades
  • unassisted replacement style CE upgrades
  • RZ reconciliation when there are not enough healthy voters to promote.

The core issue is that it's possible for autopilot to remain in a state where it wants to demote an unhealthy voter but where there is an even number of nodes. The state machine will be stuck until one of the following occurs:

  • An out-of-band change is made to the Raft configuration to alter suffrage.
  • Other servers are added that need promotion, and applying those promotions results in a cluster with an odd number of voters.

For auto-upgrades, it's pretty much guaranteed during normal healthy operation to temporarily be in a state with an even number of voters. If at that point in time you want to demote one unhealthy server on the version being upgraded from, it would not happen and the upgrade would be stuck indefinitely.

For vanilla CE replacement-style upgrades we are in a similar boat. If an old server becomes unhealthy as new servers are added, you can end up with an even number of voters, nothing left to promote, and no way to demote anything that has failed.

Consider the case where 3 new servers are added to an existing cluster of 3 voters, one of which is unhealthy. The suffrage would transition like so (healthy voters / unhealthy voters, per server version):

2/1 (old) -> 2/1 (old) + 3/0 (new) -> 0/1 (old) + 3/0 (new)

At that point we still have an even number of nodes, and we still need to demote the old server (it's never coming back).

I think the right solution then is to change the applyDemotions func to operate in one of two modes (a rough sketch follows the two lists below):

  1. Demote a single unhealthy voter (called when num voters is odd prior to any promotions)
  2. Demote everything as desired. (called only when there were no promotions to be done)

That is in contrast to the code as implemented which will operate in one of these two modes:

  1. Demote a single unhealthy voter (called when num voters is odd prior to any promotions)
  2. Demote all healthy servers as desired. (called only when there were no promotions to be done).
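Roughly what I mean, as a sketch. Everything here is a stand-in for the library's real types except the demoteSingleFailedServer flag, which matches the one visible in the diff:

```go
package sketch

import "github.com/hashicorp/raft"

// applyDemotionsSketch shows the two proposed modes side by side; it is an
// illustration of the intended behaviour, not the real applyDemotions code.
func applyDemotionsSketch(requested []raft.ServerID, healthy map[raft.ServerID]bool,
	demoteSingleFailedServer bool, demote func(raft.ServerID) error) error {
	if demoteSingleFailedServer {
		// Mode 1: demote at most one unhealthy voter, then stop.
		for _, id := range requested {
			if !healthy[id] {
				return demote(id)
			}
		}
		return nil
	}
	// Mode 2: demote everything requested, regardless of health
	// (whereas the code as implemented only demotes the healthy ones here).
	for _, id := range requested {
		if err := demote(id); err != nil {
			return err
		}
	}
	return nil
}
```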

@banks
Member

banks commented Dec 4, 2024

@mkeeler yeah, I also picked up on the split between unhealthy and healthy, as well as the limit of 1 leading to asymmetric situations. Previously, Kuba thought through all the examples I'd come up with, and as far as we can tell they would eventually work out OK.

It sounds like you've thought of others that wouldn't, although I've not totally understood them yet.

But in general, it seems much safer and easier to reason about with the change that you propose. To state it a different way:

  • When the number of voters is odd, use the new behaviour where at most one unhealthy server is demoted first, then promotions...
  • When the number of voters is even, use the same logic as before this PR, i.e. promote all then demote all.

Even then, I still feel like the asymmetry of only demoting one unhealthy node on odd is strange. I think it's OK because it will make the cluster revert to "demote everything" on the next round (assuming no other cluster changes happen out of band). But it would still feel easier to reason about if we made this change the most minimal one that solves the issue, which seems to be (sketched in code after the list):

  1. If Odd(numVoters):
    • attempt to demote at most 1 unhealthy
    • attempt promotions, if we do some return
    • attempt all other demotions that weren't that first unhealthy one
  2. If Even(numVoters):
    • Exactly the same behaviour as before the PR (all promotions, all demotions).
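As a sketch of that ordering only (the helper names are placeholders, not the actual reconcile code):

```go
package sketch

// reconcileOrderingSketch captures the ordering described above. Each helper
// reports whether it changed anything; they stand in for the real promotion
// and demotion logic.
func reconcileOrderingSketch(numVoters int,
	demoteOneUnhealthy, promoteAll, demoteRemaining func() (bool, error)) error {
	if numVoters%2 == 1 {
		// Odd: demote at most one unhealthy voter first...
		if _, err := demoteOneUnhealthy(); err != nil {
			return err
		}
		// ...then attempt promotions; if any happened, stop for this round...
		if changed, err := promoteAll(); err != nil || changed {
			return err
		}
		// ...otherwise carry out the remaining demotions.
		_, err := demoteRemaining()
		return err
	}
	// Even: exactly the pre-PR behaviour, all promotions then, if none were
	// needed, all demotions.
	if changed, err := promoteAll(); err != nil || changed {
		return err
	}
	_, err := demoteRemaining()
	return err
}
```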

@kubawi what do you think? Was there some reason that the more minimal change was incorrect or more complex to implement?

@kubawi
Contributor Author

kubawi commented Dec 5, 2024

@banks Embarrassingly enough, I cannot really recall why I picked this approach over a simpler one. It could have been that that's where my head went first and then I didn't find any reasons not to use this approach until Matt pointed out the stuck auto-upgrade scenario above.

@mkeeler Thanks a lot for the review. I completely missed this scenario! 🤦 I will update the PR shortly to follow your and Paul's recommendations.

@mkeeler
Member

mkeeler commented Dec 5, 2024

@banks I might be able to describe the issue better.

The first point is that auto-upgrades will always result in a state where there is an even number of voters. The auto-upgrade flow is generally:

  • idle -> await-new-voters
    • State transition happens once server with a newer version has been added.
    • This state is informational and is meant to signal to the operator that autopilot is still waiting for more new version servers to be added.
  • await-new-voters -> promoting
    • State transition happens once the number of healthy potential voters on the new version is greater than or equal to the number of voters on the old version.
    • Note that whether the old version voters are healthy doesn't matter.
    • When in this state, autopilot will be promoting the new version servers (still takes into account read replicas and redundancy zones so that not all new version servers are promoted)
  • promoting -> demoting
    • State transition happens once the number of voters on the new version is greater than or equal to the number of voters on the old version.
    • The Raft changes the Promoter will ask autopilot to perform will be demotions only, with no promotions.

If everything is healthy, the new code would:

  • not execute demotion of unhealthy voters because the server count is even already.
  • will not execute promotions because there are none being requested by the Promoter
  • will execute demotion of healthy voters on the old version. This will demote all 3 within one call to applyDemotions.

Now if there is one unhealthy voter on the old version then the new code would:

  • not execute demotion of unhealthy voters because the server count is even already.
  • will not execute promotions because there are none being requested by the Promoter
  • will execute demotion of healthy voters on the old version. This will demote 2 out of 3 within one call to applyDemotions because the remaining voter is unhealthy.

Now the cluster will be in a state with 3 healthy voters on the new version and 1 unhealthy voter on the old version. The next reconciliation round would result in:

  • not executing demotions of an unhealthy server because there are an even number of servers
  • not executing promotions because there are none.
  • demoting all healthy servers as requested. This will not actually demote anything, since none of the servers requested for demotion are healthy.

Then auto-upgrades will be stuck in this state until the old server becomes healthy, or another voter gets added making the voter count odd and allowing the demotion to succeed.
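To spell out the parity problem in that walkthrough, here is a hypothetical helper mirroring the behaviour described above (not the library's code):

```go
package sketch

// demoteFailedFirst reports whether the "demote a single failed voter before
// promotions" path would run under the behaviour described above.
func demoteFailedFirst(numVoters, numFailedVoters int) bool {
	return numVoters%2 == 1 && numFailedVoters > 0
}

// In the auto-upgrade walkthrough we end up with 3 healthy new-version voters
// plus 1 unhealthy old-version voter: demoteFailedFirst(4, 1) == false, so the
// unhealthy voter never gets demoted and the upgrade stalls until the voter
// count changes.
```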

@kubawi kubawi requested review from banks and mkeeler December 5, 2024 20:32
mkeeler previously approved these changes Dec 6, 2024
@kubawi kubawi changed the title from “Demote failed servers first during reconciliation” to “Demote single failed server first during reconciliation with an odd number of voters” Dec 6, 2024
banks previously approved these changes Dec 6, 2024
Member

@banks banks left a comment


Yes @kubawi, such awesome work.

I realised a few of these comments were from a while back but I forgot to submit them, so I'm sorry for that. They're all optional code-readability suggestions or no-op commentary of me convincing myself of things!

reconcile_test.go (outdated inline comment, resolved)
reconcile.go (inline comment, resolved)
reconcile.go (outdated inline comment, resolved)
Comment on lines +175 to +176
// We only want to demote one failed server at a time to prevent violating the minimum quorum setting.
if demoteSingleFailedServer {
Member


What if we failed to demote it above? i.e. demoted = false 🤔 . I'm not sure under what circumstances that could happen.

I think the answer is that we should probably still limit ourselves to 1, because the most likely way I can think of that we could fail to demote is that we time out or lose the connection before we find out whether it worked. In that case we can't assume that err != nil means it actually failed, so continuing to attempt another unhealthy demotion could violate the "only one" limit.

I guess the result is that we'd return the error, stop the reconcile loop, and then try again in 10 seconds.

Yeah, overall I agree this is the most sensible approach, just had to think it through!

Comment on lines +280 to +294
m.On("DemoteVoter",
raft.ServerID("33333333-3333-3333-3333-333333333333"),
uint64(0),
time.Duration(0),
).Return(&raftIndexFuture{}).Once()
m.On("DemoteVoter",
raft.ServerID("44444444-4444-4444-4444-444444444444"),
uint64(0),
time.Duration(0),
).Return(&raftIndexFuture{}).Once()
m.On("DemoteVoter",
raft.ServerID("55555555-5555-5555-5555-555555555555"),
uint64(0),
time.Duration(0),
).Return(&raftIndexFuture{}).Once()
Member


It looked off to me at first to see us demoting a full half of the cluster in one round 😱. But I guess this test isn't setting a min_quorum, so this is OK because it's actually doing each demotion as a separate config change and reducing the quorum as it goes, ending up with a cluster of 3 healthy servers in a quorum, right?

Contributor Author


Yep, that was my thinking. These would all be done in a single reconcile loop (this is the behaviour from before this PR anyway), but as separate raft reconfigurations. Mainly, though, I wanted to ensure that we have a test case asserting on multiple demotions, regardless of health status, happening in a single reconcile round that starts with an even number of voters.

…tions done in the same reconciliation round as the single demotion of a failed server
@kubawi kubawi dismissed stale reviews from banks and mkeeler via e7d2fc3 December 9, 2024 18:06
@kubawi kubawi requested review from banks and raskchanky and removed request for raskchanky December 9, 2024 18:10
banks previously approved these changes Dec 10, 2024
Member

@banks banks left a comment


Nice!

All looking good to me. Great job Kuba!

raskchanky previously approved these changes Dec 11, 2024
reconcile.go (outdated inline comment, resolved)
reconcile.go (outdated inline comment, resolved)
Co-authored-by: Josh Black <[email protected]>
@kubawi kubawi dismissed stale reviews from raskchanky and banks via 4375751 December 11, 2024 10:45
Co-authored-by: Josh Black <[email protected]>
@kubawi kubawi requested a review from banks December 11, 2024 11:15
Member

@banks banks left a comment


Good typo fixes!

@kubawi kubawi merged commit 5343f10 into master Dec 11, 2024
3 checks passed
@kubawi kubawi deleted the kubawi/VAULT-21282/demote-failed-servers-before-promotions-in-reconcile branch December 11, 2024 14:11