
Demote single failed server first during reconciliation with an odd number of voters #55

Conversation

kubawi
Contributor

@kubawi kubawi commented Sep 11, 2024

This PR updates the autopilot reconciliation logic to demote a single unhealthy voter before carrying out promotions in an (Autopilot).reconcile call, when there is an odd number of voters and at least one of them is unhealthy.

The purpose of this change is to avoid increasing the quorum size when reconciling in the presence of unhealthy voters while the overall number of voters is odd. That temporary increase in quorum size makes the cluster vulnerable to failure in the scenario described below.

In Vault, we've encountered an issue where losing a voter, then a non-voter in the same redundancy zone in short succession, would lead to the cluster becoming unhealthy (losing quorum).

In the scenario above, the following would happen before the change in this PR (write-up stolen from @banks; the quorum arithmetic is sketched in code after the list):

  • Given a cluster with 3 redundancy zones and 6 nodes.
  • When a voter fails it is “demoted”, but what this actually means is that Autopilot will first promote a follower (with a raft reconfiguration), which increases the quorum from 2 to 3. We now have 4 voters, but only 3 of them are healthy.
  • At some time later—once the new voter is healthy—AP should remove the old voter and actually “demote” them to non-voter.
  • But if any of the healthy voters fails before the failed one is demoted, we end up with 4 voters in the configuration (quorum 3) but only 2 of them available, and so the leader is forced to step down, as it no longer has a majority of voters.
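To make the arithmetic above concrete, here is a minimal standalone sketch of Raft's majority rule (quorumSize is a hypothetical helper for illustration only, not part of this library's API):

```go
package main

import "fmt"

// quorumSize returns the Raft majority for a given number of voters.
func quorumSize(voters int) int {
	return voters/2 + 1
}

func main() {
	fmt.Println(quorumSize(3)) // 2: the healthy 3-voter starting point
	fmt.Println(quorumSize(4)) // 3: after promoting a replacement while the failed voter is still in the config
	// If one of the 3 remaining healthy voters now fails, only 2 of the 4
	// voters are reachable, which is below the quorum of 3, so the leader
	// steps down.
}
```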

@kubawi kubawi self-assigned this Sep 11, 2024
@kubawi kubawi changed the title from “Demote failed servers first during reconciliation.” to “Demote failed servers first during reconciliation” Sep 12, 2024
Member

@banks banks left a comment


Great job @kubawi!

This looks like a clean implementation of the approach discussed where we ensure that we demote unhealthy servers first.

This implementation also implicitly ensures that we only change by one node in a given raft config change (actually our Raft library ensures that by only supporting a single Add/Remove node call at once).

One thing I'm not clear on: can this code change add a new way to violate min_quorum, i.e. demote more than f unhealthy servers before promoting their replacements? I think that is calculated elsewhere and this code just applies the result, but it's worth checking that those calculations don't implicitly rely on promotions always happening first (e.g. they might allow a demotion that would violate min_quorum because they also include a promotion and assume it would happen first). That would be bad!
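For context, the kind of guard I have in mind would look roughly like the sketch below. This is an assumption about how such a check could be written, not the library's actual min_quorum calculation, and canDemote is a made-up name:

```go
// canDemote is a hypothetical guard: refuse a demotion if it would leave
// fewer voters than min_quorum. The real calculation lives elsewhere in the
// library and may also account for promotions planned in the same round.
func canDemote(votersAtStartOfRound, alreadyDemotedThisRound, minQuorum int) bool {
	return votersAtStartOfRound-(alreadyDemotedThisRound+1) >= minQuorum
}
```

For example, starting with 5 voters and min_quorum of 3, the first two demotions in a round would pass (leaving 4 and then 3 voters) but a third would be refused.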

I'm wondering if we need to do something more sophisticated even when min_quorum isn't set 🤔.

For example, take a hypothetical case where there are 10 nodes across 5 RZs and so 5 voters:

RZ1 RZ2 RZ3 RZ4 RZ5
A* B* C* D* E*
F G H I J

If two zones went down at the same time (say 4 and 5) we could have D* and E* being demoted at the same time.

If we just demote them both first and then promote two others (say F and G), there is a point after the demotions where there are only 3 voters, which changes the quorum from 3 of 5 to 2 of 3, and that is probably not what we want! I'm not sure if min_quorum being set to 3 would prevent that in this case, but it should! But even assuming that min_quorum isn't set, it would be better in this case if we instead did:

  1. Demote E
  2. Promote F
  3. Demote D
  4. Promote G

That way there would always be at least 4 voters in the configuration so we wouldn't change the quorum size.
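A rough sketch of that interleaving, purely to illustrate the ordering (the function and the applyChange callback are assumptions for this example, not the library's real code):

```go
package sketch

import "github.com/hashicorp/raft"

// interleave applies at most one demotion followed by at most one promotion
// per iteration, so the voter count never drops more than one below where it
// started before the matching promotion lands.
func interleave(demotions, promotions []raft.ServerID,
	applyChange func(op string, id raft.ServerID) error) error {
	for len(demotions) > 0 || len(promotions) > 0 {
		if len(demotions) > 0 {
			if err := applyChange("demote", demotions[0]); err != nil {
				return err
			}
			demotions = demotions[1:]
		}
		if len(promotions) > 0 {
			if err := applyChange("promote", promotions[0]); err != nil {
				return err
			}
			promotions = promotions[1:]
		}
	}
	return nil
}
```

With demotions of {E, D} and promotions of {F, G}, this produces exactly the Demote E, Promote F, Demote D, Promote G order listed above.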

The question is... would that re-introduce the same problem this is trying to fix if the health status of nodes is stale? In other words, what if the leader still sees I and J as healthy even though those zones have died?

Well then it would try:

  1. Demote E : OK, committed by {A, B, C}, voters now {A, B, C, D}
  2. Promote J: OK, committed by {A, B, C}, voters now {A, B, C, D, J} (but D and J are dead)
  3. Demote D: OK, committed by {A, B, C}, voters now {A, B, C, J}
  4. Promote I: OK, committed by {A, B, C}, voters now {A, B, C, I, J}

The cluster is still OK at this point since we still have 3 out of 5 voters healthy even though we just accidentally promoted two unhealthy servers.

Pretty soon we'd notice they are unhealthy and on a future round (say in 10 seconds) we'd demote I and J and promote two healthy standbys from other zones instead.

So I think it is safe to interleave the demotions and promotions, and it has the benefit that it will not reduce the actual quorum size at any point.

... although that's assuming we started off with an odd number of voters, which should generally be the case because it doesn't make sense to deploy an even number of voters with a majority quorum scheme. But it's possible we are in an interim state with an even number of voters, so we should consider that case. This could happen because of misconfiguration, or because we started with 5 zones and one died earlier and isn't back up yet, etc.

Let's say our cluster was meant to span 5 zones, but one failed and we've been operating for a while in the following state (* means voter). Let's also assume we have a bunch of other servers unhealthy (struck through below).

RZ1 RZ2 RZ3 RZ4 RZ5
A* B* C* D* ~~E~~
F* ~~G~~ ~~H~~ ~~I~~ ~~J~~

Now the same issue that took out G, H and I strikes F, which becomes unhealthy. On the next autopilot run we demote the unhealthy F but have no healthy candidates to promote (this is true both before and after this PR's change).

The cluster is still up, with 4 healthy voters (down from 5)! We could even tolerate one more failing. Current state:

RZ1 RZ2 RZ3 RZ4 RZ5
A* B* C* D* ~~E~~
~~F~~ ~~G~~ ~~H~~ ~~I~~ ~~J~~

Now one of our standby nodes (G) restarts, recovers, and sends a heartbeat to the leader just before the next AP reconcile. Sadly, we also lose RZ4 at exactly the same moment, so now D is unhealthy. AP now sees:

RZ1 RZ2 RZ3 RZ4 RZ5
A* B* C* ~~D*~~ ~~E~~
~~F~~ G ~~H~~ ~~I~~ ~~J~~

Before this PR, it would promote G first, keeping at least 4 voters throughout, before demoting D.

Now (even if we did the alternating thing I suggested above) we would demote D, taking the cluster down to only 3 voters and so shrinking the quorum size to 2, before promoting G, which brings it back up. 🤔

I think we need to ensure that we can't violate min_quorum with this change - that seems to be the most important point. It's not obvious to me yet if we can or not.

The second part is: even if min_quorum isn't set, can we still avoid reducing the quorum size when demoting unhealthy servers, by only demoting one unhealthy server at a time and only if there is an odd number of voters to start with?

I think that would be the most conservative approach:

  • Rather than demoting all unhealthy servers first, demote at most one unhealthy voter each round before promotions (not sure if it's OK to promote more than one here, though; need to think)
  • Don't demote before promote if there is an even number of voters.

That would avoid introducing any new case where quorum size is changed while still solving the known issue of an RZ failure causing outage.

reconcile.go (outdated inline comment, resolved)
@kubawi kubawi requested a review from banks November 18, 2024 11:46
@kubawi kubawi marked this pull request as ready for review November 18, 2024 12:49
@kubawi kubawi requested a review from a team as a code owner November 18, 2024 12:49
Member

@banks banks left a comment


Looking great Kuba. A couple of thinkers inline. Not sure if they need changes yet but curious to hear what you and others think.

reconcile.go (outdated inline comment, resolved)
reconcile.go (outdated inline comment, resolved)
reconcile_test.go (outdated inline comment, resolved)
reconcile_test.go (outdated inline comment, resolved)
@kubawi kubawi requested review from a team as code owners November 28, 2024 20:25
… go on to carry out promotions in the same round
Member

@mkeeler mkeeler left a comment


I am pretty sure this change will break a few things:

  • enterprise auto-upgrades
  • unassisted replacement style CE upgrades
  • RZ reconciliation when there are not enough healthy voters to promote.

The core issue is that it's possible for autopilot to remain in a state where it wants to demote an unhealthy voter but where there is an even number of nodes. The state machine will be stuck until one of the following occurs:

  • An out-of-band change is made to the Raft configuration to alter suffrage.
  • Other servers are added that need promotion, and applying those promotions results in a cluster with an odd number of voters.

For auto-upgrades, it's pretty much guaranteed during normal healthy operation to temporarily be in a state with an even number of voters. If at that point in time you want to demote one unhealthy server on the version being upgraded from, it would not happen and the upgrade would be stuck indefinitely.

For vanilla CE replacement-style upgrades we are in a similar boat. If an old server becomes unhealthy as new servers are added, you can end up with an even number of voters, nothing left to promote, and no way to demote anything that has failed.

Consider the case where 3 new servers are added to an existing cluster of 3 voters, one of which is unhealthy. The suffrage would transition like so (healthy voters / unhealthy voters, per server version):

2/1 (old) -> 2/1 (old) + 3/0 (new) -> 0/1 (old) + 3/0 (new)

At that point we still have an even number of nodes, and we still need to demote the old server (it's never coming back).

I think the right solution then is to change the applyDemotions func to operate in one of two modes (a rough sketch follows the two lists below):

  1. Demote a single unhealthy voter (called when num voters is odd prior to any promotions)
  2. Demote everything as desired. (called only when there were no promotions to be done)

That is in contrast to the code as implemented which will operate in one of these two modes:

  1. Demote a single unhealthy voter (called when num voters is odd prior to any promotions)
  2. Demote all healthy servers as desired. (called only when there were no promotions to be done).
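Roughly what I mean, as a sketch. Everything here is a stand-in for the library's real types except the demoteSingleFailedServer flag, which matches the one visible in the diff:

```go
package sketch

import "github.com/hashicorp/raft"

// applyDemotionsSketch shows the two proposed modes side by side; it is an
// illustration of the intended behaviour, not the real applyDemotions code.
func applyDemotionsSketch(requested []raft.ServerID, healthy map[raft.ServerID]bool,
	demoteSingleFailedServer bool, demote func(raft.ServerID) error) error {
	if demoteSingleFailedServer {
		// Mode 1: demote at most one unhealthy voter, then stop.
		for _, id := range requested {
			if !healthy[id] {
				return demote(id)
			}
		}
		return nil
	}
	// Mode 2: demote everything requested, regardless of health
	// (whereas the code as implemented only demotes the healthy ones here).
	for _, id := range requested {
		if err := demote(id); err != nil {
			return err
		}
	}
	return nil
}
```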

@banks
Member

banks commented Dec 4, 2024

@mkeeler yeah, I also picked up on the split between unhealthy and healthy, as well as the limit of 1 leading to asymmetric situations. Previously, Kuba thought through all the examples I'd come up with, and as far as we can tell they would eventually work out OK.

It sounds like you've thought of others that wouldn't, although I've not totally understood them yet.

But in general, it seems much safer and easier to reason about with the change that you propose. To state it a different way:

  • When the number of voters is odd, use the new behaviour where at most one unhealthy server is demoted first, then promotions...
  • When the number of voters is even, use the same logic as before this PR, i.e. promote all then demote all.

Even then, I still feel like the asymmetry of only demoting one unhealthy node on odd is strange. I think it's OK because it will make the cluster revert to "demote everything" on the next round (assuming no other cluster changes happen out of band). But it would still feel easier to reason about if we made this change the most minimal one that solves the issue, which seems to be (sketched in code after the list):

  1. If Odd(numVoters):
    • attempt to demote at most 1 unhealthy
    • attempt promotions, if we do some return
    • attempt all other demotions that weren't that first unhealthy one
  2. If Even(numVoters):
    • Exactly the same behaviour as before the PR (all promotions, all demotions).
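As a sketch of that ordering only (the helper names are placeholders, not the actual reconcile code):

```go
package sketch

// reconcileOrderingSketch captures the ordering described above. Each helper
// reports whether it changed anything; they stand in for the real promotion
// and demotion logic.
func reconcileOrderingSketch(numVoters int,
	demoteOneUnhealthy, promoteAll, demoteRemaining func() (bool, error)) error {
	if numVoters%2 == 1 {
		// Odd: demote at most one unhealthy voter first...
		if _, err := demoteOneUnhealthy(); err != nil {
			return err
		}
		// ...then attempt promotions; if any happened, stop for this round...
		if changed, err := promoteAll(); err != nil || changed {
			return err
		}
		// ...otherwise carry out the remaining demotions.
		_, err := demoteRemaining()
		return err
	}
	// Even: exactly the pre-PR behaviour, all promotions then, if none were
	// needed, all demotions.
	if changed, err := promoteAll(); err != nil || changed {
		return err
	}
	_, err := demoteRemaining()
	return err
}
```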

@kubawi what do you think? Was there some reason that the more minimal change was incorrect or more complex to implement?

@kubawi
Contributor Author

kubawi commented Dec 5, 2024

@banks Embarrassingly enough, I cannot really recall why I picked this approach over a simpler one. It could have been that that's where my head went first and then I didn't find any reasons not to use this approach until Matt pointed out the stuck auto-upgrade scenario above.

@mkeeler Thanks a lot for the review. I completely missed this scenario! 🤦 I will update the PR shortly to follow your and Paul's recommendations.

@mkeeler
Member

mkeeler commented Dec 5, 2024

@banks I might be able to describe the issue better.

The first point is that auto-upgrades will always result in a state where there is an even number of voters. The auto-upgrade flow is generally:

  • idle -> await-new-voters
    • State transition happens once server with a newer version has been added.
    • This state is informational and is meant to signal to the operator that autopilot is still waiting for more new version servers to be added.
  • await-new-voters -> promoting
    • State transition happens once the number of healthy potential voters on the new version is greater than or equal to the number of voters on the old version.
    • Note that whether the old version voters are healthy doesn't matter.
    • When in this state, autopilot will be promoting the new version servers (still takes into account read replicas and redundancy zones so that not all new version servers are promoted)
  • promoting -> demoting
    • State transition happens once the number of voters on the new version is greater than or equal to the number of voters on the old version.
    • The Raft changes the Promoter will ask autopilot to perform will be demotions only, with no promotions.

If everything is healthy, the new code would:

  • not execute demotion of unhealthy voters because the server count is even already.
  • will not execute promotions because there are none being requested by the Promoter
  • will execute demotion of healthy voters on the old version. This will demote all 3 within one call to applyDemotions.

Now if there is one unhealthy voter on the old version then the new code would:

  • not execute demotion of unhealthy voters because the server count is even already.
  • will not execute promotions because there are none being requested by the Promoter
  • will execute demotion of healthy voters on the old version. This will demote 2 out of 3 within one call to applyDemotions because the remaining voter is unhealthy.

Now the cluster will be in a state with 3 healthy voters on the new version and 1 unhealthy voter on the old version. The next reconciliation round would result in:

  • not executing demotions of an unhealthy server because there are an even number of servers
  • not executing promotions because there are none.
  • demoting all healthy servers as requested. This will not actually demote anything, since none of the servers requested for demotion are healthy.

Then auto-upgrades will be stuck in this state until the old server becomes healthy, or another voter gets added making the voter count odd and allowing the demotion to succeed.
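To spell out the parity problem in that walkthrough, here is a hypothetical helper mirroring the behaviour described above (not the library's code):

```go
package sketch

// demoteFailedFirst reports whether the "demote a single failed voter before
// promotions" path would run under the behaviour described above.
func demoteFailedFirst(numVoters, numFailedVoters int) bool {
	return numVoters%2 == 1 && numFailedVoters > 0
}

// In the auto-upgrade walkthrough we end up with 3 healthy new-version voters
// plus 1 unhealthy old-version voter: demoteFailedFirst(4, 1) == false, so the
// unhealthy voter never gets demoted and the upgrade stalls until the voter
// count changes.
```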

@kubawi kubawi requested review from banks and mkeeler December 5, 2024 20:32
mkeeler previously approved these changes Dec 6, 2024
@kubawi kubawi changed the title from “Demote failed servers first during reconciliation” to “Demote single failed server first during reconciliation with an odd number of voters” Dec 6, 2024
banks previously approved these changes Dec 6, 2024
Member

@banks banks left a comment


Yes @kubawi, such awesome work.

I realised a few of these comments were from a while back but I forgot to submit them, so I'm sorry for that. They're all optional code-readability suggestions or no-op commentary of me convincing myself of things!

reconcile_test.go (outdated inline comment, resolved)
reconcile.go (inline comment, resolved)
reconcile.go (outdated inline comment, resolved)
Comment on lines +175 to +176
// We only want to demote one failed server at a time to prevent violating the minimum quorum setting.
if demoteSingleFailedServer {
Member


What if we failed to demote it above? i.e. demoted = false 🤔 . I'm not sure under what circumstances that could happen.

I think the answer is that we should probably still limit ourselves to 1, because the most likely way I can think of that we could fail to demote is that we time out or lose the connection before we find out whether it worked. In that case we can't assume that err != nil means it actually failed, so continuing to attempt another unhealthy demotion could violate the "only one" limit.

I guess the result is that we'd return the error, stop the reconcile loop, and then try again in 10 seconds.

Yeah, overall I agree this is the most sensible approach, just had to think it through!

Comment on lines +280 to +294
m.On("DemoteVoter",
raft.ServerID("33333333-3333-3333-3333-333333333333"),
uint64(0),
time.Duration(0),
).Return(&raftIndexFuture{}).Once()
m.On("DemoteVoter",
raft.ServerID("44444444-4444-4444-4444-444444444444"),
uint64(0),
time.Duration(0),
).Return(&raftIndexFuture{}).Once()
m.On("DemoteVoter",
raft.ServerID("55555555-5555-5555-5555-555555555555"),
uint64(0),
time.Duration(0),
).Return(&raftIndexFuture{}).Once()
Member


It looked off to me at first to see us demoting a full half of the cluster in one round 😱. But I guess this test isn't setting a min_quorum, so this is OK because it's actually doing each demotion as a separate config change and reducing the quorum as it goes, ending up with a cluster of 3 healthy servers in a quorum, right?

Contributor Author


Yep, that was my thinking. These would all be done in a single reconcile loop (this is the behaviour from before this PR anyway), but as separate raft reconfigurations. Mainly, though, I wanted to ensure that we have a test case asserting on multiple demotions, regardless of health status, happening in a single reconcile round that starts with an even number of voters.

…tions done in the same reconciliation round as the single demotion of a failed server
@kubawi kubawi dismissed stale reviews from banks and mkeeler via e7d2fc3 December 9, 2024 18:06
@kubawi kubawi requested review from banks and raskchanky and removed request for raskchanky December 9, 2024 18:10
banks previously approved these changes Dec 10, 2024
Member

@banks banks left a comment


Nice!

All looking good to me. Great job Kuba!

raskchanky previously approved these changes Dec 11, 2024
reconcile.go (outdated inline comment, resolved)
reconcile.go (outdated inline comment, resolved)
Co-authored-by: Josh Black <[email protected]>
@kubawi kubawi dismissed stale reviews from raskchanky and banks via 4375751 December 11, 2024 10:45
Co-authored-by: Josh Black <[email protected]>
@kubawi kubawi requested a review from banks December 11, 2024 11:15
Member

@banks banks left a comment


Good typo fixes!

@kubawi kubawi merged commit 5343f10 into master Dec 11, 2024
3 checks passed
@kubawi kubawi deleted the kubawi/VAULT-21282/demote-failed-servers-before-promotions-in-reconcile branch December 11, 2024 14:11