Demote single failed server first during reconciliation with an odd number of voters #55
Conversation
Great job @kubawi!
This looks like a clean implementation of the approach discussed where we ensure that we demote unhealthy servers first.
This implementation also implicitly ensures that we only change one node at a time in a given Raft config change (actually our Raft library ensures that by only supporting a single Add/Remove node call at once).
One thing I'm not clear on: can this code change add a new way to violate `min_quorum`, i.e. demote more than `f` unhealthy servers before promoting their replacements? I think that is calculated elsewhere and this code is just applying it, but it's worth checking that those calculations don't implicitly rely on promotions always happening first (e.g. they might allow a demotion that would violate `min_quorum` because they also include a promotion and assume that would happen first?). That would be bad!
I'm even wondering if we need to do something more sophisticated even when `min_quorum` isn't set 🤔.
For example, just a hypothetical case where there are 10 nodes across 5 RZs and so 5 voters:
RZ1 | RZ2 | RZ3 | RZ4 | RZ5 |
---|---|---|---|---|
A* | B* | C* | D* | E* |
F | G | H | I | J |
If two zones went down at the same time (say 4 and 5) we could have D* and E* being demoted at the same time.
If we just demote them both first and then promote two others (say F and G), there is a point after the demotions are done where there are only 3 voters, which changes the quorum from 3 of 5 to 2 of 3. That's probably not what we want! I'm not sure if `min_quorum` being set to 3 in this case would prevent that, but it should! But even assuming `min_quorum` isn't set, it would be better in this case if we instead did:
- Demote E
- Promote F
- Demote D
- Promote G
That way there would always be at least 4 voters in the configuration so we wouldn't change the quorum size.
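To make the arithmetic explicit (majority quorum = ⌊voters/2⌋ + 1), the interleaved sequence never changes the quorum:

Step | Voters | Quorum |
---|---|---|
Start (A*, B*, C*, D*, E*) | 5 | 3 |
Demote E | 4 | 3 |
Promote F | 5 | 3 |
Demote D | 4 | 3 |
Promote G | 5 | 3 |

Demoting D and E back to back, by contrast, passes through a 3-voter configuration where the quorum drops to 2.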
The question is... would that re-introduce the same problem this is trying to fix if the health status of nodes is stale? In other words, what if the leader still sees I and J as healthy even though those zones have died?
Well then it would try:
- Demote E: OK, committed by {A, B, C}, voters now {A, B, C, D}
- Promote J: OK, committed by {A, B, C}, voters now {A, B, C, D, J} (but D and J are dead)
- Demote D: OK, committed by {A, B, C}, voters now {A, B, C, J}
- Promote I: OK, committed by {A, B, C}, voters now {A, B, C, I, J}
The cluster is still OK at this point since we still have 3 out of 5 voters healthy even though we just accidentally promoted two unhealthy servers.
Pretty soon we'd notice they are unhealthy and on a future round (say in 10 seconds) we'd demote I and J and promote two healthy standbys from other zones instead.
So I think it is safe to interleave the demotions and promotions, and doing so has the benefit that it will not reduce the actual quorum size at any point.
... although that's assuming that we started off with an odd number of voters, which should be the case in general because it doesn't make sense to deploy even numbers of voters with a majority quorum scheme. But it's possible we are in an interim state where there are an even number of voters so we should consider that case. This could happen because of misconfiguration, because we started with 5 zones and one died earlier and isn't back up yet etc.
Let's say our cluster was meant to be over 5 zones, but one failed and we've been operating for a while in the following state (* means voter). Let's also assume we have a bunch of other servers unhealthy (struck through).
RZ1 | RZ2 | RZ3 | RZ4 | |
---|---|---|---|---|
A* | B* | C* | D* | |
F* | ~~G~~ | ~~H~~ | ~~I~~ | |
Now the same issue that took out G, H and I strikes F, which becomes unhealthy. On the next autopilot run we demote the unhealthy F but have no healthy candidates to promote (this is true both before and after this PR's change).
The cluster is still up with 4 out of 5 voters! We could even tolerate one more failing. Current state:
RZ1 | RZ2 | RZ3 | RZ4 | |
---|---|---|---|---|
A* | B* | C* | D* | |
Now one of our standby nodes restarts and recovers and sends a heartbeat to the leader just before the next AP reconcile. Sadly, we also lose RZ4 at exactly the same moment, so now D is unhealthy. AP now sees:
RZ1 | RZ2 | RZ3 | RZ4 | |
---|---|---|---|---|
A* | B* | C* | ~~D*~~ | |
G | | | | |
Before this PR, it would promote G first, maintaining 4 voters, before demoting D.
Now (even if we did the alternating thing I suggested above) we would demote D, taking the cluster down to only 3 voters and so shrinking the quorum size to 2, before promoting G, which brings it back up. 🤔
I think we need to ensure that we can't violate `min_quorum` with this change - that seems to be the most important point. It's not obvious to me yet whether we can or not.
The second part is: even if `min_quorum` isn't set, can we still try to avoid reducing the quorum size when demoting unhealthy servers by only demoting one unhealthy server at a time, and only if there is an odd number of voters to start with?
I think that would be the most conservative approach:
- Rather than demoting all unhealthy servers first, demote at most one unhealthy voter each round before promotions (I'm not sure if it's OK to promote more than one here though; need to think)
- Don't demote before promote if there is an even number of voters.
That would avoid introducing any new case where quorum size is changed while still solving the known issue of an RZ failure causing outage.
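As a sanity check of the second rule, here is the majority arithmetic in a tiny Go sketch (`quorumSize` is just an illustrative helper, not part of the library): demoting a voter from an odd-sized configuration keeps the quorum the same, while demoting from an even-sized one shrinks it.

```go
package autopilotsketch

import "fmt"

// quorumSize is the standard Raft majority: floor(voters/2) + 1.
func quorumSize(voters int) int {
	return voters/2 + 1
}

func demo() {
	// Odd -> even: 5 voters down to 4 keeps the quorum at 3.
	fmt.Println(quorumSize(5), quorumSize(4)) // 3 3
	// Even -> odd: 4 voters down to 3 shrinks the quorum from 3 to 2.
	fmt.Println(quorumSize(4), quorumSize(3)) // 3 2
}
```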
…nly if there is an odd number of servers
Looking great Kuba. A couple of thinkers inline. Not sure if they need changes yet but curious to hear what you and others think.
… go on to carry out promotions in the same round
I am pretty sure this change will break a few things:
- enterprise auto-upgrades
- unassisted replacement style CE upgrades
- RZ reconciliation when there are not enough healthy voters to promote.
The core issue is that it's possible for autopilot to remain in a state where it wants to demote an unhealthy voter but there is an even number of voters. The state machine will be stuck until one of the following occurs:
- An out-of-band change is made to the Raft configuration to alter suffrage.
- Other servers are added that need promotion, and applying those promotions results in a cluster with an odd number of voters.
For auto-upgrades, it's pretty much guaranteed during normal healthy operation to temporarily be in a state with an even number of voters. If at that point in time you want to demote one unhealthy server on the version being upgraded from, it would not happen and the upgrade would be stuck indefinitely.
For vanilla CE replacement-style upgrades we are in a similar boat. If an old server becomes unhealthy as new servers are added, you can end up with an even number of voters, nothing left to promote, and no way to demote anything that has failed.
Consider the case where 3 new servers are added to an existing cluster of 3 voters, one of which is unhealthy. The suffrage (healthy voters / unhealthy voters) would transition like so:
2/1 (old) -> 2/1 (old) + 3/0 (new) -> 0/1 (old) + 3/0 (new)
At that point we still have an even number of voters (four), and we still need to demote the old unhealthy server (it's never coming back).
I think the right solution then is to change the `applyDemotions` func to operate in one of two modes (rough sketch below):
- Demote a single unhealthy voter (called when the number of voters is odd, prior to any promotions)
- Demote everything as desired (called only when there were no promotions to be done)
That is in contrast to the code as implemented, which will operate in one of these two modes:
- Demote a single unhealthy voter (called when the number of voters is odd, prior to any promotions)
- Demote all healthy servers as desired (called only when there were no promotions to be done)
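For illustration, here is a rough, self-contained sketch of the two modes proposed above (all names and types are made up for the sketch; this is not the library's actual `applyDemotions` signature):

```go
package autopilotsketch

// Illustrative-only types; the real autopilot state types differ.
type serverID string

type server struct {
	ID      serverID
	Healthy bool
}

// applyDemotionsSketch captures the two proposed modes:
//   - single mode: demote at most one unhealthy voter (called before
//     promotions when the voter count is odd)
//   - full mode: apply every desired demotion, healthy or not (called only
//     when there is nothing to promote)
func applyDemotionsSketch(toDemote []server, demote func(serverID) error, single bool) (bool, error) {
	if single {
		for _, s := range toDemote {
			if !s.Healthy {
				// Stop after one unhealthy voter so the Raft configuration
				// only shrinks by one before promotions run.
				return true, demote(s.ID)
			}
		}
		return false, nil
	}

	// Full mode: demote everything reconciliation decided should lose
	// suffrage, regardless of health.
	demoted := false
	for _, s := range toDemote {
		if err := demote(s.ID); err != nil {
			return demoted, err
		}
		demoted = true
	}
	return demoted, nil
}
```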
@mkeeler yeah, I also picked up on the split between unhealthy and healthy, as well as the limit of one demotion leading to asymmetric situations. Previously, Kuba thought through all the examples I'd come up with and we found they would eventually work out OK, as far as we can tell. It sounds like you've thought of others that wouldn't, although I've not totally understood them yet. But in general, it seems much safer and easier to reason about with the change that you propose. To state it a different way:
Even then, I still feel like the asymmetry of only demoting one unhealthy node when the voter count is odd is strange. I think it's OK because it will make the cluster revert to "demote everything" on the next round (probably assuming no other cluster changes happen out of band). But it would still feel easier for me to reason about if we made this change the most minimal one it can be to solve the issue, which seems to be:
@kubawi what do you think? Was there some reason that the more minimal change was incorrect or more complex to implement?
@banks Embarrassingly enough, I cannot really recall why I picked this approach over a simpler one. It could have been that that's where my head went first and then I didn't find any reasons not to use this approach until Matt pointed out the stuck auto-upgrade scenario above. @mkeeler Thanks a lot for the review. I completely missed this scenario! 🤦 I will update the PR shortly to follow your and Paul's recommendations.
@banks I might be able to describe the issue better. The first thing is that auto-upgrades will always result in a state where there is an even number of voters. The auto-upgrade flow is generally:
If everything is healthy, the new code would:
Now if there is one unhealthy voter on the old version then the new code would:
Now the cluster will be in a state with 3 healthy voters on the new version and 1 unhealthy voter on the old version. The next reconciliation round would result in:
Then auto-upgrades will be stuck in this state until the old server becomes healthy, or another voter gets added making the voter count odd and allowing the demotion to succeed.
…erything as desired, depending on the value of the demoteSingleFailedServer argument
Yes @kubawi, such awesome work.
I realised a few of these comments were from a while back but I forgot to submit them so I'm sorry for that. All optional code readability suggestions or no-op commentary of me convincing myself of things!
```go
// We only want to demote one failed server at a time to prevent violating the minimum quorum setting.
if demoteSingleFailedServer {
```
What if we failed to demote it above, i.e. `demoted = false`? 🤔 I'm not sure under what circumstances that could happen.
I think the answer is that we should probably still limit ourselves to 1, because the most likely way I can think of that we could fail to demote is that we time out or lose the connection before we find out whether it worked. In that case we can't assume that `err != nil` means that it actually failed, so continuing to attempt another unhealthy demotion could violate the "only one" limit.
I guess the result is that we'd return the error, stop the reconcile loop, and then try again in 10 seconds.
Yeah, overall I agree this is the most sensible approach, just had to think it through!
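As a rough sketch of that behaviour (using hashicorp/raft's `DemoteVoter` signature, but with otherwise hypothetical names and wiring): even when the demotion future errors, the caller treats its single demotion for the round as used and simply retries from fresh state next round.

```go
package autopilotsketch

import (
	"fmt"
	"time"

	"github.com/hashicorp/raft"
)

// demoteOnce attempts a single demotion. Even if the future returns an error
// (e.g. a timeout), the caller should count this as its one demotion for the
// round, return the error to end the reconcile loop, and retry from a freshly
// read Raft configuration on the next round (~10s later).
func demoteOnce(r *raft.Raft, id raft.ServerID) error {
	future := r.DemoteVoter(id, 0, 10*time.Second)
	if err := future.Error(); err != nil {
		return fmt.Errorf("failed demoting server %q: %w", id, err)
	}
	return nil
}
```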
m.On("DemoteVoter", | ||
raft.ServerID("33333333-3333-3333-3333-333333333333"), | ||
uint64(0), | ||
time.Duration(0), | ||
).Return(&raftIndexFuture{}).Once() | ||
m.On("DemoteVoter", | ||
raft.ServerID("44444444-4444-4444-4444-444444444444"), | ||
uint64(0), | ||
time.Duration(0), | ||
).Return(&raftIndexFuture{}).Once() | ||
m.On("DemoteVoter", | ||
raft.ServerID("55555555-5555-5555-5555-555555555555"), | ||
uint64(0), | ||
time.Duration(0), | ||
).Return(&raftIndexFuture{}).Once() |
It looked off to me at first to see us demoting a full half of the cluster in one round 😱. But I guess this test is not setting a `min_quorum`, so this is OK because it's actually doing each as a separate config change and reducing the quorum as it goes, ending up leaving a cluster with 3 healthy servers in a quorum, right?
Yep, that was my thinking. These would all be done in a single reconcile loop (this is the behaviour from before this PR anyway), but as separate Raft reconfigurations. Mainly, though, I wanted to ensure that we have a test case asserting on multiple demotions, regardless of health status, happening in a single reconcile round that starts with an even number of voters.
…akeServerState (for test readability)
…tions done in the same reconciliation round as the single demotion of a failed server
Nice!
All looking good to me. Great job Kuba!
…efore-promotions-in-reconcile
Co-authored-by: Josh Black <[email protected]>
Good typo fixes!
This PR updates the autopilot reconciliation logic to demote a single unhealthy voter before carrying out promotions in an `(Autopilot).reconcile` call, when there is an odd number of voters and at least one of them is unhealthy.
The purpose of this change is to avoid temporarily increasing the quorum size when reconciling in the presence of unhealthy voters while the overall number of voters is odd. That temporary change of quorum makes the cluster vulnerable to failure in the scenario described below.
In Vault, we've encountered an issue where losing a voter, then a non-voter in the same redundancy zone in short succession, would lead to the cluster becoming unhealthy (losing quorum).
In the scenario above, the following would happen before the change in this PR (write-up stolen from @banks):