Update alloc after reconnect and enforce client heartbeat order #15068
Conversation
When an allocation reconnects to a cluster the scheduler needs to run special logic to handle the reconnection, check if a replacement was created, and stop one of them. If the allocation kept running while the node was disconnected, it will be reconnected with `ClientStatus: running` and the node will have `Status: ready`. This combination is the same as the normal steady state of an allocation, where everything is running as expected. In order to differentiate between the two states (an allocation that is reconnecting and one that is just running) the scheduler needs an extra piece of state.

The current implementation uses the presence of a `TaskClientReconnected` task event to detect when the allocation has reconnected and thus must go through the reconnection process. But this event remains even after the allocation is reconnected, causing all future evals to consider the allocation as still reconnecting.

This commit changes the reconnect logic to use an `AllocState` to register when the allocation was reconnected. This provides the following benefits:

- Only a limited number of task states are kept, and they are used for many other events. It's possible that, upon reconnecting, several actions are triggered that could cause the `TaskClientReconnected` event to be dropped.
- Task events are set by clients and so their timestamps are subject to time skew from servers. This prevents using time to determine if an allocation reconnected after a disconnect event.
- Disconnect events are already stored as `AllocState`, so storing reconnects there as well makes it the only source of information required.

With the new logic, the reconnection process is only triggered if the last `AllocState` is a disconnect event, meaning that the allocation has not been reconnected yet. After the reconnection is handled, the new `ClientStatus` is stored in `AllocState`, allowing future evals to skip the reconnection logic.
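The check described above can be sketched roughly as follows. This is a simplified illustration, not Nomad's actual code: the types and the `needsToReconnect` helper are assumptions modeled on the identifiers mentioned in this PR (`AllocState`, `AllocStateFieldClientStatus`).

```go
package main

import "fmt"

// Simplified stand-ins for Nomad's types; names follow the PR discussion
// but the exact shapes here are assumptions for illustration only.
type AllocStateField uint8

const AllocStateFieldClientStatus AllocStateField = 0

type AllocState struct {
	Field AllocStateField
	Value string
}

// needsToReconnect reports whether the alloc must go through the
// reconnect process: only when the most recent client-status entry in
// AllocState is "unknown", i.e. the last recorded event is a disconnect
// that has not yet been followed by a reconnect.
func needsToReconnect(states []AllocState) bool {
	var lastStatus string
	for i := len(states) - 1; i >= 0; i-- {
		s := states[i]
		if s.Field != AllocStateFieldClientStatus {
			continue
		}
		lastStatus = s.Value
		break
	}
	return lastStatus == "unknown"
}

func main() {
	disconnected := []AllocState{
		{Field: AllocStateFieldClientStatus, Value: "unknown"},
	}
	reconnected := append(disconnected,
		AllocState{Field: AllocStateFieldClientStatus, Value: "running"})

	fmt.Println(needsToReconnect(disconnected)) // true: reconnect still pending
	fmt.Println(needsToReconnect(reconnected))  // false: reconnect already handled
}
```

Because the handled reconnect appends a new `ClientStatus` entry, the same check naturally returns false on subsequent evals.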
When a client reconnects it makes two independent RPC calls:

- `Node.UpdateStatus` to heartbeat and set its status as `ready`.
- `Node.UpdateAlloc` to update the status of its allocations.

These two calls can happen in any order, and if the allocations are updated before a heartbeat the state looks the same as a node being disconnected: the node status will still be `disconnected` while the allocation `ClientStatus` is set to `running`. The current implementation did not handle this order of events properly, and the scheduler would create an unnecessary placement since it considered the allocation to be disconnecting. This extra allocation would then be quickly stopped by the heartbeat eval.

This commit adds a new code path to handle this order of events. If the node is `disconnected` and the allocation `ClientStatus` is `running`, the scheduler will check if the allocation is actually reconnecting using its `AllocState` events.
Clients interact with servers using three main RPC methods:

- `Node.GetAllocs` reads allocation data from the server and writes it to the client.
- `Node.UpdateAlloc` reads allocations from the client and writes them to the server.
- `Node.UpdateStatus` writes the client status to the server and is used as the heartbeat mechanism.

These three methods are called periodically by the clients, and independently from each other, meaning that there can't be any assumptions about their ordering. This can generate scenarios that are hard to reason about and to code for. For example, when a client misses too many heartbeats it will be considered `down` or `disconnected` and the allocations it was running are set to `lost` or `unknown`.

When connectivity to the rest of the cluster is restored, the natural mental model is to think that the client will heartbeat first and then update its allocation statuses on the servers. But since there's no inherent order in these calls the reverse is just as possible: the client updates the alloc status and then heartbeats. This results in a state where allocs are, for example, `running` while the client is still `disconnected`.

This commit adds a new verification to the `Node.UpdateAlloc` method to reject updates from nodes that are not `ready`, forcing clients to heartbeat first. Since this check is done server-side there is no need to coordinate operations client-side: clients can continue sending these requests independently and the alloc update will succeed after the heartbeat is done.
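The server-side guard amounts to a status check at the top of the RPC handler before any alloc update is applied. A minimal sketch, assuming simplified stand-in types (`Node`, `NodeStatusReady`, and the `checkNodeReady` helper are illustrative names, not Nomad's exact code):

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for Nomad's structures; names are assumptions.
const NodeStatusReady = "ready"

type Node struct {
	ID     string
	Status string
}

var errNodeNotReady = errors.New("node is not ready")

// checkNodeReady mirrors the guard added to Node.UpdateAlloc: alloc
// updates coming from a node that has not yet heartbeated as ready are
// rejected, which forces Node.UpdateStatus to succeed first.
func checkNodeReady(node *Node) error {
	if node.Status != NodeStatusReady {
		return fmt.Errorf("%w: node %s has status %q",
			errNodeNotReady, node.ID, node.Status)
	}
	return nil
}

func main() {
	n := &Node{ID: "client-1", Status: "disconnected"}
	fmt.Println(checkNodeReady(n)) // rejected until the heartbeat flips status

	n.Status = NodeStatusReady
	fmt.Println(checkNodeReady(n)) // nil: alloc updates may now proceed
}
```

Because the client retries `Node.UpdateAlloc` periodically anyway, a rejected update simply succeeds on a later attempt, after the heartbeat has gone through; no client-side coordination is needed.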
for i := len(a.AllocStates) - 1; i >= 0; i-- {
	s := a.AllocStates[i]
	if s.Field != AllocStateFieldClientStatus {
		continue
	}
}
This assumes that states are ordered chronologically, which I think is fine since entries are added using `alloc.AppendState`. Another option would be to traverse all values.
When the client reconnects with the server it synchronizes the state of its allocations by sending data using the `Node.UpdateAlloc` RPC and fetching data using the `Node.GetClientAllocs` RPC. If the data fetch happens before the data write, `unknown` allocations will still be in this state and would trigger the `allocRunner.Reconnect` flow. But when the server `DesiredStatus` for the allocation is `stop` the client should not reconnect the allocation.
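The client-side guard described in this commit can be sketched as below. This is an illustrative simplification, not Nomad's actual client code: the `Allocation` shape and the `shouldReconnect` helper are assumed names based on the commit message (`unknown` status, server `DesiredStatus` of `stop`).

```go
package main

import "fmt"

// Simplified alloc for illustration; field names follow the PR text
// but the exact shape is an assumption.
type Allocation struct {
	ID            string
	ClientStatus  string
	DesiredStatus string
}

// shouldReconnect sketches the guard: an alloc still in the unknown
// state after a Node.GetClientAllocs fetch is only put through the
// allocRunner reconnect flow if the server still wants it running.
func shouldReconnect(alloc *Allocation) bool {
	if alloc.ClientStatus != "unknown" {
		return false
	}
	return alloc.DesiredStatus == "run"
}

func main() {
	fmt.Println(shouldReconnect(&Allocation{ClientStatus: "unknown", DesiredStatus: "run"}))  // true
	fmt.Println(shouldReconnect(&Allocation{ClientStatus: "unknown", DesiredStatus: "stop"})) // false
}
```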
More race condition fun: f5ce8a9 😄 This time between alloc reads and writes in the client. If the client reads the alloc state before writing it, it will trigger the `allocRunner.Reconnect` flow.
In the description:

> When a Nomad client misses a heartbeat the leader updates its status to disconnected

should probably read:

> When a Nomad client with at least one alloc with `max_client_disconnect` misses a heartbeat...

Clients without an alloc with `max_client_disconnect` set will transition to `down` and allocs to `lost`. Minor pedantic point, but I suspect we may be linking to this PR in the future to explain this behavior so I think it's worth being precise. 😁
Don't rely on task events since only a limited number of them are kept in state
🎉 I'm really happy to be keeping task events for humans only.
Absolutely fantastic work from the repro to the description to the code! What a complex situation to debug and fixup.
An e2e test would be nice since there's so much emergent nondeterministic behavior at play. Your unit tests are great but come with some assumptions about the state of the data being tested that may not cover everything that can happen at runtime.
I forget if our `Test{Agent,Server}` make it possible to write an integration test which at least exercises these changes with a real `client.Client`.

The obvious problem with e2e testing is how to simulate a disconnected client. Do I have a terrible idea for you! 😅

Schedule count=2 nomad agents via raw exec with the "fake" nomad's `datacenter = your-test-name`, and then schedule a job with `max_client_disconnect` set that targets that datacenter. Grab the node ID of which of the 2 your alloc got scheduled on, send it a kill signal, let it restart via the parent Nomad, and then assert this intended behavior.

I think the nested-nomad agent should work, and the unique datacenter trick should prevent any flakiness from other jobs colliding with your careful placement scheme.
Does not block merging this though. Probably best as a followup.
scheduler/reconcile_util.go
Outdated
// Handle the case where the client updates its allocs
// before sending a heartbeat. The alloc will be set to
// running, but the client is still considered to be
// disconnected.
I thought this was prevented by the node readiness check added in Node.UpdateAlloc?
Yup, the RPC update is enough for this particular issue, but I was wondering if there was another scenario that we don't know about that could trigger this condition (like some data corruption? or bad alloc in state?) and so I kept the reconciler fix as well as a kind of layered defense.
But I can remove this as well, less code to think about? 😅
I got really confused by this while reading the comments for the test cases in `TestAllocSet_filterByTainted` and `TestReconciler_Disconnected_Client`, which suggest it's still possible to hit this case. Two years from now someone will read that and wonder if the test is stale. 😀
I wouldn't be opposed to removing this extra code and tests but if we do leave it, I'd recommend making it clear in the docstrings that this is a case that should never happen.
That's what I meant by the layered approach. The RPC change fixes things externally, but without this change the reconciler can still in theory generate an invalid result given these specific conditions.
I would've liked to be able to fix everything in the scheduler, but I found one case for system jobs that I couldn't and only the RPC ordering fixed it, and I suspect there may be other edge cases like this, so I kept both approaches.
But I agree that a comment stating this should not happen would be good.
I'm sold now 😄
Adding code to the scheduler that we don't expect to hit just adds to an already complex part of the code, so I removed this check and its equivalent for `system` scheduling. Thanks for the feedback!
I will leave this thread unresolved just in case we come back to this in the future.
I haven't finished reviewing yet but wanted to surface my comments before I wrap up for the day 😀 . Will pick this back up first thing tomorrow.
Edit: fixed in cb51a28. 12f1ff5#diff-97e6d5511d191b7f15c08c580b602f02f7d22cfc2879f1a7d0a284c206b1c549 prevented the state store from being incorrectly mutated, so the alloc update needs to be sent to the plan applier to persist the changes. I should've noticed this since I had to do the same thing in the system scheduler 😬
Reconnected allocs have a new AllocState entry that must be persisted by the plan applier.
The `AllocUpdateRequest` struct is used in three disjoint use cases:

1. Stripped allocs from clients via the `Node.UpdateAlloc` RPC, using the `Allocs` and `WriteRequest` fields
2. Raft log messages, using the `Allocs`, `Evals`, and `WriteRequest` fields
3. Plan updates, using the `AllocsStopped`, `AllocsUpdated`, and `Job` fields

Adding a new field that would only be used in one of these cases (1) made things more confusing and error prone. While in theory an `AllocUpdateRequest` could send allocations from different nodes, in practice this never actually happens since only clients call this method with their own allocations.
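The three disjoint uses can be made visible by annotating the struct's field groups. This is an illustrative sketch with simplified stand-in types, not the actual Nomad definition; only the field names come from the discussion above.

```go
package main

import "fmt"

// Simplified stand-ins; the real types carry many more fields.
type Allocation struct{ ID string }
type Evaluation struct{ ID string }
type Job struct{ ID string }
type WriteRequest struct{ Region string }

// AllocUpdateRequest sketch: one struct, three disjoint callers.
type AllocUpdateRequest struct {
	// Case 1 (client Node.UpdateAlloc RPC) and case 2 (Raft log).
	Allocs []*Allocation
	// Case 2 (Raft log) only.
	Evals []*Evaluation
	// Case 3 (plan updates) only.
	AllocsStopped []*Allocation
	AllocsUpdated []*Allocation
	Job           *Job

	WriteRequest // cases 1 and 2
}

func main() {
	// Case 1: a client pushing updates for its own allocations.
	req := AllocUpdateRequest{
		Allocs:       []*Allocation{{ID: "alloc-1"}},
		WriteRequest: WriteRequest{Region: "global"},
	}
	fmt.Println(len(req.Allocs)) // 1
}
```

Overloading one request type this way is why a field serving only case 1 was confusing: readers of cases 2 and 3 would see a field that is never populated on their path.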
LGTM! (Once the docstrings are fixed up as being discussed)
This condition could only be hit if, somehow, the allocation status was set to "running" while the client was "unknown". This was addressed by enforcing an order in "Node.UpdateStatus" and "Node.UpdateAlloc" RPC calls, so this scenario is not expected to happen. Adding unnecessary code to the scheduler makes it harder to read and reason about it.
When a Nomad client with at least one alloc with `max_client_disconnect` misses a heartbeat the leader updates its status to `disconnected`, which results in evals for all jobs that have an allocation in that client. This eval is reconciled by the scheduler, which updates the alloc `ClientStatus` to `unknown` and appends a new state value to indicate the allocation is considered disconnected.

When the client reconnects again its status is set back to `ready` since the heartbeat succeeds. The `ClientStatus` of the allocation is still `unknown` as it may have failed while the client was disconnected. This status will only be updated once the client calls `Node.UpdateAlloc` against a server, which will overwrite the `unknown` `ClientStatus` value set in the server with the correct status.

From the leader perspective this is what happens:

The problem described in #14925 happens because any eval after this disconnect/reconnect flow (such as a job update) is essentially indistinguishable from the last eval in the diagram above: the node is `ready` and the alloc `ClientStatus` is `running`, but these two scenarios need to be handled differently, so we need to store something in state to be able to detect this difference.

The current implementation uses the presence of a `TaskClientReconnected` task event to detect if an alloc needs to reconnect, but this event is still present even after the alloc reconnects, so the scheduler always considers the alloc as still reconnecting.

While testing the fix for #14925 I would often hit #12680, which prevented the issue from being triggered since the extra alloc would not be considered as "reconnecting", so I included a fix for #12680 in this PR as well since their root causes are similar.
Clients use three main RPC methods to communicate their state with servers:

- `Node.GetAllocs` reads allocation data from the server and writes it to the client.
- `Node.UpdateAlloc` reads allocations from the client and writes them to the server.
- `Node.UpdateStatus` writes the client status to the server and is used as the heartbeat mechanism.

These RPC methods are called periodically by the client, and independently from each other. The usual mental model is that clients heartbeat first with `Node.UpdateStatus` and then update their allocation data with `Node.UpdateAlloc`, but this is not always true. If these two operations are reversed the diagram above will look like this:

The state at the second-to-last eval (`Node updates allocs`) looks just like the state when the node missed its heartbeat (`Node misses heartbeat`): `Node.Status: disconnected` and `Alloc.ClientStatus: running`.

This PR addresses these problems in the following ways:

- Register reconnects as a new `AllocState` entry to keep it consistent with how disconnects are recorded
- Use the alloc's `AllocState`: if it's `unknown` the alloc reconnect has not been processed yet

The commits are split by different work chunks:

- `running` allocs in disconnected clients

This is an alternative fix for #14948.
Closes #14925
Closes #12680