Backport of Update alloc after reconnect and enforce client heartbeat order into release/1.4.x #15153
Backport
This PR is auto-generated from #15068 to be assessed for backporting due to the inclusion of the label backport/1.4.x.
The below text is copied from the body of the original PR.
When a Nomad client with at least one alloc with `max_client_disconnect` misses a heartbeat, the leader updates its status to `disconnected`, which results in evals for all jobs that have an allocation in that client. This eval is reconciled by the scheduler, which updates the alloc `ClientStatus` to `unknown` and appends a new state value to indicate the allocation is considered disconnected.

When the client reconnects, its status is set back to `ready` since the heartbeat succeeds. The `ClientStatus` of the allocation is still `unknown`, as it may have failed while the client was disconnected. This status is only updated once the client calls `Node.UpdateAlloc` against a server, which overwrites the `unknown` `ClientStatus` value set in the server with the correct status.

From the leader's perspective, the node goes from `ready` to `disconnected` and back to `ready`, while the allocation's `ClientStatus` goes from `running` to `unknown` and only back to `running` once `Node.UpdateAlloc` is processed.
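As a rough illustration of that sequence, the Go sketch below prints the leader-side view at each step. The `leaderView` type and the status strings are simplified stand-ins for illustration, not Nomad's actual data structures.

```go
package main

import "fmt"

// leaderView is a hypothetical, simplified view of what the leader stores
// for a node and one of its allocations at each step described above.
type leaderView struct {
	step         string
	nodeStatus   string // Node.Status as stored by the servers
	clientStatus string // Alloc.ClientStatus as stored by the servers
}

func main() {
	flow := []leaderView{
		{"client is healthy", "ready", "running"},
		{"client misses heartbeat", "disconnected", "running"},
		{"scheduler reconciles the eval", "disconnected", "unknown"},
		{"client heartbeats again (Node.UpdateStatus)", "ready", "unknown"},
		{"client updates allocs (Node.UpdateAlloc)", "ready", "running"},
	}
	for _, v := range flow {
		fmt.Printf("%-45s node=%-12s alloc=%s\n", v.step, v.nodeStatus, v.clientStatus)
	}
}
```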
The problem described in #14925 happens because any eval that happens after this disconnect/reconnect flow (such as a job update) is essentially indistinguishable from the last step in the flow above: the node is `ready` and the alloc `ClientStatus` is `running`, but these two scenarios need to be handled differently, so we need to store something in state to be able to detect this difference.

The current implementation uses the presence of a `TaskClientReconnected` task event to detect if an alloc needs to reconnect, but this event is still present even after the alloc reconnects, so the scheduler always considers the alloc as still reconnecting.

While testing the fix for #14925 I would often hit #12680, which prevented the issue from being triggered since the extra alloc would not be considered as "reconnecting", so I included a fix for #12680 in this PR as well since their root causes are similar.
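To make the flaw concrete, here is a minimal Go sketch of an event-based check, using simplified stand-in types rather than Nomad's real allocation and task-state structs: once a `TaskClientReconnected` event is recorded it never goes away, so the check keeps returning true.

```go
package main

import "fmt"

// Simplified stand-ins for the task event and allocation types; these are
// not Nomad's real structs.
type TaskEvent struct{ Type string }

type TaskState struct{ Events []*TaskEvent }

type Allocation struct {
	ClientStatus string
	TaskStates   map[string]*TaskState
}

// needsReconnect mirrors the event-based check: it reports whether any task
// ever recorded a TaskClientReconnected event.
func needsReconnect(alloc *Allocation) bool {
	for _, ts := range alloc.TaskStates {
		for _, ev := range ts.Events {
			if ev.Type == "TaskClientReconnected" {
				return true
			}
		}
	}
	return false
}

func main() {
	alloc := &Allocation{
		ClientStatus: "running",
		TaskStates: map[string]*TaskState{
			"web": {Events: []*TaskEvent{
				{Type: "Received"},
				{Type: "TaskClientReconnected"},
			}},
		},
	}

	// The event stays in the task state even after the reconnect has been
	// fully processed, so every later eval keeps treating the alloc as
	// still reconnecting.
	fmt.Println(needsReconnect(alloc)) // true, and it stays true forever
}
```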
Clients use three main RPC methods to communicate their state to the servers:

- `Node.GetAllocs` reads allocation data from the server and writes it to the client.
- `Node.UpdateAlloc` reads allocation data from the client and writes it to the server.
- `Node.UpdateStatus` writes the client status to the server and is used as the heartbeat mechanism.

These RPC methods are called periodically by the client, and independently from each other. The usual mental model is that clients heartbeat first with `Node.UpdateStatus` and then update their allocation data with `Node.UpdateAlloc`, but this is not always true. If these two operations are reversed, the state at the second-to-last eval (`Node updates allocs`) looks just like the state when the node missed its heartbeat (`Node misses heartbeat`): `Node.Status: disconnected` and `Alloc.ClientStatus: running`.
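One possible way to enforce the expected order is sketched below; the `client` type, `heartbeated` channel, and method names are hypothetical, not Nomad's actual client code. The idea is to hold back `Node.UpdateAlloc` calls until the first `Node.UpdateStatus` after (re)connecting has succeeded.

```go
package main

import (
	"fmt"
	"time"
)

// client is a hypothetical stand-in for the Nomad client; the real
// implementation is more involved, this only illustrates the ordering.
type client struct {
	// heartbeated is closed once the first Node.UpdateStatus call after
	// (re)connecting has succeeded.
	heartbeated chan struct{}
}

func newClient() *client {
	return &client{heartbeated: make(chan struct{})}
}

// heartbeat stands in for a successful Node.UpdateStatus round trip.
func (c *client) heartbeat() {
	// ... call Node.UpdateStatus here ...
	close(c.heartbeated)
}

// syncAllocs stands in for the loop that calls Node.UpdateAlloc. It blocks
// until the heartbeat has gone through, so the servers never receive alloc
// updates from a node they still consider disconnected.
func (c *client) syncAllocs() {
	<-c.heartbeated
	// ... call Node.UpdateAlloc here ...
	fmt.Println("alloc update sent after heartbeat")
}

func main() {
	c := newClient()
	go c.syncAllocs()                 // alloc sync may fire first...
	time.Sleep(10 * time.Millisecond) // ...but it waits for the heartbeat
	c.heartbeat()
	time.Sleep(10 * time.Millisecond) // give syncAllocs time to print
}
```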
This PR addresses these problems in the following ways:

- Record the allocation reconnect as a new `AllocState` entry, to keep it consistent with how disconnects are recorded.
- Detect if an alloc needs to reconnect by looking at its `AllocState`: if it's `unknown`, the alloc reconnect has not been processed yet (see the sketch below).

The commits are split by different work chunks:

- `running` allocs in disconnected clients

This is an alternative fix for #14948.
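The sketch below illustrates the `AllocState`-based check described in the list above, using simplified stand-in types rather than Nomad's real structs: the scheduler looks at the last recorded client-status transition, and the reconnect counts as processed only once a new entry has been appended.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the allocation and its state history; these are
// not Nomad's real structs.
type AllocState struct {
	Field string // which field changed, e.g. "ClientStatus"
	Value string
	Time  time.Time
}

type Allocation struct {
	ClientStatus string
	AllocStates  []*AllocState
}

// reconnectPending reports whether the last recorded client-status
// transition is still "unknown", meaning the reconnect has not been
// processed yet.
func reconnectPending(alloc *Allocation) bool {
	var last *AllocState
	for _, s := range alloc.AllocStates {
		if s.Field == "ClientStatus" {
			last = s
		}
	}
	return last != nil && last.Value == "unknown"
}

func main() {
	alloc := &Allocation{
		ClientStatus: "running",
		AllocStates: []*AllocState{
			// appended by the server when the node was marked disconnected
			{Field: "ClientStatus", Value: "unknown", Time: time.Now()},
		},
	}
	fmt.Println(reconnectPending(alloc)) // true: reconnect not processed yet

	// Recording the reconnect as another AllocState entry, consistent with
	// how the disconnect was recorded, makes later evals distinguishable.
	alloc.AllocStates = append(alloc.AllocStates,
		&AllocState{Field: "ClientStatus", Value: "running", Time: time.Now()})
	fmt.Println(reconnectPending(alloc)) // false
}
```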
Closes #14925
Closes #12680