Backport of Update alloc after reconnect and enforece client heartbeat order into release/1.4.x #15153

hc-github-team-nomad-core · 2022-11-04T20:26:03Z

Backport

This PR is auto-generated from #15068 to be assessed for backporting due to the inclusion of the label backport/1.4.x.

The below text is copied from the body of the original PR.

When a Nomad client with at least one alloc with max_client_disconnect misses a heartbeat the leader updates its status to disconnected which results in evals for all jobs that have an allocation in that client. This eval is reconciled by the scheduler, and updates the alloc ClientStatus to unknown and appends a new state value to indicate the allocation is considered disconnected.

When the client reconnects again its status is set back to ready since the heartbeat succeeds. The ClientStatus of the allocation is still unknown as it may have failed while the client was disconnected. This status will only be updated once the client calls the Node.UpdateAlloc against a server, which will overwrite the unknown ClientStatus value set in the server with the correct status.

From the leader perspective this is what happens:

The problem described in #14925 happens because any eval after this disconnect/reconnect flow happens (such as a job update) is essentially indistinguishable from the last eval in the diagram above: the node is ready and the alloc ClientStatus is running, but these two scenarios need to be handle differently and so we need to store something in state to be able to detect this difference.

Current implementation uses the presence of a TaskClientReconnected task event to detect if an alloc needs to reconnect, but this event is still present even after the alloc reconnects, so the scheduler always consider the alloc as still reconnecting.

While testing the fix for #14925 I would often hit #12680, which prevented the issue from being triggered since the extra alloc would not be considered as "reconnecting", so I included a fix for #12680 in this PR as well since their root cause is similar.

Clients use three main RPC methods to communicate its state with servers:

Node.GetAllocs reads allocation data from the server and writes it to the client.
Node.UpdateAlloc reads allocation from from the client and writes them to the server.
Node.UpdateStatus writes the client status to the server and is used as the heartbeat mechanism.

These RPC methods are called periodically by the client, and independently from each other. The usual mental model is that clients heartbeat first with Node.UpdateStatus and then update their allocation data with Node.UpdateAlloc, but this is not always true. If these two operations are reversed the diagram above will look like this:

The state at the second last eval (Node updates allocs) looks just like the state when the node missed its heartbeat (Node misses heartbeat): Node.Status: disconnected and Alloc.ClientStatus: running.

This PR addresses these problems in the following ways:

Don't rely on task events since only a limited number of them are kept in state
Record when a allocation reconnects as an AllocState entry to keep it consistent with how disconnects are recorded
Check if the alloc needs to reconnect by verifying the last AllocState, it it's unknown the alloc reconnect has not been processed yet
Enforce that nodes register and heartbeat successfully before being able to update their allocs. This makes the reconnect flow easier to reason about and more predictable.

The commits are split by different work chunks:

882904b is the main fix for Job updates not applied to reconnected allocations #14925, both for system and service jobs.
720513f fixes disconnected clients: Reconnecting allocation can cause a replacement that is immediately stopped #12680 by properly handling running allocs in disconnected clients
956a3fa enforces clients to heartbeat before they can update their allocs. While 720513f would be enough to fix disconnected clients: Reconnecting allocation can cause a replacement that is immediately stopped #12680 I noticed other edge cases that I was not able to fix in the scheduler.

This is an alternative fix for #14948.

Closes #14925
Closes #12680

github-actions · 2023-03-05T02:24:01Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

lgfa29 added 12 commits November 2, 2022 02:08

backport of commit 882904b

71691d5

backport of commit 720513f

0ce9e78

backport of commit 956a3fa

56787f5

backport of commit 3e081c1

6c67191

backport of commit aa324ad

46e09e1

backport of commit f5ce8a9

3eb69ef

backport of commit 12f1ff5

a837a8d

backport of commit cb51a28

76d1e01

backport of commit e57379c

eded639

backport of commit 23613c5

ffd5fd4

backport of commit da8aee4

c09925e

backport of commit 48bfdc4

6b88ce1

hc-github-team-nomad-core force-pushed the backport/b-update-reconnected-alloc/hugely-integral-pug branch from ddcfb22 to 6b88ce1 Compare November 4, 2022 20:26

hc-github-team-nomad-core merged commit 54a25a5 into release/1.4.x Nov 4, 2022

hc-github-team-nomad-core deleted the backport/b-update-reconnected-alloc/hugely-integral-pug branch November 4, 2022 20:26

vercel bot deployed to Preview – nomad-storybook-and-ui November 4, 2022 20:32 View deployment

github-actions bot locked as resolved and limited conversation to collaborators Mar 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backport of Update alloc after reconnect and enforece client heartbeat order into release/1.4.x #15153

Backport of Update alloc after reconnect and enforece client heartbeat order into release/1.4.x #15153

hc-github-team-nomad-core commented Nov 4, 2022

github-actions bot commented Mar 5, 2023

Backport of Update alloc after reconnect and enforece client heartbeat order into release/1.4.x #15153

Backport of Update alloc after reconnect and enforece client heartbeat order into release/1.4.x #15153

Conversation

hc-github-team-nomad-core commented Nov 4, 2022

Backport

github-actions bot commented Mar 5, 2023