Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of Update alloc after reconnect and enforece client heartbeat order into release/1.4.x #15153

Conversation

hc-github-team-nomad-core
Copy link
Contributor

Backport

This PR is auto-generated from #15068 to be assessed for backporting due to the inclusion of the label backport/1.4.x.

The below text is copied from the body of the original PR.


When a Nomad client with at least one alloc with max_client_disconnect misses a heartbeat the leader updates its status to disconnected which results in evals for all jobs that have an allocation in that client. This eval is reconciled by the scheduler, and updates the alloc ClientStatus to unknown and appends a new state value to indicate the allocation is considered disconnected.

When the client reconnects again its status is set back to ready since the heartbeat succeeds. The ClientStatus of the allocation is still unknown as it may have failed while the client was disconnected. This status will only be updated once the client calls the Node.UpdateAlloc against a server, which will overwrite the unknown ClientStatus value set in the server with the correct status.

From the leader perspective this is what happens:

node-disconnect drawio

The problem described in #14925 happens because any eval after this disconnect/reconnect flow happens (such as a job update) is essentially indistinguishable from the last eval in the diagram above: the node is ready and the alloc ClientStatus is running, but these two scenarios need to be handle differently and so we need to store something in state to be able to detect this difference.

Current implementation uses the presence of a TaskClientReconnected task event to detect if an alloc needs to reconnect, but this event is still present even after the alloc reconnects, so the scheduler always consider the alloc as still reconnecting.

While testing the fix for #14925 I would often hit #12680, which prevented the issue from being triggered since the extra alloc would not be considered as "reconnecting", so I included a fix for #12680 in this PR as well since their root cause is similar.

Clients use three main RPC methods to communicate its state with servers:

  • Node.GetAllocs reads allocation data from the server and writes it to the client.
  • Node.UpdateAlloc reads allocation from from the client and writes them to the server.
  • Node.UpdateStatus writes the client status to the server and is used as the heartbeat mechanism.

These RPC methods are called periodically by the client, and independently from each other. The usual mental model is that clients heartbeat first with Node.UpdateStatus and then update their allocation data with Node.UpdateAlloc, but this is not always true. If these two operations are reversed the diagram above will look like this:

node-disconnect drawio2

The state at the second last eval (Node updates allocs) looks just like the state when the node missed its heartbeat (Node misses heartbeat): Node.Status: disconnected and Alloc.ClientStatus: running.

This PR addresses these problems in the following ways:

  • Don't rely on task events since only a limited number of them are kept in state
  • Record when a allocation reconnects as an AllocState entry to keep it consistent with how disconnects are recorded
  • Check if the alloc needs to reconnect by verifying the last AllocState, it it's unknown the alloc reconnect has not been processed yet
  • Enforce that nodes register and heartbeat successfully before being able to update their allocs. This makes the reconnect flow easier to reason about and more predictable.

The commits are split by different work chunks:

This is an alternative fix for #14948.

Closes #14925
Closes #12680

@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/b-update-reconnected-alloc/hugely-integral-pug branch from ddcfb22 to 6b88ce1 Compare November 4, 2022 20:26
@hc-github-team-nomad-core hc-github-team-nomad-core merged commit 54a25a5 into release/1.4.x Nov 4, 2022
@hc-github-team-nomad-core hc-github-team-nomad-core deleted the backport/b-update-reconnected-alloc/hugely-integral-pug branch November 4, 2022 20:26
@github-actions
Copy link

github-actions bot commented Mar 5, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants