CNI: use tmpfs location for ipam plugin #24650
Conversation
When a Nomad host reboots, the network namespace files in the tmpfs in `/var/run` are wiped out. So when we restore allocations after a host reboot, we need to be able to restore both the network namespace and the network configuration. But because the netns is newly created and we need to run the CNI plugins again, this creates potential conflicts with the IPAM plugin, which has written state to persistent disk at `/var/lib/cni`. These IPs aren't the ones advertised to Consul, so there's no particular reason to keep them around after a host reboot, because all virtual interfaces need to be recreated too.

Reconfigure the CNI bridge configuration to use `/var/run/cni` as its state directory. We already expect this location to be created by CNI, because the netns files are hard-coded to be created there too in `libcni`.

Note this does not fix the problem described for Docker in #24292, because that appears to be related to the netns itself being restored unexpectedly from Docker's state.

Ref: #24292 (comment)
Ref: https://www.cni.dev/plugins/current/ipam/host-local/#files
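For context, the host-local IPAM plugin's on-disk layout (per the cni.dev reference linked in this PR) is roughly one file per reserved address, named after the IP and recording which container holds it. Here is a minimal Python sketch of that layout, illustrating why stale reservations under a persistent state directory break re-running the plugin after a reboot. The function and error text are illustrative stand-ins, not the plugin's real code:

```python
import os
import tempfile

def reserve(data_dir, ip, container_id):
    """host-local style reservation: one file per IP, named after the IP,
    containing the ID of the container that holds it."""
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, ip)
    if os.path.exists(path):
        with open(path) as f:
            owner = f.read().strip()
        raise RuntimeError(
            f"{ip} has been allocated to {owner}, duplicate allocation is not allowed")
    with open(path, "w") as f:
        f.write(container_id)

# A persistent state dir (like /var/lib/cni) keeps reservations across a
# reboot, even though the tmpfs netns they belonged to is gone. Re-running
# the plugin for the restored allocation then collides with the stale file.
data_dir = tempfile.mkdtemp()
reserve(data_dir, "172.26.64.12", "alloc-before-reboot")
try:
    reserve(data_dir, "172.26.64.12", "alloc-after-reboot")
except RuntimeError as err:
    print(err)
```

With the state directory on tmpfs instead, the reservation files vanish together with the netns files on reboot, so the two can never disagree.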
I'm going to move this back into draft until we've had a bit more time to look into the CNI check command workflow being discussed in #24292 (comment)
When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the Docker daemon.

When we restore an allocation, use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0 to avoid a bug in older plugin versions that would cause the check to fail. If the check fails, fail the restore so that the allocation can be recreated (rather than silently having networking fail).

This should fix the gap left by #24650 for Docker task drivers and any other drivers with the `MustInitiateNetwork` capability.

Fixes: #24292
Ref: #24650
LGTM!
I tested this using an AWS environment with 1 server and 1 client instance. The client instance has CNI v1.3.0 plugins installed.
JobSpec:
job "example" {
  group "webserver" {
    network {
      mode = "bridge"
      port "http" {
        to = 80
      }
    }

    task "python3" {
      driver = "exec"

      config {
        command = "python3"
        args    = ["-m", "http.server", "80"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
Running through the steps using a build from `main` (f4529485563924462dbdccdd1b4cacbd9d68616e): when I rebooted the instance, the allocation failed with the error detailed below:
2024-12-13T10:57:44Z Setup Failure failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.64.12 has been allocated to 0dc525f4-c491-7e25-a67d-0121069ad55e, duplicate allocation is not allowed
2024-12-13T10:57:41Z Failed Restoring Task failed to restore task; will not run until server is contacted
I then tested this patch (9e1a365ae3d22f04c6bb8aa0ce0fee6d1f83ae6f). I performed the same steps as detailed in the PR and performed in the previous test. When the instance was rebooted, the following task events were recorded:
2024-12-13T11:08:42Z Started Task started by client
2024-12-13T11:08:41Z Failed Restoring Task failed to restore task; will not run until server is contacted
The allocation HTTP server is accessible and responds as it did before the reboot.
When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the Docker daemon.

When we restore an allocation, use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0 to avoid a bug in older plugin versions that would cause the check to fail. If the check fails, destroy the network namespace and try to recreate it from scratch once. If that fails in the second pass, fail the restore so that the allocation can be recreated (rather than silently having networking fail).

This should fix the gap left by #24650 for Docker task drivers and any other drivers with the `MustInitiateNetwork` capability.

Fixes: #24292
Ref: #24650
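The check-then-recreate-once restore flow can be sketched as follows. This is an illustrative outline, not Nomad's actual code: `cni_check`, `cni_del`, and `cni_add` stand in for the real CNI operations.

```python
def restore_network(netns, cni_check, cni_del, cni_add):
    """Verify a restored netns with CNI CHECK. On failure, tear it down and
    rebuild it from scratch exactly once; if the second pass also fails,
    fail the restore so the alloc is recreated instead of running with
    broken networking."""
    if cni_check(netns):
        return netns
    cni_del(netns)
    netns = cni_add()
    if not cni_check(netns):
        raise RuntimeError("restore failed: CNI check failed after recreating netns")
    return netns

# Toy doubles: the stale netns fails the check, the rebuilt one passes.
def cni_check(ns):
    return ns == "fresh-netns"

def cni_del(ns):
    pass  # would invoke the CNI DEL command here

def cni_add():
    return "fresh-netns"  # would invoke the CNI ADD command here

print(restore_network("stale-netns", cni_check, cni_del, cni_add))
```

The single-retry bound keeps the failure mode loud: a netns that can't be made to pass the check twice surfaces as a failed restore rather than a silently broken network.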
@tgross this change broke upgrades for existing workloads. The host-local IPAM storage path changed from …
Hey @tgross, we upgraded from 1.8.5 to 1.8.9
Hi @ygersie, I dug into this and apparently #24658 not only didn't land in the same release as 1.9.4, but when we shipped 1.9.5 (which is the equivalent of your 1.8.9+ent), the backport process dropped the ball and so those commits did not get released in 1.8.9+ent (and 1.7.17+ent) either. That's embarrassing! But they're already in the release branch that should be shipping very soon (todayish).
@tgross I may be missing something here, but even a CHECK wouldn't help prevent this issue? The state store of the host-local IPAM plugin has moved without copying current state. So next time an allocation is spun up there's no way for the host-local plugin to determine which IPs are still available, meaning you will likely get overlap. AFAICT the only way to prevent this would be to migrate / copy the state dir to the new location?
Ok, I think I see what you mean. Because we're checking against a configuration previously persisted in the client state DB, we should be restoring the allocations just fine (and this was the case in testing), but future allocations can't take advantage of the state the host-local plugin is writing. I didn't take that into account while testing the original patch, unfortunately. But oddly enough in some quick smoke tests, the old IP does appear to be copied into the new location, and I'm not sure what the mechanism of that is right off the top of my head. I'll investigate that detail and report back.
The host-local CNI plugin writes the handed-out IP addresses to a local state dir. Now that the state dir location has changed paths, the new dir was empty. That means that the host-local plugin started handing out IP addresses which are already in use. The only way to prevent this is by copying/moving the contents of the host-local plugin state dir.
Ok, yeah, I've now confirmed that behavior. My prior test was bogus because the first IP was selected from the beginning of the range. The client is getting CNI to hand out a new IP, but that IP isn't the one that the running network namespace has already been configured with, and can overlap the range. Let me look into what the best way to make that migration work is.
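To make the failure mode concrete: a sequential allocator pointed at an empty (new) state directory cannot see the reservations recorded under the old directory, so it re-hands out addresses already held by running allocations. A sketch using a file-per-IP model (assumed host-local semantics for illustration, not the plugin's actual code):

```python
import os
import tempfile

def next_ip(data_dir, pool):
    """Hand out the first pool address with no reservation file in data_dir."""
    os.makedirs(data_dir, exist_ok=True)
    for ip in pool:
        path = os.path.join(data_dir, ip)
        if not os.path.exists(path):
            open(path, "w").close()  # record the reservation
            return ip
    raise RuntimeError("address pool exhausted")

pool = ["172.26.64.10", "172.26.64.11", "172.26.64.12"]
old_dir = tempfile.mkdtemp()  # stands in for the old persistent state path
new_dir = tempfile.mkdtemp()  # stands in for the new, initially empty path

in_use = next_ip(old_dir, pool)         # existing alloc, recorded under old path
after_upgrade = next_ip(new_dir, pool)  # new state dir is empty, so...
assert in_use == after_upgrade          # ...the same address is handed out twice
```

This is also why the overlap only shows up in the non-reboot upgrade path: after a reboot the old allocations are gone anyway, so empty state is correct.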
In #24650 we switched to using ephemeral state for CNI plugins, so that when a host reboots and we lose all the allocations, we don't end up trying to use IPs we created in network namespaces we just destroyed. Unfortunately upgrade testing missed that in a non-reboot scenario, the existing CNI state was being used by plugins like the ipam plugin to hand out the "next available" IP address. So with no state carried over, we might allocate new addresses that conflict with existing allocations. (This can be avoided by draining the node first.)

As a compatibility shim, copy the old CNI state directory to the new CNI state directory during agent startup, if the new CNI state directory doesn't already exist.

Ref: #24650
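The shim amounts to a guarded recursive copy at agent startup. A hedged sketch with placeholder paths (the actual change is in #25093; this is not that implementation):

```python
import os
import shutil
import tempfile

def migrate_cni_state(old_dir, new_dir):
    """One-time startup migration: copy legacy CNI state to the new location.
    If the new directory already exists (fresh install, or migration already
    done), leave it alone so live state is never clobbered."""
    if os.path.exists(new_dir):
        return False
    if not os.path.isdir(old_dir):
        return False  # nothing to migrate
    shutil.copytree(old_dir, new_dir)
    return True

# Demo with temp dirs standing in for the old and new state locations.
old_dir = tempfile.mkdtemp()
open(os.path.join(old_dir, "172.26.64.10"), "w").close()  # a stale reservation
new_dir = os.path.join(tempfile.mkdtemp(), "cni")         # doesn't exist yet

assert migrate_cni_state(old_dir, new_dir) is True
assert os.path.exists(os.path.join(new_dir, "172.26.64.10"))
assert migrate_cni_state(old_dir, new_dir) is False  # second run is a no-op
```

Guarding on the destination's existence makes the migration idempotent across agent restarts and a no-op for nodes that never ran the old version.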
PR is up here: #25093
Testing & Reproduction steps

- Run a cluster on a set of VMs, with at least one client. This can't be a server+client because we need to reboot the hosts. You should probably set `server.heartbeat_grace = "5m"` to give yourself time to work.
- Run a job with `network.mode = "bridge"`. Wait for it to be healthy.