CNI: use tmpfs location for ipam plugin #24650
Conversation
When a Nomad host reboots, the network namespace files in the tmpfs in `/var/run` are wiped out. So when we restore allocations after a host reboot, we need to be able to restore both the network namespace and the network configuration. But because the netns is newly created and we need to run the CNI plugins again, this creates potential conflicts with the IPAM plugin, which has written state to persistent disk at `/var/lib/cni`. These IPs aren't the ones advertised to Consul, so there's no particular reason to keep them around after a host reboot, because all virtual interfaces need to be recreated too.

Reconfigure the CNI bridge configuration to use `/var/run/cni` as its state directory. We already expect this location to be created by CNI, because the netns files are hard-coded to be created there too in `libcni`.

Note this does not fix the problem described for Docker in #24292, because that appears to be related to the netns itself being restored unexpectedly from Docker's state.

Ref: #24292 (comment)
Ref: https://www.cni.dev/plugins/current/ipam/host-local/#files
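For context, the host-local IPAM plugin's on-disk layout (per the cni.dev reference linked in this PR) is roughly one file per reserved address, named after the IP and recording which container holds it. Here is a minimal Python sketch of that layout, illustrating why stale reservations under a persistent state directory break re-running the plugin after a reboot. The function and error text are illustrative stand-ins, not the plugin's real code:

```python
import os
import tempfile

def reserve(data_dir, ip, container_id):
    """host-local style reservation: one file per IP, named after the IP,
    containing the ID of the container that holds it."""
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, ip)
    if os.path.exists(path):
        with open(path) as f:
            owner = f.read().strip()
        raise RuntimeError(
            f"{ip} has been allocated to {owner}, duplicate allocation is not allowed")
    with open(path, "w") as f:
        f.write(container_id)

# A persistent state dir (like /var/lib/cni) keeps reservations across a
# reboot, even though the tmpfs netns they belonged to is gone. Re-running
# the plugin for the restored allocation then collides with the stale file.
data_dir = tempfile.mkdtemp()
reserve(data_dir, "172.26.64.12", "alloc-before-reboot")
try:
    reserve(data_dir, "172.26.64.12", "alloc-after-reboot")
except RuntimeError as err:
    print(err)
```

With the state directory on tmpfs instead, the reservation files vanish together with the netns files on reboot, so the two can never disagree.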
I'm going to move this back into draft until we've had a bit more time to look into the CNI check command workflow being discussed in #24292 (comment)
When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the Docker daemon.

When we restore an allocation, use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0 to avoid a bug in older plugin versions that would cause the check to fail. If the check fails, fail the restore so that the allocation can be recreated (rather than silently having networking fail).

This should fix the gap left by #24650 for Docker task drivers and any other drivers with the `MustInitiateNetwork` capability.

Fixes: #24292
Ref: #24650
LGTM!
I tested this using an AWS environment with 1 server and 1 client instance. The client instance has CNI v1.3.0 plugins installed.
JobSpec:
job "example" {
  group "webserver" {
    network {
      mode = "bridge"
      port "http" {
        to = 80
      }
    }

    task "python3" {
      driver = "exec"

      config {
        command = "python3"
        args    = ["-m", "http.server", "80"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
Running through the steps using a build from `main` (f4529485563924462dbdccdd1b4cacbd9d68616e): when I rebooted the instance, the allocation failed with the error detailed below:
2024-12-13T10:57:44Z Setup Failure failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.64.12 has been allocated to 0dc525f4-c491-7e25-a67d-0121069ad55e, duplicate allocation is not allowed
2024-12-13T10:57:41Z Failed Restoring Task failed to restore task; will not run until server is contacted
I then tested this patch (9e1a365ae3d22f04c6bb8aa0ce0fee6d1f83ae6f). I performed the same steps as detailed in the PR and performed in the previous test. When the instance was rebooted, the following task events were recorded:
2024-12-13T11:08:42Z Started Task started by client
2024-12-13T11:08:41Z Failed Restoring Task failed to restore task; will not run until server is contacted
The allocation HTTP server is accessible and responds as it did before the reboot.
When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the Docker daemon.

When we restore an allocation, use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0 to avoid a bug in older plugin versions that would cause the check to fail. If the check fails, destroy the network namespace and try to recreate it from scratch once. If that fails in the second pass, fail the restore so that the allocation can be recreated (rather than silently having networking fail).

This should fix the gap left by #24650 for Docker task drivers and any other drivers with the `MustInitiateNetwork` capability.

Fixes: #24292
Ref: #24650
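The check-then-recreate-once restore flow can be sketched as follows. This is an illustrative outline, not Nomad's actual code: `cni_check`, `cni_del`, and `cni_add` stand in for the real CNI operations.

```python
def restore_network(netns, cni_check, cni_del, cni_add):
    """Verify a restored netns with CNI CHECK. On failure, tear it down and
    rebuild it from scratch exactly once; if the second pass also fails,
    fail the restore so the alloc is recreated instead of running with
    broken networking."""
    if cni_check(netns):
        return netns
    cni_del(netns)
    netns = cni_add()
    if not cni_check(netns):
        raise RuntimeError("restore failed: CNI check failed after recreating netns")
    return netns

# Toy doubles: the stale netns fails the check, the rebuilt one passes.
def cni_check(ns):
    return ns == "fresh-netns"

def cni_del(ns):
    pass  # would invoke the CNI DEL command here

def cni_add():
    return "fresh-netns"  # would invoke the CNI ADD command here

print(restore_network("stale-netns", cni_check, cni_del, cni_add))
```

The single-retry bound keeps the failure mode loud: a netns that can't be made to pass the check twice surfaces as a failed restore rather than a silently broken network.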
@tgross this change broke upgrades for existing workloads. The host-local IPAM storage path changed from …
Hey @tgross, we upgraded from 1.8.5 to 1.8.9
Hi @ygersie, I dug into this and apparently #24658 not only didn't land in the same release as 1.9.4, but when we shipped 1.9.5 (which is the equivalent of your 1.8.9+ent), the backport process dropped the ball and so those commits did not get released in 1.8.9+ent (and 1.7.17+ent) either. That's embarrassing! But they're already in the release branch that should be shipping very soon (todayish).
@tgross I may be missing something here, but even a CHECK wouldn't help prevent this issue? The state store of the host-local IPAM plugin has moved without copying current state. So next time an allocation is spun up there's no way for the host-local plugin to determine which IPs are still available, meaning you will likely get overlap. AFAICT the only way to prevent this would be to migrate / copy the state dir to the new location?
Ok, I think I see what you mean. Because we're checking against a configuration previously persisted in the client state DB, we should be restoring the allocations just fine (and this was the case in testing), but future allocations can't take advantage of the state the host-local plugin is writing. I didn't take that into account while testing the original patch, unfortunately. But oddly enough in some quick smoke tests, the old IP does appear to be copied into the new location, and I'm not sure what the mechanism of that is right off the top of my head. I'll investigate that detail and report back.
The host-local CNI plugin writes the handed-out IP addresses to a local state dir. Now that the state dir location has changed paths, the new dir was empty. That means that the host-local plugin started handing out IP addresses which are already in use. The only way to prevent this is by copying/moving the contents of the host-local plugin state dir.
Ok, yeah, I've now confirmed that behavior. My prior test was bogus because the first IP was selected from the beginning of the range. The client is getting CNI to hand out a new IP, but that IP isn't the one that the running network namespace has already been configured with, and can overlap the range. Let me look into what the best way to make that migration work is.
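To make the failure mode concrete: a sequential allocator pointed at an empty (new) state directory cannot see the reservations recorded under the old directory, so it re-hands out addresses already held by running allocations. A sketch using a file-per-IP model (assumed host-local semantics for illustration, not the plugin's actual code):

```python
import os
import tempfile

def next_ip(data_dir, pool):
    """Hand out the first pool address with no reservation file in data_dir."""
    os.makedirs(data_dir, exist_ok=True)
    for ip in pool:
        path = os.path.join(data_dir, ip)
        if not os.path.exists(path):
            open(path, "w").close()  # record the reservation
            return ip
    raise RuntimeError("address pool exhausted")

pool = ["172.26.64.10", "172.26.64.11", "172.26.64.12"]
old_dir = tempfile.mkdtemp()  # stands in for the old persistent state path
new_dir = tempfile.mkdtemp()  # stands in for the new, initially empty path

in_use = next_ip(old_dir, pool)         # existing alloc, recorded under old path
after_upgrade = next_ip(new_dir, pool)  # new state dir is empty, so...
assert in_use == after_upgrade          # ...the same address is handed out twice
```

This is also why the overlap only shows up in the non-reboot upgrade path: after a reboot the old allocations are gone anyway, so empty state is correct.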
In #24650 we switched to using ephemeral state for CNI plugins, so that when a host reboots and we lose all the allocations, we don't end up trying to use IPs we created in network namespaces we just destroyed. Unfortunately upgrade testing missed that in a non-reboot scenario, the existing CNI state was being used by plugins like the ipam plugin to hand out the "next available" IP address. So with no state carried over, we might allocate new addresses that conflict with existing allocations. (This can be avoided by draining the node first.)

As a compatibility shim, copy the old CNI state directory to the new CNI state directory during agent startup, if the new CNI state directory doesn't already exist.

Ref: #24650
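The shim amounts to a guarded recursive copy at agent startup. A hedged sketch with placeholder paths (the actual change is in #25093; this is not that implementation):

```python
import os
import shutil
import tempfile

def migrate_cni_state(old_dir, new_dir):
    """One-time startup migration: copy legacy CNI state to the new location.
    If the new directory already exists (fresh install, or migration already
    done), leave it alone so live state is never clobbered."""
    if os.path.exists(new_dir):
        return False
    if not os.path.isdir(old_dir):
        return False  # nothing to migrate
    shutil.copytree(old_dir, new_dir)
    return True

# Demo with temp dirs standing in for the old and new state locations.
old_dir = tempfile.mkdtemp()
open(os.path.join(old_dir, "172.26.64.10"), "w").close()  # a stale reservation
new_dir = os.path.join(tempfile.mkdtemp(), "cni")         # doesn't exist yet

assert migrate_cni_state(old_dir, new_dir) is True
assert os.path.exists(os.path.join(new_dir, "172.26.64.10"))
assert migrate_cni_state(old_dir, new_dir) is False  # second run is a no-op
```

Guarding on the destination's existence makes the migration idempotent across agent restarts and a no-op for nodes that never ran the old version.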
PR is up here: #25093
Testing & Reproduction steps

- Run a cluster on a set of VMs, with at least one client. This can't be a server+client because we need to reboot the hosts. You should probably set `server.heartbeat_grace = "5m"` to give yourself time to work.
- Run a job with `network.mode = "bridge"`. Wait for it to be healthy.