
Container is not getting additional interface on k8s node reboot #1387

Open
glebkin opened this issue Feb 12, 2025 · 5 comments

Comments

@glebkin

glebkin commented Feb 12, 2025

Hi! We're using Multus with Cilium to supply some of our apps with additional interfaces.

Example of CR:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: nad-dhcp
spec:
  config: '{
             "cniVersion":"0.3.1",
             "name":"nad-dhcp",
             "plugins":[
                {
                   "type":"bridge",
                   "bridge":"br111",
                   "isGateway":true,
                   "ipam":{
                      "type":"static",
                      "addresses":[
                         {
                            "address":"169.254.169.103/24",
                            "gateway":"169.254.169.1"
                         }
                      ],
                      "routes":[
                         {
                            "dst":"169.252.252.0/24",
                            "gw":"169.254.169.1"
                         }
                      ]
                   }
                }
             ]
          }'
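
Our apps request the extra interface through the standard Multus annotation, roughly like this (pod name and image are just placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: sample-app                          # placeholder name
  annotations:
    k8s.v1.cni.cncf.io/networks: nad-dhcp
spec:
  containers:
    - name: app
      image: busybox:1.36                   # placeholder image
      command: ["sleep", "infinity"]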

The problem is that after a node reboot our container does not get the additional interface from Multus; instead we see the following errors in the Multus logs:

time="2025-02-12T16:02:07Z" level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint][404] deleteEndpointNotFound " subsys=cilium-cni
time="2025-02-12T16:02:07Z" level=warning msg="Unable to enter namespace \"\", will not delete interface" error="failed to Statfs \"\": no such file or directory" subsys=cilium-cni

To resolve this issue, we need to restart the pod that uses the Multus additional interface.
Additionally, if we simply kill the container within a pod, it will be restored with the additional interface, but there will be no log messages from Multus. It appears that Cilium is handling all the work.

When using Multus with Flannel, we didn't experience these issues.
Can you please advise on what we might be doing wrong?

@dougbtv
Member

dougbtv commented Feb 13, 2025

Have you checked the /etc/cni/net.d dir to see if cilium deleted the multus config?

@glebkin
Author

glebkin commented Feb 13, 2025

Have you checked the /etc/cni/net.d dir to see if cilium deleted the multus config?

Many thanks for the tip! I found that Multus automatically generates 00-multus.conf in the /etc/cni/net.d dir and automatically deletes it on shutdown.
So each time we restart the node, Multus deletes its config. When kubelet starts, it launches Multus and our app at almost the same time, and our app gets only the primary interface from Cilium, because the Multus config does not exist yet.
So I tried commenting out the piece of code that deletes the config; now if you restart Multus and the app using the additional interface at the same time, it works fine.
But I faced another issue on node reboot: Multus receives SIGTERM multiple times (probably from kubelet) and only becomes ready after ~5 minutes and multiple restarts (5).

@glebkin
Author

glebkin commented Feb 14, 2025

Well, the restarts were due to OOM))
I increased the resources and now everything is fine. I'm just wondering whether it's okay to keep the /etc/cni/net.d/00-multus.conf file even after Multus is deleted.

So what was actually happening in our environment:

  1. The k8s node is up and running, and our apps with Multus network annotations are up and running
  2. The node reboots
  3. The multus-daemon container is stopped and /etc/cni/net.d/00-multus.conf is deleted
  4. The node comes back up
  5. Cilium starts and places its /etc/cni/net.d/05-cilium.conflist file
  6. kubelet triggers PodSandbox re-creation
  7. containerd makes the CNI request to Cilium, as Multus is not ready yet
  8. The primary network interface is created and attached to our pod
  9. Now Multus comes into play and places its /etc/cni/net.d/00-multus.conf file
  10. containerd reacts to the /etc/cni/net.d/ dir change event, but it's too late, as the PodSandbox was already re-created
  11. Our pod never gets its additional interface, because additional interfaces are only attached on PodSandbox re-creation, so we end up in a CrashLoop. Only pod deletion helps.

The same thing happens if you delete the Multus pod and a pod with an additional interface at the same time.
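
As a side note, when the attachment does succeed, the secondary network shows up next to the Cilium entry in the pod's k8s.v1.cni.cncf.io/network-status annotation, roughly like this (values are illustrative and will differ per cluster):

metadata:
  annotations:
    # illustrative values; interface names, IPs and the namespace prefix depend on the cluster
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "cilium",
          "interface": "eth0",
          "ips": ["10.0.1.23"],
          "default": true
      },{
          "name": "default/nad-dhcp",
          "interface": "net1",
          "ips": ["169.254.169.103"]
      }]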

@Jc2k

Jc2k commented Feb 24, 2025

I have also seen pods come up with their multus interfaces missing, also around reboots. I use calico. In my case the pods look like they are running normally (they bind 0.0.0.0, so they don't notice anything is wrong).

It feels like it only started happening in the 0.4 series, but I couldn't say when.

From your description, this is a simple race that could happen to anyone at any time, not just due to OOM. And that tracks with my older nodes being more of a problem.

If upstream doesn't like your patch or doesn't have the resources to help, could we get calico/cilium to write its CNI config to a different folder (e.g. /etc/cni/net.d-real) and then set the Multus config option "confDir" to that as well? Then Multus would be the only CNI in /etc/cni/net.d.
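
Roughly what I have in mind, based on the (possibly outdated) docs; the option names, directory, and file paths below are just a sketch, not something I've verified:

/etc/cni/net.d/00-multus.conf:

{
  "cniVersion": "0.3.1",
  "name": "multus-cni-network",
  "type": "multus",
  "confDir": "/etc/cni/net.d-real",
  "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig",
  "clusterNetwork": "/etc/cni/net.d-real/10-calico.conflist"
}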

@glebkin
Author

glebkin commented Feb 26, 2025

@Jc2k unfortunately, I didn't find the confDir parameter in the current multus master. It seems that the docs are a bit outdated. I can only see cniConfigDir, but it looks like we can't split the Multus config from the other CNI configs:

	watcher, err := newWatcher(config.MultusAutoconfigDir, readinessIndicatorPath)
	if err != nil {
		return nil, err
	}

	if defaultCNIPluginName == fmt.Sprintf("%s/%s", config.MultusAutoconfigDir, multusConfigFileName) {
		return nil, logging.Errorf("cannot specify %s/%s to prevent recursive config load", config.MultusAutoconfigDir, multusConfigFileName)
	}

	configManager := &Manager{
		configWatcher:              watcher,
		multusConfig:               &config,
		multusConfigDir:            config.MultusAutoconfigDir,
		multusConfigFilePath:       filepath.Join(config.CniConfigDir, multusConfigFileName),
		primaryCNIConfigPath:       filepath.Join(config.MultusAutoconfigDir, defaultCNIPluginName),
		readinessIndicatorFilePath: config.ReadinessIndicatorFile,
	}
