Windows container networking is not stable #6093

Closed
jvnvenu opened this issue Jun 3, 2024 · 8 comments

Comments

@jvnvenu

jvnvenu commented Jun 3, 2024

Environmental Info:
RKE2 Version:
1.29.2

Node(s) CPU architecture, OS, and Version:
4 nodes: 3 Linux (Red Hat 8), 1 Windows (Windows Server 2022)

Cluster Configuration:
3 Linux servers, 1 Windows agent

Describe the bug:
I have multiple deployments, each with an init container and normal containers. The init container calls some Kube APIs to get resource information and also updates some annotations on the same deployment. Each deployment also runs a service, and one service can communicate with another. The Linux init containers and the communication between Linux services work fine.
But when a Windows init container calls the Kube API to get resource information, it gets a "connection closed by remote host" error. It then crashes and restarts. Sometimes it starts without error; sometimes it hits this error 2 or 3 times and then starts fine.
Similarly, when a normal Windows container tries to access a Linux service, it gets a timeout. After 1 or 2 retries it succeeds.
It looks to me like RKE2 does not have stable Kubernetes connectivity on Windows nodes.
On Linux we are using Calico.

Steps To Reproduce:
Run an application that updates annotations on its own running deployment from an init container, as sketched below.
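
For illustration, a minimal sketch of such an init-container step using kubectl (the deployment name my-app, the namespace my-namespace, and the annotation key are hypothetical placeholders; the pod's service account would also need RBAC permission to get and patch deployments):

# Hypothetical repro, run from inside the init container:
# read resource information from the API server ...
kubectl get deployment my-app -n my-namespace -o jsonpath='{.metadata.annotations}'
# ... then update an annotation on the same deployment
kubectl annotate deployment my-app -n my-namespace example.com/last-init="$(date -u +%FT%TZ)" --overwrite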

Expected behavior:
Should not see any communication failures in Windows containers.

Actual behavior:
Lots of communication failures in Windows containers.

Additional context / logs:

@manuelbuil
Contributor

Do you have network policies? Check with kubectl get netpol -A. There is a known issue in the Windows tool that Calico uses to program network policies on Windows: https://docs.tigera.io/calico/latest/getting-started/kubernetes/windows-calico/limitations#pod-to-pod-connections-are-dropped-with-tcp-reset-packets

@jvnvenu
Author

jvnvenu commented Jun 3, 2024

We don't have any customized policies. I ran the command and the results are below. These look like defaults to me. Please share your thoughts.

NAMESPACE                         NAME                                   POD-SELECTOR
cattle-fleet-local-system         default-allow-all                      <none>
cattle-provisioning-capi-system   default-allow-all                      <none>
kube-system                       rancher-monitoring-coredns-allow-all   k8s-app=kube-dns

@manuelbuil
Contributor

manuelbuil commented Jun 3, 2024

It does not matter whether they are default. Any creation or deletion of a pod matching those policies will reset the ACL object in Windows. As a consequence, any TCP session established by any pod included in that ACL will be reset (TCP-RST).

@manuelbuil
Contributor

Give cni: flannel a try and see if you are still experiencing the same instability; that will help you understand whether this is the issue. Flannel does not implement network policies, so this problem does not affect it. A config sketch is below.
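
A minimal sketch of that switch, assuming the default RKE2 config path and a fresh test cluster (changing the CNI of an existing cluster in place is not supported):

# On each RKE2 server node of a fresh test cluster:
echo 'cni: flannel' | sudo tee -a /etc/rancher/rke2/config.yaml
sudo systemctl restart rke2-server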

@jvnvenu
Author

jvnvenu commented Jun 3, 2024

But the above-mentioned policies are not selecting the Windows pods. Each policy selects one pod, and those pods are running on Linux nodes.
Can flannel be used in a production setup instead of Calico with Linux and Windows nodes at the same time? Because per the RKE2 documentation:

> Windows Air-Gap Install
> Windows Support is currently Experimental as of v1.21.3+rke2r1. Windows Support requires choosing Calico as the CNI for the RKE2 cluster.

@brandond
Member

brandond commented Jun 3, 2024

Note that the docs say:

> Some ingress or egress policy that applies to a pod contains selectors and the set of endpoints that those selectors match changes.

Note that the endpoint that the Windows pod communicates with does not ALSO have to be on Windows; the Windows pod may experience a disruption if a pod on a completely different node is recreated and the endpoint IP changes. This is a defect in the Windows HNS subsystem, not in canal, containerd, Kubernetes, or RKE2.

Please try flannel as @manuelbuil suggested.

@manuelbuil
Contributor

> But the above-mentioned policies are not selecting the Windows pods. Each policy selects one pod, and those pods are running on Linux nodes. Can flannel be used in a production setup instead of Calico with Linux and Windows nodes at the same time? Because per the RKE2 documentation:
>
> Windows Air-Gap Install. Windows Support is currently Experimental as of v1.21.3+rke2r1. Windows Support requires choosing Calico as the CNI for the RKE2 cluster.

Due to the network policy problem, we introduced Flannel recently. I missed that line in the docs! Thanks, I'll change it :)

Yes, Flannel can be used in production.

@manuelbuil
Contributor

rancher/rke2-docs#220

@rancher rancher locked and limited conversation to collaborators Jun 4, 2024
@brandond brandond converted this issue into discussion #6100 Jun 4, 2024
