Windows container networking is not stable #6093

Closed
jvnvenu opened this issue Jun 3, 2024 · 8 comments

Comments

@jvnvenu

jvnvenu commented Jun 3, 2024

Environmental Info:
RKE2 Version:
1.29.2

Node(s) CPU architecture, OS, and Version:
4 nodes: 3 Linux (Red Hat 8), 1 Windows (Windows Server 2022)

Cluster Configuration:
3 Linux servers, 1 Windows agent

Describe the bug:
I have multiple deployments, each with an init container and normal containers. The init container calls some Kube APIs to get resource information and also updates some annotations on the same deployment. Each deployment also runs a service, and one service can communicate with another. The Linux init containers and the communication between Linux services work fine.
But when a Windows init container calls the Kube API to get resource information, it gets a "connection closed by remote host" error. It then crashes and restarts. Sometimes it starts without error; sometimes it hits this error 2 or 3 times and then starts fine.
Similarly, when a normal Windows container tries to access a Linux service, it gets a timeout. After 1 or 2 retries it succeeds.
It looks to me like RKE2 does not have stable Kubernetes connectivity on Windows nodes.
On Linux we are using Calico.

Steps To Reproduce:
Run an application that updates annotations on its own running deployment from an init container, as sketched below.
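
For illustration, a minimal sketch of such an init-container step using kubectl (the deployment name my-app, the namespace my-namespace, and the annotation key are hypothetical placeholders; the pod's service account would also need RBAC permission to get and patch deployments):

# Hypothetical repro, run from inside the init container:
# read resource information from the API server ...
kubectl get deployment my-app -n my-namespace -o jsonpath='{.metadata.annotations}'
# ... then update an annotation on the same deployment
kubectl annotate deployment my-app -n my-namespace example.com/last-init="$(date -u +%FT%TZ)" --overwrite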

Expected behavior:
Should not see any communication failures in Windows containers.

Actual behavior:
Lots of communication failures in Windows containers.

Additional context / logs:

@manuelbuil
Contributor

Do you have network policies? Check with kubectl get netpol -A. There is a known issue in the Windows tool that Calico uses to program network policies on Windows: https://docs.tigera.io/calico/latest/getting-started/kubernetes/windows-calico/limitations#pod-to-pod-connections-are-dropped-with-tcp-reset-packets

@jvnvenu
Author

jvnvenu commented Jun 3, 2024

We don't have any customized policies. I ran the command and the results are below. These look like defaults to me. Please share your thoughts.

NAMESPACE                         NAME                                   POD-SELECTOR
cattle-fleet-local-system         default-allow-all                      <none>
cattle-provisioning-capi-system   default-allow-all                      <none>
kube-system                       rancher-monitoring-coredns-allow-all   k8s-app=kube-dns

@manuelbuil
Contributor

manuelbuil commented Jun 3, 2024

It does not matter whether they are default. Any creation or deletion of a pod matching those policies will reset the ACL object in Windows. As a consequence, any TCP session established by any pod included in that ACL will be reset (TCP-RST).

@manuelbuil
Contributor

Give cni: flannel a try and see if you are still experiencing the same instability; that will help you understand whether this is the issue. Flannel does not implement network policies, so this problem does not affect it. A config sketch is below.
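
A minimal sketch of that switch, assuming the default RKE2 config path and a fresh test cluster (changing the CNI of an existing cluster in place is not supported):

# On each RKE2 server node of a fresh test cluster:
echo 'cni: flannel' | sudo tee -a /etc/rancher/rke2/config.yaml
sudo systemctl restart rke2-server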

@jvnvenu
Author

jvnvenu commented Jun 3, 2024

But the above-mentioned policies are not selecting the Windows pods. Each policy selects one pod, and those pods are running on Linux nodes.
Can flannel be used in a production setup instead of Calico with Linux and Windows nodes at the same time? Because per the RKE2 documentation:

> Windows Air-Gap Install
> Windows Support is currently Experimental as of v1.21.3+rke2r1. Windows Support requires choosing Calico as the CNI for the RKE2 cluster.

@brandond
Member

brandond commented Jun 3, 2024

Note that the docs say:

> Some ingress or egress policy that applies to a pod contains selectors and the set of endpoints that those selectors match changes.

Note that the endpoint that the Windows pod communicates with does not ALSO have to be on Windows; the Windows pod may experience a disruption if a pod on a completely different node is recreated and the endpoint IP changes. This is a defect in the Windows HNS subsystem, not in canal, containerd, Kubernetes, or RKE2.

Please try flannel as @manuelbuil suggested.

@manuelbuil
Contributor

> But the above-mentioned policies are not selecting the Windows pods. Each policy selects one pod, and those pods are running on Linux nodes. Can flannel be used in a production setup instead of Calico with Linux and Windows nodes at the same time? Because per the RKE2 documentation:
>
> Windows Air-Gap Install. Windows Support is currently Experimental as of v1.21.3+rke2r1. Windows Support requires choosing Calico as the CNI for the RKE2 cluster.

Due to the network policy problem, we introduced Flannel recently. I missed that line in the docs! Thanks, I'll change it :)

Yes, Flannel can be used in production.

@manuelbuil
Contributor

rancher/rke2-docs#220

@rancher rancher locked and limited conversation to collaborators Jun 4, 2024
@brandond brandond converted this issue into discussion #6100 Jun 4, 2024
