Pods frequently lose the ability to connect to pods running on any other k8s node until weave-net is bounced #3641
Comments
In case it can help, I am sharing the temporary mitigation we put in place to detect and bounce weave-net quickly when the situation occurs, until the root cause is identified. This assumes:
Basically we added a hackish liveness probe to the weave DaemonSet, as follows: weave-net.yml (click to expand)
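As a rough illustration only (the real probe is in the collapsed weave-net.yml above), here is a minimal sketch of the kind of exec liveness check that can catch lost inter-pod connectivity; the kube-dns ClusterIP 10.96.0.10 and the availability of nslookup/timeout in the weave image are assumptions:

```sh
#!/bin/sh
# Hypothetical connectivity check wired in as an exec livenessProbe on the
# weave container: resolve a cluster-internal name against the kube-dns
# ClusterIP. coredns usually runs in pods on other nodes, so a broken overlay
# makes the lookup fail and the kubelet restarts the weave pod.
# 10.96.0.10 is kubeadm's default kube-dns Service IP (assumption).
set -e
timeout 3 nslookup kubernetes.default.svc.cluster.local 10.96.0.10 >/dev/null
```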
A few errors in the logs that caught my attention and that I am trying to learn more about (click to expand)
@pfcarrier thanks for the detailed bug report. However, it does look like that particular VM lost complete network connectivity to the rest of the nodes, as indicated in the weave-net pod logs. If you look at the timestamps:
As you can see in the logs, connections were re-established within milliseconds.
Clearly there was a temporary network glitch for that VM, which might have been caused by the VMware environment or the network.
If there are underlay network connectivity issues, they will impact overlay connectivity; losing pod-to-pod connectivity is expected in that case. But do you have any insight into how long it took for pod-to-pod connectivity to recover? There are a couple of weave-net issues that are visible in the logs.
Would you be able to enable debug logging for weave-net and share the logs when you notice this behaviour again?
I will give it a try and see if I can reproduce similar symptoms.
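For reference, one way to turn on debug logging in the Kubernetes addon, assuming the weave-kube launcher honours the EXTRA_ARGS environment variable on the weave container (an assumption, not confirmed in this thread); the DaemonSet rolls the pods and they come back with the extra flag applied:

```sh
# Pass --log-level=debug to the weave router via the container environment.
kubectl -n kube-system set env daemonset/weave-net -c weave EXTRA_ARGS=--log-level=debug
```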
Thanks so much for looking into it @murali-reddy, I will definitely enable debug logging. It will take me a couple of days to wait for a quiet period so I can remove the workaround and capture all the logs without a restart occurring.
This is the thing that puzzles me, for as long as this node was on. I will try to confirm or rule that out by setting the environment variable; I took the general idea from #2354. I will further try to pinpoint what happens to the packets when pod-to-pod connectivity is lost, drawing inspiration from that thread.
I wholeheartedly agree that everything in the logs points toward that.
Yes, setting the environment variable is the only way.
Any errors in the logs when sleeve mode is active?
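For completeness, forcing sleeve is usually done by disabling fastdp through an environment variable on the weave container; WEAVE_NO_FASTDP is the switch documented for weave, though whether it is the exact variable discussed above is an assumption:

```sh
# Disable fastdp so all connections fall back to sleeve; the DaemonSet
# restarts the pods with the new environment.
kubectl -n kube-system set env daemonset/weave-net -c weave WEAVE_NO_FASTDP=1
```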
This did not reproduce the issue for me. Once connections are dropped and re-established between the peers, they all end up using
Thanks so much for your assistance @murali-reddy, apologies for the delay.
Nothing that I can spot. As far as I can tell, the logs indicate that everything is going fine. E.g. in the original pod log for weave-net-s4vk9, all the events after 4:23 up to the end of the log (about 8 hours) cover the time period where that pod was operating in that mode.

Workload on the cluster still prevents me from removing the workaround that restarts the weave-net pod about 30 sec after inter-pod connectivity is lost. I am still waiting for a window to do that; in the meantime I rolled out the activation of debug mode and captured the logs up to the point of the restarts.

With debug enabled I tracked 4 events of a pod automatically restarting due to loss of inter-pod communication; for each event I included the logs since 4am that morning as well as the logs after the restart was triggered. Finally, I included a 5th pod that didn't restart so we can observe its view of the events. One can search the log files for a marker string. I also include the relevant log output for each node; alas, all we can witness in them is the liveness probe detecting the loss of inter-pod communication and triggering a restart.

Global

kubectl get pod -owide -nkube-system | grep weave-net (click to expand)
kubectl get nodes (click to expand)
Logs

Debug-enabled logs for the weave-net pods:
Dmesg and other command output for each node:

kube-node-09 (click to expand)
kube-node-14 (click to expand)
kube-node-33 (click to expand)
kube-node-39 (click to expand)
kube-node-42 (click to expand)
This might be the same issue as #3619.
@pfcarrier thanks for the logs. I don't see anything particularly suspicious; for e.g. it appears so from the snips below.
snip from
snip from
snip from
After the connections are re-established, I do see the activity of learning the MAC addresses and configuring the ODP to forward packets. Perhaps when this happens, you can try to trace the pod-to-pod traffic and see where it's getting dropped.
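A possible tracing recipe for the next occurrence, sketched under the assumption that the bridge is the default weave interface and that the VMware uplink is called ens192 (adjust the interface names and the remote pod IP to the environment):

```sh
# 1. Does traffic from a local pod to a remote pod reach the weave bridge?
tcpdump -ni weave icmp and host <remote-pod-ip>

# 2. Does the encapsulated overlay traffic actually leave the node?
#    fastdp uses UDP 6784, sleeve uses UDP 6783.
tcpdump -ni ens192 'udp port 6784 or udp port 6783'
```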
I'm seeing the exact same type of issue. Our environments are almost exactly the same: I also lose all inter-pod communication on random VMs in a VMware environment. Thank you for the workaround of adding connectivity checks to the health checks for the pods @pfcarrier, great idea. Once this happens again I will post info as well. In my case it takes a few hours, sometimes a day or two, to happen.
Thanks for the report @WMostert1. Glad to hear the workaround helped :-)

Digging deeper into this issue, we ended up tracking the position of each VM on its given ESX host in the VMware cluster, and cross-matching it with the times of the events described above.

This highlighted that 25% of those events matched the exact moment a vMotion occurred, confirming @murali-reddy's early analysis that something in the VMware environment or network was introducing a glitch into the equation.

The remaining 75% of cases is a bit of guesswork, but we believe they are ultimately linked to VMs having a very high (+15%) VMWAIT% (as reported by esxtop). We theorized that this tug of war between VMs fighting for access to the CPU was the reason for the high amount of packet drops, which potentially was a source of issues for any UDP traffic, which weave happens to use. My understanding is that in VMware a 4-vCPU VM needs to reserve all four vCPUs before it starts using any of them, which can become an issue if the VM is co-located on a host with other VMs that have both a very high number of vCPUs and high CPU usage.

At the time of our observations our Kubernetes cluster was composed of VMs sporting 16 vCPUs each; on the ESX cluster there were also a few whale VMs at 50 vCPUs as well as many behemoths at 24 vCPUs. Given that each ESX host had only 32 real cores, this was creating a lot of contention. Sadly, by the time we proceeded to right-size the whales and behemoths, the Kubernetes cluster running weave had been decommissioned.

While VMware is not my area of expertise and I cannot confirm 100% that this would have fixed our issue, my hunch would be to recommend the following to VMware users running into the same scenario:
I will close the issue since I lost the ability to iterate on it, the cluster no longer being in existence. As a last report I would like to share that last week we observed, on that same ESX cluster, an issue where a dozen non-Kubernetes VMs all lost network access for a few minutes at the same time. This happened after the VM right-sizing was complete, so my hypothesis in the comment just above, potentially linking the issue to CPU contention, is likely incorrect or at best was only a factor. Hope it helps.
Thanks a lot, dear @pfcarrier, for the detailed comments.
We have exactly the same issue inside a single ESXi VMware host containing 3 Kubernetes node VMs, where the VMs were allocated more CPUs than the host has.
What you expected to happen?
Inter-node pod communication to be uninterrupted and always available.
What happened?
Multiple times per day we observe a random VM in the cluster where all of its pods become unable to talk to pods located on any node other than itself. Other nodes also lose the ability to connect to the impacted pods running on that node, be it through direct communication with the pod IP or via a k8s service.
This makes for an interesting failure mode, since coredns runs within a set of pods that are likely on a different node, so name resolution is lost. Kubernetes services also keep sending traffic to those unreachable pods, since the liveness/readiness checks run locally by the kubelet still return ok.
During those events we observe weave switch from fastdp to sleeve. The output seems to indicate it manages to re-establish all connections within seconds; however, inter-node pod communication is lost in the process. We tried to drain the node and let it run for multiple hours; it never recovered by itself. Deleting the weave-net pod so it gets recreated restores the situation.
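For reference, bouncing here just means deleting the weave-net pod on the affected node so the DaemonSet recreates it; a sketch, with the label and node name taken from the outputs further below:

```sh
# Delete the weave-net pod running on the affected node; the DaemonSet
# controller recreates it and connectivity is restored.
POD=$(kubectl -n kube-system get pod -l name=weave-net \
      --field-selector spec.nodeName=kube-node-13 -o name)
kubectl -n kube-system delete "$POD"
```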
How to reproduce it?
So far I have not been able to reproduce this situation in other environments. The closest I got was by running iptables commands to drop UDP traffic in/out for port 6784.
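A sketch of that reproduction attempt, assuming fastdp is in use (weave's fastdp data path runs over UDP 6784):

```sh
# Drop weave's fastdp traffic on one node to simulate the overlay breaking.
iptables -I INPUT  -p udp --dport 6784 -j DROP
iptables -I OUTPUT -p udp --dport 6784 -j DROP

# Remove the rules afterwards.
iptables -D INPUT  -p udp --dport 6784 -j DROP
iptables -D OUTPUT -p udp --dport 6784 -j DROP
```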
Reaching out for guidance on pinpointing things, understanding the weave logs, and confirming whether or not this is an expected failure mode.
Anything else we need to know?
This is a medium-size cluster of 42 nodes installed with kubeadm, running on-premises in VMware. This started occurring sporadically 3 weeks ago and, as time passes, the frequency of the event increases. At this point it happens about 5 times per day. Previously this cluster had been running without any issue for at least 1 year.
Versions:
Logs:
Please find below links to gists and collapsible sections (the triangle is clickable) with all the logs and command output I gathered. The event occurs at 4:22.
Weave logs :
kubectl get nodes -owide (click to expand)
kubectl get pod -owide -lname=weave-net -nkube-system (click to expand)
kubectl get configmap weave-net -oyaml (click to expand)
I ran the following weave commands on both weave-net pods above (via kubectl exec into each pod; a sketch follows the list):
./weave --local status
./weave --local status connections
./weave --local status ipam
./weave --local report
./weave --local status peers
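These were run from inside each weave pod; a sketch of the invocation, with a placeholder pod name and assuming the weave script lives at its usual /home/weave/weave path inside the container:

```sh
kubectl -n kube-system exec <weave-net-pod> -c weave -- /home/weave/weave --local status connections
```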
I bundled the weave command output just above per pod :
netstat -i on kube-node-13 (click to expand)
netstat -i on kube-node-15 (click to expand)
dmesg output on kube-node-13
dmesg output on kube-node-15 (click to expand)
iptables rules and ifconfig output for both nodes
Weave-net had been bounced to restore service on those pods/nodes, but for completeness here is the output of a few more commands:
on kube-node-13 (click to expand)
on kube-node-15 (click to expand)
I looked at the output of those commands; 8 minutes after the beginning of the issue I see the kubelet attempting to tear down some volumes and containers running on the nodes, and matching activity in docker.service. Seems like "normal" noise.
$ journalctl -u docker.service --no-pager
$ journalctl -u kubelet --no-pager
Things that we tried