
Pods frequently lose the ability to connect to pods running on any other k8s node until weave-net is bounced #3641

Closed
pfcarrier opened this issue May 9, 2019 · 14 comments


@pfcarrier

pfcarrier commented May 9, 2019

What you expected to happen?

Inter-node pod communication to be uninterrupted and always available.

What happened?

Multiple times per day we observe a random VM in the cluster where all of its pods become unable to talk to pods located on any node other than itself. Other nodes also lose the ability to connect to the impacted pods running on that node, whether through direct communication with a pod IP or via a k8s service.

This makes for an interesting failure mode: coredns runs in a set of pods that are likely on different nodes, so name resolution is lost. Kubernetes services also keep sending traffic to the unreachable pods, since the liveness/readiness checks run locally by the kubelet still return ok.

During these events we observe weave switch from fastdp to sleeve. The output seems to indicate it manages to re-establish all connections within seconds, however inter-node pod communication is lost in the process. We tried draining the node and letting it run for multiple hours; it never recovers by itself. Deleting the weave-net pod so it gets recreated restores the situation.
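
In practice the bounce is just deleting the weave-net daemonset pod on the affected node and letting it be recreated, e.g. (the pod name differs per node):

# locate the weave-net pod on the affected node, then delete it;
# the daemonset controller recreates it within seconds
$ kubectl -n kube-system get pod -l name=weave-net -o wide | grep kube-node-13
$ kubectl -n kube-system delete pod weave-net-s4vk9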

How to reproduce it?

So far I have not been able to reproduce this situation in other environments. The closest I got was by running iptables commands to drop UDP traffic in/out on port 6784.
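
Something along these lines (the exact rules may have differed; weave's fastdp data path uses UDP port 6784):

# drop fastdp traffic in both directions on a single node
$ iptables -I INPUT -p udp --dport 6784 -j DROP
$ iptables -I OUTPUT -p udp --dport 6784 -j DROP

# undo the experiment afterwards
$ iptables -D INPUT -p udp --dport 6784 -j DROP
$ iptables -D OUTPUT -p udp --dport 6784 -j DROP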

Reaching out for guidance on pinpointing the cause, understanding the weave logs, and confirming or refuting whether this is an expected failure mode.

Anything else we need to know?

This is a medium-sized cluster of 42 nodes installed with kubeadm, running on-premises on VMware. This started occurring sporadically 3 weeks ago and the frequency has increased as time passes. At this point it happens about 5 times per day. Previously this cluster ran without any issue for at least 1 year.

Versions:

$ ./weave --local version
weave 2.5.1

$ docker version
Client:
 Version:      17.03.3-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   e19b718
 Built:        Thu Aug 30 01:04:10 2018
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.3-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   e19b718
 Built:        Thu Aug 30 01:04:10 2018
 OS/Arch:      linux/amd64
 Experimental: false

$ uname -a
Linux kube-node-04 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:26:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Logs:

Please find below links to gists and collapsed sections with all the logs and command output I gathered. The event occurred at 4:22.

Weave logs:

kubectl get nodes -owide
NAME                   STATUS                     ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION   CONTAINER-RUNTIME
kube-node-01   Ready                      master   8d    v1.13.5   192.168.62.191   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-02   Ready                      <none>   8d    v1.13.5   192.168.62.192   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-03   Ready                      <none>   8d    v1.13.5   192.168.62.193   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-04   Ready                      <none>   8d    v1.13.5   192.168.62.20    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-05   Ready                      <none>   8d    v1.13.5   192.168.62.21    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-06   Ready                      <none>   8d    v1.13.5   192.168.62.22    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-07   Ready                      <none>   8d    v1.13.5   192.168.62.23    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-08   Ready                      <none>   8d    v1.13.5   192.168.62.24    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-09   Ready                      <none>   8d    v1.13.5   192.168.62.25    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-10   Ready                      <none>   8d    v1.13.5   192.168.62.26    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-11   Ready                      <none>   8d    v1.13.5   192.168.62.27    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-12   Ready                      <none>   8d    v1.13.5   192.168.62.28    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-13   Ready,SchedulingDisabled   <none>   8d    v1.13.5   192.168.62.29    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-14   Ready                      <none>   8d    v1.13.5   192.168.62.61    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-15   Ready                      <none>   8d    v1.13.5   192.168.62.31    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-16   Ready                      <none>   8d    v1.13.5   192.168.62.32    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-17   Ready                      <none>   8d    v1.13.5   192.168.62.33    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-18   Ready                      <none>   8d    v1.13.5   192.168.62.34    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-19   Ready                      <none>   8d    v1.13.5   192.168.62.35    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-20   Ready                      <none>   8d    v1.13.5   192.168.62.36    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-21   Ready                      <none>   8d    v1.13.5   192.168.62.37    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-22   Ready                      <none>   8d    v1.13.5   192.168.62.38    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-23   Ready                      <none>   8d    v1.13.5   192.168.62.39    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-24   Ready                      <none>   8d    v1.13.5   192.168.62.62    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-25   Ready                      <none>   8d    v1.13.5   192.168.62.41    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-26   Ready                      <none>   8d    v1.13.5   192.168.62.42    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-27   Ready                      <none>   8d    v1.13.5   192.168.62.43    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-28   Ready                      <none>   8d    v1.13.5   192.168.62.44    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-29   Ready                      <none>   8d    v1.13.5   192.168.62.45    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-30   Ready                      <none>   8d    v1.13.5   192.168.62.46    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-31   Ready                      <none>   8d    v1.13.5   192.168.62.47    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-32   Ready                      <none>   8d    v1.13.5   192.168.62.48    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-33   Ready                      <none>   8d    v1.13.5   192.168.62.49    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-34   Ready                      <none>   8d    v1.13.5   192.168.62.63    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-35   Ready                      <none>   8d    v1.13.5   192.168.62.51    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-36   Ready                      <none>   8d    v1.13.5   192.168.62.52    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-37   Ready                      <none>   8d    v1.13.5   192.168.62.53    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-38   Ready                      <none>   8d    v1.13.5   192.168.62.54    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-39   Ready                      <none>   8d    v1.13.5   192.168.62.55    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-40   Ready                      <none>   8d    v1.13.5   192.168.62.56    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-41   Ready                      <none>   8d    v1.13.5   192.168.62.57    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-42   Ready                      <none>   8d    v1.13.5   192.168.62.58    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3

* At the time of the issue kube-node-13 was not SchedulingDisabled; we drained it to allow us time to experiment without impacting services.
kubectl get pod -owide -lname=weave-net -nkube-system
NAME              READY   STATUS    RESTARTS   AGE   IP               NODE                   NOMINATED NODE   READINESS GATES
weave-net-2mbbc   2/2     Running   2          8d    192.168.62.41    kube-node-25   <none>           <none>
weave-net-2wsnw   2/2     Running   1          8d    192.168.62.48    kube-node-32   <none>           <none>
weave-net-524wg   2/2     Running   1          8d    192.168.62.36    kube-node-20   <none>           <none>
weave-net-6psxw   2/2     Running   2          8d    192.168.62.22    kube-node-06   <none>           <none>
weave-net-869q6   2/2     Running   0          39h   192.168.62.37    kube-node-21   <none>           <none>
weave-net-995q8   2/2     Running   1          8d    192.168.62.56    kube-node-40   <none>           <none>
weave-net-b2gqv   2/2     Running   0          8d    192.168.62.21    kube-node-05   <none>           <none>
weave-net-bgk8b   2/2     Running   3          8d    192.168.62.49    kube-node-33   <none>           <none>
weave-net-bp72h   2/2     Running   0          8d    192.168.62.31    kube-node-15   <none>           <none>
weave-net-dvgfs   2/2     Running   0          8d    192.168.62.42    kube-node-26   <none>           <none>
weave-net-f547b   2/2     Running   0          8d    192.168.62.23    kube-node-07   <none>           <none>
weave-net-f77hj   2/2     Running   0          8d    192.168.62.193   kube-node-03   <none>           <none>
weave-net-f8jr4   2/2     Running   0          8d    192.168.62.191   kube-node-01   <none>           <none>
weave-net-fx4r7   2/2     Running   0          3d    192.168.62.51    kube-node-35   <none>           <none>
weave-net-fxzjb   2/2     Running   1          8d    192.168.62.55    kube-node-39   <none>           <none>
weave-net-g57kk   2/2     Running   0          8d    192.168.62.32    kube-node-16   <none>           <none>
weave-net-gfshx   2/2     Running   1          8d    192.168.62.38    kube-node-22   <none>           <none>
weave-net-hd66z   2/2     Running   0          8d    192.168.62.26    kube-node-10   <none>           <none>
weave-net-jqmm2   2/2     Running   1          8d    192.168.62.192   kube-node-02   <none>           <none>
weave-net-kx7t5   2/2     Running   0          39h   192.168.62.52    kube-node-36   <none>           <none>
weave-net-l89cw   2/2     Running   0          8d    192.168.62.44    kube-node-28   <none>           <none>
weave-net-lsqmn   2/2     Running   1          8d    192.168.62.33    kube-node-17   <none>           <none>
weave-net-m2k6b   2/2     Running   1          8d    192.168.62.58    kube-node-42   <none>           <none>
weave-net-mh6mn   2/2     Running   1          8d    192.168.62.61    kube-node-14   <none>           <none>
weave-net-n5z4d   2/2     Running   0          3d    192.168.62.47    kube-node-31   <none>           <none>
weave-net-p94q2   2/2     Running   0          8d    192.168.62.46    kube-node-30   <none>           <none>
weave-net-phlnj   2/2     Running   1          8d    192.168.62.53    kube-node-37   <none>           <none>
weave-net-r99g5   2/2     Running   0          8d    192.168.62.63    kube-node-34   <none>           <none>
weave-net-rpfmh   2/2     Running   0          8d    192.168.62.25    kube-node-09   <none>           <none>
weave-net-s4vk9   2/2     Running   2          8d    192.168.62.29    kube-node-13   <none>           <none>
weave-net-s6dk5   2/2     Running   2          8d    192.168.62.20    kube-node-04   <none>           <none>
weave-net-s8pdp   2/2     Running   1          8d    192.168.62.43    kube-node-27   <none>           <none>
weave-net-sbq9v   2/2     Running   2          8d    192.168.62.28    kube-node-12   <none>           <none>
weave-net-sstpk   2/2     Running   0          8d    192.168.62.54    kube-node-38   <none>           <none>
weave-net-tcwhp   2/2     Running   1          8d    192.168.62.35    kube-node-19   <none>           <none>
weave-net-tf6sw   2/2     Running   0          47h   192.168.62.34    kube-node-18   <none>           <none>
weave-net-trl8d   2/2     Running   0          8d    192.168.62.24    kube-node-08   <none>           <none>
weave-net-wjgws   2/2     Running   1          8d    192.168.62.39    kube-node-23   <none>           <none>
weave-net-wnmvp   2/2     Running   2          8d    192.168.62.27    kube-node-11   <none>           <none>
weave-net-xqrjm   2/2     Running   2          8d    192.168.62.45    kube-node-29   <none>           <none>
weave-net-zn66g   2/2     Running   1          8d    192.168.62.62    kube-node-24   <none>           <none>
weave-net-zzns8   2/2     Running   1          8d    192.168.62.57    kube-node-41   <none>           <none>
kubectl get configmap weave-net -oyaml
apiVersion: v1
kind: ConfigMap
metadata:
annotations:
  kube-peers.weave.works/peers: '{"Peers":[{"PeerName":"ce:d4:17:3b:00:db","NodeName":"kube-node-01"},{"PeerName":"1e:e4:18:98:ce:c4","NodeName":"kube-node-31"},{"PeerName":"da:ef:6b:19:cf:fe","NodeName":"kube-node-34"},{"PeerName":"de:30:16:77:64:d8","NodeName":"kube-node-30"},{"PeerName":"7e:ed:9d:df:ce:a6","NodeName":"kube-node-26"},{"PeerName":"ee:99:1b:17:84:cb","NodeName":"kube-node-42"},{"PeerName":"46:e6:49:7c:6b:88","NodeName":"kube-node-32"},{"PeerName":"96:13:be:0c:ac:e5","NodeName":"kube-node-33"},{"PeerName":"b2:24:32:da:61:d3","NodeName":"kube-node-40"},{"PeerName":"d6:38:1b:fb:97:a4","NodeName":"kube-node-08"},{"PeerName":"1a:a4:36:e2:2c:44","NodeName":"kube-node-11"},{"PeerName":"82:14:63:36:a5:de","NodeName":"kube-node-03"},{"PeerName":"8a:45:32:0a:c0:46","NodeName":"kube-node-02"},{"PeerName":"a2:ea:5b:4a:72:1d","NodeName":"kube-node-10"},{"PeerName":"be:f7:87:9e:3c:40","NodeName":"kube-node-07"},{"PeerName":"62:b7:f9:fe:10:ab","NodeName":"kube-node-41"},{"PeerName":"ba:24:c0:0b:ba:bd","NodeName":"kube-node-18"},{"PeerName":"22:41:be:0d:19:ea","NodeName":"kube-node-15"},{"PeerName":"7e:88:bb:7c:b9:cc","NodeName":"kube-node-16"},{"PeerName":"6a:1a:a0:47:c1:2a","NodeName":"kube-node-38"},{"PeerName":"5e:3a:1e:47:8c:3d","NodeName":"kube-node-29"},{"PeerName":"ea:11:40:96:af:24","NodeName":"kube-node-28"},{"PeerName":"ca:21:fb:33:fa:26","NodeName":"kube-node-19"},{"PeerName":"1e:69:6c:54:6d:af","NodeName":"kube-node-05"},{"PeerName":"4e:bd:fe:81:d4:73","NodeName":"kube-node-09"},{"PeerName":"fe:bd:2d:6d:f4:2d","NodeName":"kube-node-17"},{"PeerName":"86:96:14:d4:07:9b","NodeName":"kube-node-35"},{"PeerName":"fe:1f:d3:5f:49:fd","NodeName":"kube-node-25"},{"PeerName":"56:e4:c4:a0:8b:66","NodeName":"kube-node-36"},{"PeerName":"06:07:7f:a1:ae:96","NodeName":"kube-node-24"},{"PeerName":"8e:f2:46:a8:e4:3a","NodeName":"kube-node-21"},{"PeerName":"0e:ff:07:bc:7f:1f","NodeName":"kube-node-22"},{"PeerName":"2e:4e:bd:b4:a1:11","NodeName":"kube-node-27"},{"PeerName":"d6:e9:e9:56:3c:4c","NodeName":"kube-node-04"},{"PeerName":"be:91:56:9d:c7:06","NodeName":"kube-node-39"},{"PeerName":"ee:fc:09:be:00:df","NodeName":"kube-node-13"},{"PeerName":"ca:84:00:e6:57:a9","NodeName":"kube-node-12"},{"PeerName":"2a:8c:99:01:75:e3","NodeName":"kube-node-06"},{"PeerName":"66:4d:43:df:97:77","NodeName":"kube-node-20"},{"PeerName":"42:33:fc:1e:9a:fc","NodeName":"kube-node-23"},{"PeerName":"ca:9f:2d:88:c9:fa","NodeName":"kube-node-14"},{"PeerName":"92:c7:36:a5:eb:75","NodeName":"kube-node-37"}]}'
creationTimestamp: "2019-04-28T11:27:00Z"
name: weave-net
namespace: kube-system
resourceVersion: "6072"
selfLink: /api/v1/namespaces/kube-system/configmaps/weave-net
uid: 89f35158-69a8-11e9-a8c4-0050568733bc

I ran the following weave commands on both weave-net pods above:

  • ./weave --local status
  • ./weave --local status connections
  • ./weave --local status ipam
  • ./weave --local report
  • ./weave --local status peers

I bundled the weave command output for each pod just above:

netstat -i on kube-node-13
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
datapath  1376 20597484      0      0 0           557      0      0      0 BMRU
docker0   1500        0      0      0 0             0      0      0      0 BMU
eth0      1500 399613179      0   4455 0      225991905      0      0      0 BMRU
lo       65536   709418      0      0 0        709418      0      0      0 LRU
vethwe-b  1376 28421958      0      0 0       8790100      0      0      0 BMRU
vethwe-d  1376  8790100      0      0 0      28421958      0      0      0 BMRU
vethwepl  1376  1706061      0      0 0      21987839      0      0      0 BMRU
vethwepl  1376    69516      0      0 0      20667372      0      0      0 BMRU
vethwepl  1376   245776      0      0 0      20809465      0      0      0 BMRU
vethwepl  1376  2598116      0      0 0      23287412      0      0      0 BMRU
vxlan-67 65485 36022722      0      0 0      27108788      0      8      0 BMRU
weave     1376 201067249      0      0 0      170367479      0      0      0 BMRU

# one hour later

Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
datapath  1376 20615012      0      0 0           558      0      0      0 BMRU
docker0   1500        0      0      0 0             0      0      0      0 BMU
eth0      1500 400134345      0   4633 0      226116220      0      0      0 BMRU
lo       65536   713448      0      0 0        713448      0      0      0 LRU
vethwe-b  1376 28439487      0      0 0       8790235      0      0      0 BMRU
vethwe-d  1376  8790235      0      0 0      28439487      0      0      0 BMRU
vethwepl  1376  1709103      0      0 0      22007775      0      0      0 BMRU
vethwepl  1376    69915      0      0 0      20685302      0      0      0 BMRU
vethwepl  1376   246042      0      0 0      20827130      0      0      0 BMRU
vethwepl  1376  2608023      0      0 0      23316447      0      0      0 BMRU
vxlan-67 65485 36349024      0      0 0      27125107      0      8      0 BMRU
weave     1376 201098252      0      0 0      170381921      0      0      0 BMRU
netstat -i on kube-node-15
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
datapath  1376 20620436      0      0 0           457      0      0      0 BMRU
docker0   1500        0      0      0 0             0      0      0      0 BMU
eth0      1500 400219170      0   4564 0      262336135      0      0      0 BMRU
lo       65536   711005      0      0 0        711005      0      0      0 LRU
vethwe-b  1376 42452135      0      0 0      22681381      0      0      0 BMRU
vethwe-d  1376 22681381      0      0 0      42452135      0      0      0 BMRU
vethwepl  1376  2623593      0      0 0       5498571      0      0      0 BMRU
vethwepl  1376    69939      0      0 0      20690681      0      0      0 BMRU
vethwepl  1376  4577435      0      0 0      25121590      0      0      0 BMRU
vethwepl  1376   203458      0      0 0       2248214      0      0      0 BMRU
vethwepl  1376 10789639      0      0 0      23471612      0      0      0 BMRU
vethwepl  1376   256315      0      0 0      20843557      0      0      0 BMRU
vethwepl  1376   519734      0      0 0      13256327      0      0      0 BMRU
vethwepl  1376  8311771      0      0 0      16242806      0      0      0 BMRU
vethwepl  1376   972504      0      0 0      14050125      0      0      0 BMRU
vethwepl  1376 26466230      0      0 0      37959971      0      0      0 BMRU
vethwepl  1376   984011      0      0 0      14125608      0      0      0 BMRU
vethwepl  1376  4309263      0      0 0      18503301      0      0      0 BMRU
vethwepl  1376  2010851      0      0 0      22292166      0      0      0 BMRU
vxlan-67 65485 51759435      0      0 0      55382541      0    140      0 BMRU
weave     1376 206375630      0      0 0      178486203      0      0      0 BMRU
dmesg output on kube-node-13
[Mon May  6 00:30:04 2019] weave: port 14(vethwepl8acded0) entered disabled state
[Mon May  6 00:30:04 2019] device vethwepl8acded0 left promiscuous mode
[Mon May  6 00:30:04 2019] weave: port 14(vethwepl8acded0) entered disabled state
[Mon May  6 00:30:04 2019] weave: port 19(vethwepl08d2b66) entered disabled state
[Mon May  6 00:30:04 2019] device vethwepl08d2b66 left promiscuous mode
[Mon May  6 00:30:04 2019] weave: port 19(vethwepl08d2b66) entered disabled state
[Mon May  6 00:30:05 2019] weave: port 10(vethweplb1da716) entered disabled state
[Mon May  6 00:30:05 2019] device vethweplb1da716 left promiscuous mode
[Mon May  6 00:30:05 2019] weave: port 10(vethweplb1da716) entered disabled state
[Mon May  6 00:30:05 2019] weave: port 15(vethwepl441ee0c) entered disabled state
[Mon May  6 00:30:05 2019] device vethwepl441ee0c left promiscuous mode
[Mon May  6 00:30:05 2019] weave: port 15(vethwepl441ee0c) entered disabled state
[Mon May  6 00:30:05 2019] weave: port 11(vethwepl25d7e34) entered disabled state
[Mon May  6 00:30:05 2019] device vethwepl25d7e34 left promiscuous mode
[Mon May  6 00:30:05 2019] weave: port 11(vethwepl25d7e34) entered disabled state
[Mon May  6 00:30:06 2019] weave: port 9(vethweplfbe9743) entered disabled state
[Mon May  6 00:30:06 2019] device vethweplfbe9743 left promiscuous mode
[Mon May  6 00:30:06 2019] weave: port 9(vethweplfbe9743) entered disabled state
[Mon May  6 00:30:07 2019] weave: port 21(vethwepl4fc1c8e) entered disabled state
[Mon May  6 00:30:07 2019] device vethwepl4fc1c8e left promiscuous mode
[Mon May  6 00:30:07 2019] weave: port 21(vethwepl4fc1c8e) entered disabled state
[Mon May  6 00:30:07 2019] weave: port 17(vethweplc31ce66) entered disabled state
[Mon May  6 00:30:07 2019] device vethweplc31ce66 left promiscuous mode
[Mon May  6 00:30:07 2019] weave: port 17(vethweplc31ce66) entered disabled state
[Mon May  6 00:30:08 2019] weave: port 7(vethwepldd82b30) entered disabled state
[Mon May  6 00:30:08 2019] device vethwepldd82b30 left promiscuous mode
[Mon May  6 00:30:08 2019] weave: port 7(vethwepldd82b30) entered disabled state
[Mon May  6 00:30:08 2019] weave: port 22(vethwepl0870155) entered disabled state
[Mon May  6 00:30:08 2019] device vethwepl0870155 left promiscuous mode
[Mon May  6 00:30:08 2019] weave: port 22(vethwepl0870155) entered disabled state
[Mon May  6 00:30:08 2019] weave: port 12(vethwepl440f01d) entered disabled state
[Mon May  6 00:30:08 2019] device vethwepl440f01d left promiscuous mode
[Mon May  6 00:30:08 2019] weave: port 12(vethwepl440f01d) entered disabled state
[Mon May  6 00:30:09 2019] weave: port 13(vethwepl21ad0c7) entered disabled state
[Mon May  6 00:30:09 2019] device vethwepl21ad0c7 left promiscuous mode
[Mon May  6 00:30:09 2019] weave: port 13(vethwepl21ad0c7) entered disabled state
[Mon May  6 00:30:26 2019] weave: port 8(vethwepl25e42e6) entered disabled state
[Mon May  6 00:30:26 2019] device vethwepl25e42e6 left promiscuous mode
[Mon May  6 00:30:26 2019] weave: port 8(vethwepl25e42e6) entered disabled state
[Mon May  6 00:31:00 2019] weave: port 6(vethwepl47ca67b) entered disabled state
[Mon May  6 00:31:00 2019] device vethwepl47ca67b left promiscuous mode
[Mon May  6 00:31:00 2019] weave: port 6(vethwepl47ca67b) entered disabled state
dmesg output on kube-node-15
None for the day of the issue or the 2 days before.

iptables rules and ifconfig output for both nodes

Weave-net had already been bounced to restore service on those pods/nodes, but for completeness here is the output of a few more commands:

on kube-node-13
$ ip route sh
default via 192.168.62.1 dev eth0
10.32.0.0/12 dev weave proto kernel scope link src 10.45.192.0
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.62.0/24 dev eth0 proto kernel scope link src 192.168.62.29

$ ip -4 -o addr
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eth0    inet 192.168.62.29/24 brd 192.168.62.255 scope global eth0\       valid_lft forever preferred_lft forever
3: docker0    inet 172.17.0.1/16 scope global docker0\       valid_lft forever preferred_lft forever
6: weave    inet 10.45.192.0/12 brd 10.47.255.255 scope global weave\       valid_lft forever preferred_lft forever
on kube-node-15
$ ip route sh
default via 192.168.62.1 dev eth0
10.32.0.0/12 dev weave proto kernel scope link src 10.47.192.0
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.62.0/24 dev eth0 proto kernel scope link src 192.168.62.31

$ ip -4 -o addr
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eth0    inet 192.168.62.31/24 brd 192.168.62.255 scope global eth0\       valid_lft forever preferred_lft forever
3: docker0    inet 172.17.0.1/16 scope global docker0\       valid_lft forever preferred_lft forever
6: weave    inet 10.47.192.0/12 brd 10.47.255.255 scope global weave\       valid_lft forever preferred_lft forever

I looked at the output of these commands; 8 minutes after the beginning of the issue I see the kubelet attempting to tear down some volumes and containers running on the node, with matching activity in docker.service. Seems like "normal" noise.
$ journalctl -u docker.service --no-pager
$ journalctl -u kubelet --no-pager

Things that we tried

  • Deleting /var/lib/weave/* on all nodes and changing the weave-net daemonset to trigger a rollout. The situation persists.
@pfcarrier
Author

In case it can help, I am sharing the temporary mitigation we put in place to detect the situation and bounce weave-net quickly when it occurs, until the root cause is identified.

This assumes:

  • your kube-dns service is listening on 10.96.0.10:53
  • the DNS service is a daemonset running a pod on every node
  • your cluster is big enough that, statistically, you have at least a 90% chance of failing 3 consecutive connections to 10.96.0.10:53 if the local node loses access to every other node's pods serving 10.96.0.10
  • bouncing weave-net restores service (we needed to apply this modification to our system to make that the case)
  • the issue is not occurring more often than once per 10 minutes per node
  • if you are not experiencing this issue you likely shouldn't run this daemonset, as there are trade-offs

Basically we added a hackish liveness probe to the weave daemonset, as follows:

weave-net.yml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
annotations:
  cloud.weave.works/launcher-info: |-
    {
      "original-request": {
        "url": "/k8s/v1.10/net.yaml?k8s-version=Q2xpZW50IFZlcnNpb246IHZlcnNpb24uSW5mb3tNYWpvcjoiMSIsIE1pbm9yOiIxMyIsIEdpdFZlcnNpb246InYxLjEzLjUiLCBHaXRDb21taXQ6IjIxNjY5NDZmNDFiMzZkZWEyYzQ2MjZmOTBhNzc3MDZmNDI2Y2RlYTIiLCBHaXRUcmVlU3RhdGU6ImNsZWFuIiwgQnVpbGREYXRlOiIyMDE5LTAzLTI1VDE1OjI2OjUyWiIsIEdvVmVyc2lvbjoiZ28xLjExLjUiLCBDb21waWxlcjoiZ2MiLCBQbGF0Zm9ybToibGludXgvYW1kNjQifQpTZXJ2ZXIgVmVyc2lvbjogdmVyc2lvbi5JbmZve01ham9yOiIxIiwgTWlub3I6IjEzIiwgR2l0VmVyc2lvbjoidjEuMTMuNSIsIEdpdENvbW1pdDoiMjE2Njk0NmY0MWIzNmRlYTJjNDYyNmY5MGE3NzcwNmY0MjZjZGVhMiIsIEdpdFRyZWVTdGF0ZToiY2xlYW4iLCBCdWlsZERhdGU6IjIwMTktMDMtMjVUMTU6MTk6MjJaIiwgR29WZXJzaW9uOiJnbzEuMTEuNSIsIENvbXBpbGVyOiJnYyIsIFBsYXRmb3JtOiJsaW51eC9hbWQ2NCJ9Cg==",
        "date": "Mon May 06 2019 17:51:28 GMT+0000 (UTC)"
      },
      "email-address": "[email protected]"
    }
labels:
  name: weave-net
name: weave-net
namespace: kube-system
spec:
revisionHistoryLimit: 10
selector:
  matchLabels:
    name: weave-net
template:
  metadata:
    labels:
      name: weave-net
  spec:
    containers:
    - command:
      - /home/weave/launch.sh
      env:
      - name: DATE
        value: 2019.000.000
      - name: HOSTNAME
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: spec.nodeName
      image: docker.io/weaveworks/weave-kube:2.5.1
      imagePullPolicy: IfNotPresent
      name: weave
      readinessProbe:
        failureThreshold: 3
        httpGet:
          host: 127.0.0.1
          path: /status
          port: 6784
          scheme: HTTP
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      livenessProbe:
        exec:
          command:
          - /bin/sh
          - -c
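          # try the cluster DNS VIP 3 times; if all attempts fail, wipe the
          # weave DB and return failure so the kubelet restarts the container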
          - nc 10.96.0.10 53 -z -w5 || nc 10.96.0.10 53 -z -w5 || nc 10.96.0.10 53 -z -w5 || ( rm -rf /var/lib/weave/* ; false )
        periodSeconds: 30
        timeoutSeconds: 20
        successThreshold: 1
        failureThreshold: 1
        initialDelaySeconds: 600
      resources:
        requests:
          cpu: 100m
      securityContext:
        privileged: true
        procMount: Default
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /weavedb
        name: weavedb
      - mountPath: /host/opt
        name: cni-bin
      - mountPath: /host/home
        name: cni-bin2
      - mountPath: /host/etc
        name: cni-conf
      - mountPath: /host/var/lib/dbus
        name: dbus
      - mountPath: /lib/modules
        name: lib-modules
      - mountPath: /run/xtables.lock
        name: xtables-lock
    - env:
      - name: HOSTNAME
        valueFrom:
          fieldRef:
            apiVersion: v1
            fieldPath: spec.nodeName
      image: docker.io/weaveworks/weave-npc:2.5.1
      imagePullPolicy: IfNotPresent
      name: weave-npc
      resources:
        requests:
          cpu: 10m
      securityContext:
        privileged: true
        procMount: Default
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /run/xtables.lock
        name: xtables-lock
    dnsPolicy: ClusterFirst
    hostNetwork: true
    hostPID: true
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext:
      seLinuxOptions: {}
    serviceAccount: weave-net
    serviceAccountName: weave-net
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoSchedule
      operator: Exists
    volumes:
    - hostPath:
        path: /var/lib/weave
        type: ""
      name: weavedb
    - hostPath:
        path: /opt
        type: ""
      name: cni-bin
    - hostPath:
        path: /home
        type: ""
      name: cni-bin2
    - hostPath:
        path: /etc
        type: ""
      name: cni-conf
    - hostPath:
        path: /var/lib/dbus
        type: ""
      name: dbus
    - hostPath:
        path: /lib/modules
        type: ""
      name: lib-modules
    - hostPath:
        path: /run/xtables.lock
        type: FileOrCreate
      name: xtables-lock
updateStrategy:
  rollingUpdate:
    maxUnavailable: 1
  type: RollingUpdate

@pfcarrier
Author

pfcarrier commented May 9, 2019

A few errors in the logs that caught my attention and that I am trying to learn more about:
INFO: 2019/05/06 04:22:26.893968 ->[192.168.62.56:35201|b2:24:32:da:61:d3(kube-node-40)]: connection shutting down due to error: Multiple connections to b2:24:32:da:61:d3(kube-node-40) added to ee:fc:09:be:00:df(kube-node-13)

INFO: 2019/05/06 04:22:26.710028 overlay_switch ->[d6:e9:e9:56:3c:4c(kube-node-04)] sleeve timed out waiting for UDP heartbeat

INFO: 2019/05/06 04:22:26.716783 ->[192.168.62.23:35339] connection shutting down due to error during handshake: write tcp4 192.168.62.29:6783->192.168.62.23:35339: write: connection reset by peer

INFO: 2019/05/06 04:22:26.709454 ->[192.168.62.63:34327|da:ef:6b:19:cf:fe(kube-node-34)]: connection shutting down due to error: write tcp4 192.168.62.29:6783->192.168.62.63:34327: write: connection reset by peer

EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)

INFO: 2019/05/06 04:22:26.940223 sleeve ->[192.168.62.52:6783|56:e4:c4:a0:8b:66(kube-node-36)]: Effective MTU verified at 1438

ERRO: 2019/05/06 06:55:11.157018 Captured frame from MAC (3a:46:08:a5:52:b2) to (ba:84:78:1f:19:c9) associated with another peer 1e:69:6c:54:6d:af(kube-node-05)

Is it normal to see this many of these lines (one every minute)?
INFO: 2019/05/06 08:17:40.928440 Discovered remote MAC ba:84:78:1f:19:c9 at 2e:4e:bd:b4:a1:11(kube-node-27)

INFO: 2019/05/06 08:25:06.786421 ->[192.168.62.26:6783] connection shutting down due to error during handshake: failed to receive remote protocol header: read tcp4 192.168.62.29:60091->192.168.62.26:6783: i/o timeout

ERRO: 2019/05/06 04:16:02.039902 Captured frame from MAC (46:ac:4e:f1:3f:e4) to (9a:2f:41:96:03:b0) associated with another peer fe:1f:d3:5f:49:fd(kube-node-25)

@murali-reddy
Contributor

@pfcarrier thanks for the detailed bug report.

The weave-net pod weave-net-s4vk9 log for kube-node-13 validates your statement that a "VM in the cluster starts to have all its pods unable to talk to pods located on any node other than itself".

However it does look like that particular VM completely lost network connectivity to the rest of the nodes, as indicated in the logs. If you look at timestamp 04:22:26 you will notice the weave-net pod running on kube-node-13 lost connectivity to the rest of the nodes in the cluster.

INFO: 2019/05/06 04:22:26.709454 ->[192.168.62.63:34327|da:ef:6b:19:cf:fe(kube-node-34)]: connection shutting down due to error: write tcp4 192.168.62.29:6783->192.168.62.63:34327: write: connection reset by peer
INFO: 2019/05/06 04:22:26.709673 ->[192.168.62.21:6783|1e:69:6c:54:6d:af(kube-node-05)]: connection shutting down due to error: write tcp4 192.168.62.29:44935->192.168.62.21:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.709684 ->[192.168.62.52:44829|56:e4:c4:a0:8b:66(kube-node-36)]: connection shutting down due to error: read tcp4 192.168.62.29:6783->192.168.62.52:44829: i/o timeout
INFO: 2019/05/06 04:22:26.709729 ->[192.168.62.192:6783|8a:45:32:0a:c0:46(kube-node-02)]: connection shutting down due to error: write tcp4 192.168.62.29:39843->192.168.62.192:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.710795 ->[192.168.62.58:6783|ee:99:1b:17:84:cb(kube-node-42)]: connection shutting down due to error: read tcp4 192.168.62.29:52469->192.168.62.58:6783: i/o timeout
INFO: 2019/05/06 04:22:26.711342 ->[192.168.62.42:6783|7e:ed:9d:df:ce:a6(kube-node-26)]: connection shutting down due to error: read tcp4 192.168.62.29:33559->192.168.62.42:6783: i/o timeout
INFO: 2019/05/06 04:22:26.711437 ->[192.168.62.61:36979|ca:9f:2d:88:c9:fa(kube-node-14)]: connection shutting down due to error: write tcp4 192.168.62.29:6783->192.168.62.61:36979: write: connection reset by peer
INFO: 2019/05/06 04:22:26.713163 ->[192.168.62.25:6783|4e:bd:fe:81:d4:73(kube-node-09)]: connection shutting down due to error: write tcp4 192.168.62.29:35855->192.168.62.25:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.713186 ->[192.168.62.193:6783|82:14:63:36:a5:de(kube-node-03)]: connection shutting down due to error: write tcp4 192.168.62.29:53557->192.168.62.193:6783: write: broken pipe
INFO: 2019/05/06 04:22:26.713247 ->[192.168.62.35:6783|ca:21:fb:33:fa:26(kube-node-19)]: connection shutting down due to error: read tcp4 192.168.62.29:34229->192.168.62.35:6783: i/o timeout
INFO: 2019/05/06 04:22:26.715431 ->[192.168.62.39:6783|42:33:fc:1e:9a:fc(kube-node-23)]: connection shutting down due to error: write tcp4 192.168.62.29:33321->192.168.62.39:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.715638 ->[192.168.62.62:6783|06:07:7f:a1:ae:96(kube-node-24)]: connection shutting down due to error: write tcp4 192.168.62.29:43357->192.168.62.62:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.715665 ->[192.168.62.57:6783|62:b7:f9:fe:10:ab(kube-node-41)]: connection shutting down due to error: read tcp4 192.168.62.29:52209->192.168.62.57:6783: i/o timeout
INFO: 2019/05/06 04:22:26.715678 ->[192.168.62.55:6783|be:91:56:9d:c7:06(kube-node-39)]: connection shutting down due to error: write tcp4 192.168.62.29:41779->192.168.62.55:6783: write: broken pipe
INFO: 2019/05/06 04:22:26.715688 ->[192.168.62.51:37685|86:96:14:d4:07:9b(kube-node-35)]: connection shutting down due to error: read tcp4 192.168.62.29:6783->192.168.62.51:37685: i/o timeout
INFO: 2019/05/06 04:22:26.715699 ->[192.168.62.34:59089|ba:24:c0:0b:ba:bd(kube-node-18)]: connection shutting down due to error: read tcp4 192.168.62.29:6783->192.168.62.34:59089: i/o timeout
INFO: 2019/05/06 04:22:26.715717 ->[192.168.62.48:55097|46:e6:49:7c:6b:88(kube-node-32)]: connection shutting down due to error: read tcp4 192.168.62.29:6783->192.168.62.48:55097: read: connection reset by peer
INFO: 2019/05/06 04:22:26.715729 ->[192.168.62.37:50773|8e:f2:46:a8:e4:3a(kube-node-21)]: connection shutting down due to error: write tcp4 192.168.62.29:6783->192.168.62.37:50773: write: broken pipe
INFO: 2019/05/06 04:22:26.715744 ->[192.168.62.32:6783|7e:88:bb:7c:b9:cc(kube-node-16)]: connection shutting down due to error: write tcp4 192.168.62.29:35687->192.168.62.32:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.715755 ->[192.168.62.45:6783|5e:3a:1e:47:8c:3d(kube-node-29)]: connection shutting down due to error: write tcp4 192.168.62.29:40213->192.168.62.45:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.715765 ->[192.168.62.27:6783|1a:a4:36:e2:2c:44(kube-node-11)]: connection shutting down due to error: write tcp4 192.168.62.29:46655->192.168.62.27:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.715787 ->[192.168.62.43:6783|2e:4e:bd:b4:a1:11(kube-node-27)]: connection shutting down due to error: write tcp4 192.168.62.29:41081->192.168.62.43:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.715798 ->[192.168.62.28:6783|ca:84:00:e6:57:a9(kube-node-12)]: connection shutting down due to error: write tcp4 192.168.62.29:56087->192.168.62.28:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.715807 ->[192.168.62.24:6783|d6:38:1b:fb:97:a4(kube-node-08)]: connection shutting down due to error: write tcp4 192.168.62.29:54495->192.168.62.24:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.715820 ->[192.168.62.20:6783|d6:e9:e9:56:3c:4c(kube-node-04)]: connection shutting down due to error: write tcp4 192.168.62.29:50479->192.168.62.20:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.715829 ->[192.168.62.41:6783|fe:1f:d3:5f:49:fd(kube-node-25)]: connection shutting down due to error: write tcp4 192.168.62.29:45333->192.168.62.41:6783: write: broken pipe
INFO: 2019/05/06 04:22:26.715848 ->[192.168.62.26:6783|a2:ea:5b:4a:72:1d(kube-node-10)]: connection shutting down due to error: write tcp4 192.168.62.29:47275->192.168.62.26:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.715864 ->[192.168.62.33:56467|fe:bd:2d:6d:f4:2d(kube-node-17)]: connection shutting down due to error: write tcp4 192.168.62.29:6783->192.168.62.33:56467: write: connection reset by peer
INFO: 2019/05/06 04:22:26.715890 ->[192.168.62.44:33647|ea:11:40:96:af:24(kube-node-28)]: connection shutting down due to error: write tcp4 192.168.62.29:6783->192.168.62.44:33647: write: connection reset by peer
INFO: 2019/05/06 04:22:26.715978 ->[192.168.62.31:6783|22:41:be:0d:19:ea(kube-node-15)]: connection shutting down due to error: write tcp4 192.168.62.29:35393->192.168.62.31:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.716118 ->[192.168.62.47:35361|1e:e4:18:98:ce:c4(kube-node-31)]: connection shutting down due to error: write tcp4 192.168.62.29:6783->192.168.62.47:35361: write: connection reset by peer
INFO: 2019/05/06 04:22:26.716130 ->[192.168.62.38:6783|0e:ff:07:bc:7f:1f(kube-node-22)]: connection shutting down due to error: write tcp4 192.168.62.29:58419->192.168.62.38:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.716151 ->[192.168.62.36:6783|66:4d:43:df:97:77(kube-node-20)]: connection shutting down due to error: write tcp4 192.168.62.29:42341->192.168.62.36:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.716188 ->[192.168.62.53:6783|92:c7:36:a5:eb:75(kube-node-37)]: connection shutting down due to error: write tcp4 192.168.62.29:49735->192.168.62.53:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.716194 ->[192.168.62.191:6783|ce:d4:17:3b:00:db(kube-node-01)]: connection shutting down due to error: read tcp4 192.168.62.29:52641->192.168.62.191:6783: i/o timeout
INFO: 2019/05/06 04:22:26.716318 ->[192.168.62.56:47143|b2:24:32:da:61:d3(kube-node-40)]: connection shutting down due to error: write tcp4 192.168.62.29:6783->192.168.62.56:47143: write: connection reset by peer
INFO: 2019/05/06 04:22:26.716350 ->[192.168.62.22:6783|2a:8c:99:01:75:e3(kube-node-06)]: connection shutting down due to error: write tcp4 192.168.62.29:33587->192.168.62.22:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.716359 ->[192.168.62.46:6783|de:30:16:77:64:d8(kube-node-30)]: connection shutting down due to error: write tcp4 192.168.62.29:34711->192.168.62.46:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.716398 ->[192.168.62.54:6783|6a:1a:a0:47:c1:2a(kube-node-38)]: connection shutting down due to error: write tcp4 192.168.62.29:46905->192.168.62.54:6783: write: connection reset by peer
INFO: 2019/05/06 04:22:26.716468 ->[192.168.62.23:6783|be:f7:87:9e:3c:40(kube-node-07)]: connection shutting down due to error: write tcp4 192.168.62.29:59069->192.168.62.23:6783: write: connection reset by peer

As you can see in the logs, connections were re-established within milliseconds.

INFO: 2019/05/06 04:22:26.853951 ->[192.168.62.52:6783|56:e4:c4:a0:8b:66(kube-node-36)]: connection fully established
INFO: 2019/05/06 04:22:26.913808 ->[192.168.62.192:6783|8a:45:32:0a:c0:46(kube-node-02)]: connection fully established
INFO: 2019/05/06 04:22:26.918374 ->[192.168.62.21:6783|1e:69:6c:54:6d:af(kube-node-05)]: connection fully established
INFO: 2019/05/06 04:22:26.923376 ->[192.168.62.42:6783|7e:ed:9d:df:ce:a6(kube-node-26)]: connection fully established
INFO: 2019/05/06 04:22:26.929635 ->[192.168.62.63:6783|da:ef:6b:19:cf:fe(kube-node-34)]: connection fully established
INFO: 2019/05/06 04:22:26.930703 ->[192.168.62.35:6783|ca:21:fb:33:fa:26(kube-node-19)]: connection fully established
INFO: 2019/05/06 04:22:26.932533 ->[192.168.62.25:6783|4e:bd:fe:81:d4:73(kube-node-09)]: connection fully established
INFO: 2019/05/06 04:22:26.933128 ->[192.168.62.193:6783|82:14:63:36:a5:de(kube-node-03)]: connection fully established
INFO: 2019/05/06 04:22:26.935299 ->[192.168.62.62:6783|06:07:7f:a1:ae:96(kube-node-24)]: connection fully established
INFO: 2019/05/06 04:22:26.938072 ->[192.168.62.58:6783|ee:99:1b:17:84:cb(kube-node-42)]: connection fully established
INFO: 2019/05/06 04:22:26.940508 ->[192.168.62.61:6783|ca:9f:2d:88:c9:fa(kube-node-14)]: connection fully established
INFO: 2019/05/06 04:22:26.946358 ->[192.168.62.34:6783|ba:24:c0:0b:ba:bd(kube-node-18)]: connection fully established
INFO: 2019/05/06 04:22:26.948851 ->[192.168.62.37:6783|8e:f2:46:a8:e4:3a(kube-node-21)]: connection fully established
INFO: 2019/05/06 04:22:26.952223 ->[192.168.62.57:6783|62:b7:f9:fe:10:ab(kube-node-41)]: connection fully established
INFO: 2019/05/06 04:22:26.956854 ->[192.168.62.39:6783|42:33:fc:1e:9a:fc(kube-node-23)]: connection fully established
INFO: 2019/05/06 04:22:26.957177 ->[192.168.62.32:6783|7e:88:bb:7c:b9:cc(kube-node-16)]: connection fully established
INFO: 2019/05/06 04:22:26.957511 ->[192.168.62.51:6783|86:96:14:d4:07:9b(kube-node-35)]: connection fully established
INFO: 2019/05/06 04:22:26.957835 ->[192.168.62.28:6783|ca:84:00:e6:57:a9(kube-node-12)]: connection fully established
INFO: 2019/05/06 04:22:26.958155 ->[192.168.62.20:6783|d6:e9:e9:56:3c:4c(kube-node-04)]: connection fully established
INFO: 2019/05/06 04:22:26.958517 ->[192.168.62.27:6783|1a:a4:36:e2:2c:44(kube-node-11)]: connection fully established
INFO: 2019/05/06 04:22:26.960514 ->[192.168.62.26:6783|a2:ea:5b:4a:72:1d(kube-node-10)]: connection fully established
INFO: 2019/05/06 04:22:26.961467 ->[192.168.62.41:6783|fe:1f:d3:5f:49:fd(kube-node-25)]: connection fully established
INFO: 2019/05/06 04:22:26.962266 ->[192.168.62.55:6783|be:91:56:9d:c7:06(kube-node-39)]: connection fully established
INFO: 2019/05/06 04:22:26.964483 ->[192.168.62.24:6783|d6:38:1b:fb:97:a4(kube-node-08)]: connection fully established
INFO: 2019/05/06 04:22:26.966125 ->[192.168.62.53:6783|92:c7:36:a5:eb:75(kube-node-37)]: connection fully established
INFO: 2019/05/06 04:22:26.966627 ->[192.168.62.45:6783|5e:3a:1e:47:8c:3d(kube-node-29)]: connection fully established
INFO: 2019/05/06 04:22:26.995315 ->[192.168.62.43:6783|2e:4e:bd:b4:a1:11(kube-node-27)]: connection fully established
INFO: 2019/05/06 04:22:26.996434 ->[192.168.62.36:6783|66:4d:43:df:97:77(kube-node-20)]: connection fully established
INFO: 2019/05/06 04:22:26.996773 ->[192.168.62.44:6783|ea:11:40:96:af:24(kube-node-28)]: connection fully established
INFO: 2019/05/06 04:22:26.998180 ->[192.168.62.38:6783|0e:ff:07:bc:7f:1f(kube-node-22)]: connection fully established
INFO: 2019/05/06 04:22:27.000343 ->[192.168.62.31:6783|22:41:be:0d:19:ea(kube-node-15)]: connection fully established
INFO: 2019/05/06 04:22:27.003212 ->[192.168.62.47:6783|1e:e4:18:98:ce:c4(kube-node-31)]: connection fully established
INFO: 2019/05/06 04:22:27.004826 ->[192.168.62.191:6783|ce:d4:17:3b:00:db(kube-node-01)]: connection fully established
INFO: 2019/05/06 04:22:27.006018 ->[192.168.62.23:6783|be:f7:87:9e:3c:40(kube-node-07)]: connection fully established
INFO: 2019/05/06 04:22:27.010163 ->[192.168.62.49:6783|96:13:be:0c:ac:e5(kube-node-33)]: connection fully established
INFO: 2019/05/06 04:22:27.020585 ->[192.168.62.33:6783|fe:bd:2d:6d:f4:2d(kube-node-17)]: connection fully established
INFO: 2019/05/06 04:22:27.022942 ->[192.168.62.54:6783|6a:1a:a0:47:c1:2a(kube-node-38)]: connection fully established
INFO: 2019/05/06 04:22:27.025397 ->[192.168.62.46:6783|de:30:16:77:64:d8(kube-node-30)]: connection fully established
INFO: 2019/05/06 04:22:27.025944 ->[192.168.62.22:6783|2a:8c:99:01:75:e3(kube-node-06)]: connection fully established
INFO: 2019/05/06 04:22:27.034826 ->[192.168.62.48:44281|46:e6:49:7c:6b:88(kube-node-32)]: connection fully established
INFO: 2019/05/06 04:22:27.036805 ->[192.168.62.34:37199|ba:24:c0:0b:ba:bd(kube-node-18)]: connection fully established

Clearly there was a temporary network glitch for that VM, which might have been caused by the VMware environment or the network.

Reaching out for guidance on pinpointing the cause, understanding the weave logs, and confirming or refuting whether this is an expected failure mode.

If there are underlay network connectivity issues, this is going to impact overlay connectivity. Losing pod-to-pod connectivity is expected. But do you have any insight into how long it took to recover (pod-to-pod connectivity)?

There are a couple of weave-net issues that are visible in the logs.

  • for some reason fastdp can no longer be used, so this node falls back to sleeve mode with the rest of the cluster nodes. As you can see from weave-net-bp72h's weave status connections, the rest of the cluster is using fastdp; only this node is using sleeve mode. There should have been an attempt to switch back to fastdp; it might have failed to use fastdp.
  • there are connection shutting down due to error: Multiple connections to ... errors during the recovery period when this incident happened:
INFO: 2019/05/06 04:22:26.893968 ->[192.168.62.56:35201|b2:24:32:da:61:d3(kube-node-40)]: connection shutting down due to error: Multiple connections to b2:24:32:da:61:d3(kube-node-40) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:27.101064 ->[192.168.62.33:49791|fe:bd:2d:6d:f4:2d(kube-node-17)]: connection shutting down due to error: Multiple connections to fe:bd:2d:6d:f4:2d(kube-node-17) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:27.201532 ->[192.168.62.61:60159|ca:9f:2d:88:c9:fa(kube-node-14)]: connection shutting down due to error: Multiple connections to ca:9f:2d:88:c9:fa(kube-node-14) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:27.301433 ->[192.168.62.35:6783|ca:21:fb:33:fa:26(kube-node-19)]: connection shutting down due to error: Multiple connections to ca:21:fb:33:fa:26(kube-node-19) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:27.401416 ->[192.168.62.192:6783|8a:45:32:0a:c0:46(kube-node-02)]: connection shutting down due to error: Multiple connections to 8a:45:32:0a:c0:46(kube-node-02) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:27.501258 ->[192.168.62.45:6783|5e:3a:1e:47:8c:3d(kube-node-29)]: connection shutting down due to error: Multiple connections to 5e:3a:1e:47:8c:3d(kube-node-29) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:27.601508 ->[192.168.62.27:6783|1a:a4:36:e2:2c:44(kube-node-11)]: connection shutting down due to error: Multiple connections to 1a:a4:36:e2:2c:44(kube-node-11) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:27.701575 ->[192.168.62.55:51147|be:91:56:9d:c7:06(kube-node-39)]: connection shutting down due to error: Multiple connections to be:91:56:9d:c7:06(kube-node-39) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:27.801558 ->[192.168.62.36:6783|66:4d:43:df:97:77(kube-node-20)]: connection shutting down due to error: Multiple connections to 66:4d:43:df:97:77(kube-node-20) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:27.901621 ->[192.168.62.58:6783|ee:99:1b:17:84:cb(kube-node-42)]: connection shutting down due to error: Multiple connections to ee:99:1b:17:84:cb(kube-node-42) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:28.002912 ->[192.168.62.23:6783|be:f7:87:9e:3c:40(kube-node-07)]: connection shutting down due to error: Multiple connections to be:f7:87:9e:3c:40(kube-node-07) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:28.101431 ->[192.168.62.20:46695|d6:e9:e9:56:3c:4c(kube-node-04)]: connection shutting down due to error: Multiple connections to d6:e9:e9:56:3c:4c(kube-node-04) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:28.201318 ->[192.168.62.24:6783|d6:38:1b:fb:97:a4(kube-node-08)]: connection shutting down due to error: Multiple connections to d6:38:1b:fb:97:a4(kube-node-08) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:28.301142 ->[192.168.62.43:6783|2e:4e:bd:b4:a1:11(kube-node-27)]: connection shutting down due to error: Multiple connections to 2e:4e:bd:b4:a1:11(kube-node-27) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:28.401191 ->[192.168.62.38:37441|0e:ff:07:bc:7f:1f(kube-node-22)]: connection shutting down due to error: Multiple connections to 0e:ff:07:bc:7f:1f(kube-node-22) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:28.502221 ->[192.168.62.62:6783|06:07:7f:a1:ae:96(kube-node-24)]: connection shutting down due to error: Multiple connections to 06:07:7f:a1:ae:96(kube-node-24) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:28.601167 ->[192.168.62.193:33807|82:14:63:36:a5:de(kube-node-03)]: connection shutting down due to error: Multiple connections to 82:14:63:36:a5:de(kube-node-03) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:28.701304 ->[192.168.62.22:37891|2a:8c:99:01:75:e3(kube-node-06)]: connection shutting down due to error: Multiple connections to 2a:8c:99:01:75:e3(kube-node-06) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:28.801417 ->[192.168.62.21:6783|1e:69:6c:54:6d:af(kube-node-05)]: connection shutting down due to error: Multiple connections to 1e:69:6c:54:6d:af(kube-node-05) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:28.901632 ->[192.168.62.41:6783|fe:1f:d3:5f:49:fd(kube-node-25)]: connection shutting down due to error: Multiple connections to fe:1f:d3:5f:49:fd(kube-node-25) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:29.001087 ->[192.168.62.32:56901|7e:88:bb:7c:b9:cc(kube-node-16)]: connection shutting down due to error: Multiple connections to 7e:88:bb:7c:b9:cc(kube-node-16) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:29.101348 ->[192.168.62.53:6783|92:c7:36:a5:eb:75(kube-node-37)]: connection shutting down due to error: Multiple connections to 92:c7:36:a5:eb:75(kube-node-37) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:29.201192 ->[192.168.62.28:6783|ca:84:00:e6:57:a9(kube-node-12)]: connection shutting down due to error: Multiple connections to ca:84:00:e6:57:a9(kube-node-12) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:29.301247 ->[192.168.62.31:6783|22:41:be:0d:19:ea(kube-node-15)]: connection shutting down due to error: Multiple connections to 22:41:be:0d:19:ea(kube-node-15) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:29.401852 ->[192.168.62.51:6783|86:96:14:d4:07:9b(kube-node-35)]: connection shutting down due to error: Multiple connections to 86:96:14:d4:07:9b(kube-node-35) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:29.407752 ->[192.168.62.51:43687|86:96:14:d4:07:9b(kube-node-35)]: connection shutting down due to error: Multiple connections to 86:96:14:d4:07:9b(kube-node-35) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:29.501534 ->[192.168.62.44:6783|ea:11:40:96:af:24(kube-node-28)]: connection shutting down due to error: Multiple connections to ea:11:40:96:af:24(kube-node-28) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:29.601471 ->[192.168.62.63:51423|da:ef:6b:19:cf:fe(kube-node-34)]: connection shutting down due to error: Multiple connections to da:ef:6b:19:cf:fe(kube-node-34) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:29.701216 ->[192.168.62.191:6783|ce:d4:17:3b:00:db(kube-node-01)]: connection shutting down due to error: Multiple connections to ce:d4:17:3b:00:db(kube-node-01) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:29.801309 ->[192.168.62.25:6783|4e:bd:fe:81:d4:73(kube-node-09)]: connection shutting down due to error: Multiple connections to 4e:bd:fe:81:d4:73(kube-node-09) added to ee:fc:09:be:00:df(kube-node-13)
INFO: 2019/05/06 04:22:29.901468 ->[192.168.62.42:37181|7e:ed:9d:df:ce:a6(kube-node-26)]: connection shutting down due to error: Multiple connections to 7e:ed:9d:df:ce:a6(kube-node-26) added to ee:fc:09:be:00:df(kube-node-13)

Would you be able to enable debug logging for weave-net and share the logs when you notice this behaviour again?

The closest I got was by running iptables command and drop udp traffic in/out for port 6784 udp.

I will give it a try to see if I can reproduce similar symptoms
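
If it helps: with the Kubernetes addon, debug logging can normally be enabled by passing --log-level=debug to the router through the weave container's environment (this assumes the stock DaemonSet, whose launch script forwards an EXTRA_ARGS variable to the router), e.g.:

kubectl -n kube-system set env daemonset/weave-net -c weave EXTRA_ARGS=--log-level=debug

The pods get recreated as the change rolls out, so it is best done during a quiet period.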

@pfcarrier
Author

Thanks so much for looking into it @murali-reddy

I will definitely enable debug logging. This will take me a couple of days, as I need to wait for a quiet period so I can remove the workaround and capture all logs without a restart occurring.

do you have any insight into how long it took to recover (pod-to-pod connectivity)?

This is the thing that puzzles me: for as long as this node was on sleeve, pod-to-pod connectivity never recovered, despite all TCP connections being reported as established in the weave status connections output; it is only when I manually deleted the weave-net-s4vk9 pod hours later that connectivity came back (after the restart it ended up using fastdp). This makes me think that maybe there is something in that environment that prevents sleeve from being functional.

I will try to confirm or rule that out by setting WEAVE_NO_FASTDP=true and seeing if sleeve works as expected in a cluster running in that same VLAN. I am curious, is there an imperative way to instruct weave to fall back to sleeve on a given node? Through a kill signal or an API call, or is the env var the only way?

I took the general idea from #2354; I will further try to pinpoint what happens to the packets when pod-to-pod connectivity is lost, drawing on that thread.

@pfcarrier
Author

Clearly there is a temporary network glitch for that VM, which might have been caused by the VMware environment or network.

I wholeheartedly agree that everything in the logs points toward that.

@murali-reddy
Contributor

I will try to confirm or rule that out by setting WEAVE_NO_FASTDP=true and seeing if sleeve works as expected in a cluster running in that same VLAN. I am curious, is there an imperative way to instruct weave to fall back to sleeve on a given node? Through a kill signal or an API call, or is the env var the only way?

Yes, setting the environment variable is the only way.
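
For reference, with the Kubernetes addon that would look something like this (the router container is named weave in the stock DaemonSet; adjust if yours differs):

kubectl -n kube-system set env daemonset/weave-net -c weave WEAVE_NO_FASTDP=true

Note that this applies to every peer managed by the DaemonSet rather than to a single node.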

as long as this node was on sleeve, pod-to-pod connectivity never recovered despite all TCP connections

Any errors in the logs when sleeve mode is active?

@murali-reddy
Contributor

The closest I got was by running iptables command and drop udp traffic in/out for port 6784 udp.

This did not reproduce the issue for me. Once connections are dropped and re-established between the peers, they all end up using fastdp as expected. Even if sleeve is used, there is logic to retry and switch back to fastdp. There is something in your environment that prevented both from happening. Debug-level logs would help.
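
For anyone trying the same repro, rules along these lines will drop weave's UDP traffic on port 6784 on a node, and can simply be removed again to recover:

# simulate the glitch on one node
iptables -I INPUT  -p udp --dport 6784 -j DROP
iptables -I OUTPUT -p udp --dport 6784 -j DROP
# undo it
iptables -D INPUT  -p udp --dport 6784 -j DROP
iptables -D OUTPUT -p udp --dport 6784 -j DROP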

@pfcarrier
Author

Thanks so much for your assistance @murali-reddy, apologies for the delay.

Any errors in the logs when sleeve mode is active?

Nothing that I can spot. As far as I can tell, the logs indicate that everything is going fine. E.g. in the original pod log for weave-net-s4vk9, all the events after 4:23 up to the end of the log (about 8 hours) cover the period where that pod was operating in sleeve mode while inter-pod networking was completely down, until it was restarted.

Workload on the cluster still prevents me from removing the workaround that restarts the weave-net pod about 30 seconds after inter-pod connectivity is lost. I am still waiting for a window to do that; in the meantime I rolled out debug logging and captured the logs up to the point of the restarts.
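
For anyone curious, the workaround is essentially a connectivity check wired into the weave container's liveness probe: an exec probe runs a small script that pings a pod known to live on another node and fails when it stops answering, so the kubelet restarts the pod shortly after inter-pod connectivity is lost. A rough sketch of the idea (the target IP, script path and timings here are illustrative, not our exact manifest):

#!/bin/sh
# check-interpod.sh -- illustrative sketch only
# REMOTE_POD_IP: a hypothetical pod IP that lives on a different node
REMOTE_POD_IP="10.32.0.10"
if ! ping -c 3 -W 2 "$REMOTE_POD_IP" > /dev/null 2>&1; then
  echo "inter-pod connectivity to $REMOTE_POD_IP lost" >&2
  exit 1
fi
exit 0

With a short periodSeconds and a failureThreshold of a few attempts on the probe, the restart ends up happening roughly 30 seconds after connectivity is lost.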

With debug enabled I tracked 4 events of pods automatically restarting due to loss of inter-pod communication; for each event I included logs since 4am as well as the logs after the restart was triggered.

Finally I included a 5th pod that didn't restart so we can observe its view of the events. One can search the log files for the string ====POD_RESTARTED==== to get to it. This is a bit anticlimactic, as up to the restart I don't notice any red flag in the weave-net logs for the pods that end up restarting. Hopefully I missed something.

I also include the relevant log output for each node; alas, all we can witness in them is the liveness probe detecting the loss of inter-pod communication and triggering a restart.

Global

kubectl get pod -owide -nkube-system | grep weave-net (click to expand)
weave-net-29lqf                                2/2     Running   0          21h     192.168.62.23    kube-node-07   <none>           <none>
weave-net-2wf5w                                2/2     Running   0          20h     192.168.62.44    kube-node-28   <none>           <none>
weave-net-4flws                                2/2     Running   0          20h     192.168.62.33    kube-node-17   <none>           <none>
weave-net-5wdjn                                2/2     Running   0          21h     192.168.62.52    kube-node-36   <none>           <none>
weave-net-62bsc                                2/2     Running   0          21h     192.168.62.45    kube-node-29   <none>           <none>
weave-net-62d7x                                2/2     Running   0          21h     192.168.62.51    kube-node-35   <none>           <none>
weave-net-6rtsk                                2/2     Running   0          20h     192.168.62.48    kube-node-32   <none>           <none>
weave-net-6wqgb                                2/2     Running   0          21h     192.168.62.22    kube-node-06   <none>           <none>
weave-net-8kkpm                                2/2     Running   1          20h     192.168.62.32    kube-node-16   <none>           <none>
weave-net-chzt5                                2/2     Running   0          21h     192.168.62.191   kube-node-01   <none>           <none>
weave-net-d82hk                                2/2     Running   0          21h     192.168.62.47    kube-node-31   <none>           <none>
weave-net-fqvtw                                2/2     Running   0          21h     192.168.62.31    kube-node-15   <none>           <none>
weave-net-h4xkz                                2/2     Running   0          20h     192.168.62.63    kube-node-34   <none>           <none>
weave-net-h7wtm                                2/2     Running   0          21h     192.168.62.34    kube-node-18   <none>           <none>
weave-net-hvq75                                2/2     Running   0          20h     192.168.62.21    kube-node-05   <none>           <none>
weave-net-jl48c                                2/2     Running   0          20h     192.168.62.24    kube-node-08   <none>           <none>
weave-net-jq7n4                                2/2     Running   0          20h     192.168.62.26    kube-node-10   <none>           <none>
weave-net-k5rz2                                2/2     Running   0          20h     192.168.62.28    kube-node-12   <none>           <none>
weave-net-k92ll                                2/2     Running   0          20h     192.168.62.56    kube-node-40   <none>           <none>
weave-net-kdxx5                                2/2     Running   0          20h     192.168.62.53    kube-node-37   <none>           <none>
weave-net-klnp2                                2/2     Running   1          20h     192.168.62.39    kube-node-23   <none>           <none>
weave-net-l4j8l                                2/2     Running   0          21h     192.168.62.42    kube-node-26   <none>           <none>
weave-net-l4t2w                                2/2     Running   0          21h     192.168.62.54    kube-node-38   <none>           <none>
weave-net-mbr5d                                2/2     Running   0          21h     192.168.62.29    kube-node-13   <none>           <none>
weave-net-mc4xl                                2/2     Running   0          20h     192.168.62.37    kube-node-21   <none>           <none>
weave-net-mgj88                                2/2     Running   0          21h     192.168.62.57    kube-node-41   <none>           <none>
weave-net-mjxw6                                2/2     Running   0          21h     192.168.62.35    kube-node-19   <none>           <none>
weave-net-ndldh                                2/2     Running   1          20h     192.168.62.58    kube-node-42   <none>           <none>
weave-net-nh5ts                                2/2     Running   0          20h     192.168.62.55    kube-node-39   <none>           <none>
weave-net-nv42m                                2/2     Running   0          21h     192.168.62.20    kube-node-04   <none>           <none>
weave-net-p8xz9                                2/2     Running   0          21h     192.168.62.41    kube-node-25   <none>           <none>
weave-net-pw55b                                2/2     Running   0          20h     192.168.62.38    kube-node-22   <none>           <none>
weave-net-q48rg                                2/2     Running   1          20h     192.168.62.49    kube-node-33   <none>           <none>
weave-net-rtr65                                2/2     Running   0          21h     192.168.62.43    kube-node-27   <none>           <none>
weave-net-s8qvl                                2/2     Running   0          20h     192.168.62.46    kube-node-30   <none>           <none>
weave-net-vjrvd                                2/2     Running   1          20h     192.168.62.25    kube-node-09   <none>           <none>
weave-net-vxvrs                                2/2     Running   0          21h     192.168.62.193   kube-node-03   <none>           <none>
weave-net-vzcrb                                2/2     Running   1          20h     192.168.62.61    kube-node-14   <none>           <none>
weave-net-vztjv                                2/2     Running   0          21h     192.168.62.192   kube-node-02   <none>           <none>
weave-net-xv7pl                                2/2     Running   1          20h     192.168.62.62    kube-node-24   <none>           <none>
weave-net-z5dq6                                2/2     Running   0          20h     192.168.62.27    kube-node-11   <none>           <none>
weave-net-z6lhg                                2/2     Running   1          20h     192.168.62.36    kube-node-20   <none>           <none>
kubectl get nodes (click to expand)
NAME                   STATUS   ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION   CONTAINER-RUNTIME
kube-node-01   Ready    master   17d   v1.13.5   192.168.62.191   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-02   Ready    <none>   17d   v1.13.5   192.168.62.192   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-03   Ready    <none>   17d   v1.13.5   192.168.62.193   <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-04   Ready    <none>   17d   v1.13.5   192.168.62.20    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-05   Ready    <none>   17d   v1.13.5   192.168.62.21    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-06   Ready    <none>   17d   v1.13.5   192.168.62.22    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-07   Ready    <none>   17d   v1.13.5   192.168.62.23    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-08   Ready    <none>   17d   v1.13.5   192.168.62.24    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-09   Ready    <none>   17d   v1.13.5   192.168.62.25    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-10   Ready    <none>   17d   v1.13.5   192.168.62.26    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-11   Ready    <none>   17d   v1.13.5   192.168.62.27    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-12   Ready    <none>   17d   v1.13.5   192.168.62.28    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-13   Ready    <none>   17d   v1.13.5   192.168.62.29    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-14   Ready    <none>   17d   v1.13.5   192.168.62.61    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-15   Ready    <none>   17d   v1.13.5   192.168.62.31    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-16   Ready    <none>   17d   v1.13.5   192.168.62.32    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-17   Ready    <none>   17d   v1.13.5   192.168.62.33    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-18   Ready    <none>   17d   v1.13.5   192.168.62.34    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-19   Ready    <none>   17d   v1.13.5   192.168.62.35    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-20   Ready    <none>   17d   v1.13.5   192.168.62.36    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-21   Ready    <none>   17d   v1.13.5   192.168.62.37    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-22   Ready    <none>   17d   v1.13.5   192.168.62.38    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-23   Ready    <none>   17d   v1.13.5   192.168.62.39    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-24   Ready    <none>   17d   v1.13.5   192.168.62.62    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-25   Ready    <none>   17d   v1.13.5   192.168.62.41    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-26   Ready    <none>   17d   v1.13.5   192.168.62.42    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-27   Ready    <none>   17d   v1.13.5   192.168.62.43    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-28   Ready    <none>   17d   v1.13.5   192.168.62.44    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-29   Ready    <none>   17d   v1.13.5   192.168.62.45    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-30   Ready    <none>   17d   v1.13.5   192.168.62.46    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-31   Ready    <none>   17d   v1.13.5   192.168.62.47    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-32   Ready    <none>   17d   v1.13.5   192.168.62.48    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-33   Ready    <none>   17d   v1.13.5   192.168.62.49    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-34   Ready    <none>   17d   v1.13.5   192.168.62.63    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-35   Ready    <none>   17d   v1.13.5   192.168.62.51    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-36   Ready    <none>   17d   v1.13.5   192.168.62.52    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-37   Ready    <none>   17d   v1.13.5   192.168.62.53    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-38   Ready    <none>   17d   v1.13.5   192.168.62.54    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-39   Ready    <none>   17d   v1.13.5   192.168.62.55    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-40   Ready    <none>   17d   v1.13.5   192.168.62.56    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-41   Ready    <none>   17d   v1.13.5   192.168.62.57    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3
kube-node-42   Ready    <none>   17d   v1.13.5   192.168.62.58    <none>        Debian GNU/Linux 9 (stretch)   4.9.0-5-amd64    docker://17.3.3

Logs

Debug-enabled logs for weave-net pods:

  • weave-net-vjrvd 192.168.62.25 kube-node-09 -- Last event in logs before restart 07:54:39
  • weave-net-vzcrb 192.168.62.61 kube-node-14 -- Last event in logs before restart 13:00:13
  • weave-net-q48rg 192.168.62.49 kube-node-33 -- Last event in logs before restart 15:51:15
  • weave-net-nh5ts 192.168.62.55 kube-node-39 -- No restart -- allows us to see its view of the other pods' restarts
  • weave-net-ndldh 192.168.62.58 kube-node-42 -- Last event in logs before restart 12:24:39

Dmesg and other command output for each node:

kube-node-09 (click to expand)
+ dmesg -T
[Wed May 15 07:54:10 2019] vxlan: Cannot bind port 49440, err=-98
[Wed May 15 07:54:10 2019] weave: port 13(vethwedu) entered blocking state
[Wed May 15 07:54:10 2019] weave: port 13(vethwedu) entered disabled state
[Wed May 15 07:54:10 2019] device vethwedu entered promiscuous mode
[Wed May 15 07:54:10 2019] device vethwedu left promiscuous mode
[Wed May 15 07:54:10 2019] weave: port 13(vethwedu) entered disabled state
[Wed May 15 07:54:11 2019] device vxlan-6784 left promiscuous mode
[Wed May 15 07:54:11 2019] device vxlan-6784 entered promiscuous mode

+ journalctl -u docker.service --no-pager
May 15 07:54:39 kube-node-09 dockerd[808]: time="2019-05-15T03:54:39.142889718-04:00" level=error msg="Error running exec in container: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused \"process_linux.go:87: adding pid 57049 to cgroups caused \\\"failed to write 57049 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/podbfc1e4c5-7676-11e9-a8c4-0050568733bc/8dccc1f2d608c34cbb0e0ecc0b8ee361adbdfff39778ab54ae6c4a2ee4707023/cgroup.procs: invalid argument\\\"\"\n"
May 15 07:54:39 kube-node-09 dockerd[808]: time="2019-05-15T03:54:39.935735725-04:00" level=warning msg="Unknown healthcheck type 'NONE' (expected 'CMD') in container 4e085481366ea37e420f6161efecbe5008a680b30b191d72fb75a9696856b01a"


+ journalctl -u kubelet --no-pager
no events for the time window

+ netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
datapath  1376 88269311      0      0 0         32613      0      0      0 BMRU
docker0   1500        0      0      0 0             0      0      0      0 BMU
eth0      1500 1010871352      0  16571 0      695149922      0      0      0 BMRU
lo       65536  1612577      0      0 0       1612577      0      0      0 LRU
vethwe-b  1376 114920207      0      0 0      30898466      0      0      0 BMRU
vethwe-d  1376 30898466      0      0 0      114920207      0      0      0 BMRU
vethwepl  1376   158235      0      0 0      88435404      0      0      0 BMRU
vethwepl  1376  6652274      0      0 0      64051627      0      0      0 BMRU
vethwepl  1376 37075124      0      0 0      64671297      0      0      0 BMRU
vethwepl  1376  1995691      0      0 0      33150031      0      0      0 BMRU
vethwepl  1376   855137      0      0 0      32687114      0      0      0 BMRU
vethwepl  1376  1143563      0      0 0      32423429      0      0      0 BMRU
vethwepl  1376 45927332      0      0 0      100503173      0      0      0 BMRU
vethwepl  1376  2401431      0      0 0      66673079      0      0      0 BMRU
vethwepl  1376  4502998      0      0 0      71825698      0      0      0 BMRU
vethwepl  1376 37049646      0      0 0      64646374      0      0      0 BMRU
vethwepl  1376  2170125      0      0 0      33260329      0      0      0 BMRU
vethwepl  1376  2183305      0      0 0      33264875      0      0      0 BMRU
vethwepl  1376  1996672      0      0 0      33165673      0      0      0 BMRU
vethwepl  1376  2044153      0      0 0      33129098      0      0      0 BMRU
vethwepl  1376   815458      0      0 0      32613135      0      0      0 BMRU
vethwepl  1376  3540880      0      0 0      34153121      0      0      0 BMRU
vethwepl  1376 13914069      0      0 0      45574201      0      0      0 BMRU
vethwepl  1376   467007      0      0 0      67796526      0      0      0 BMRU
vethwepl  1376   730820      0      0 0      58091687      0      0      0 BMRU
vethwepl  1376  8105727      0      0 0      97243436      0      0      0 BMRU
vethwepl  1376 19165108      0      0 0      76218767      0      0      0 BMRU
vethwepl  1376  5436026      0      0 0      64588947      0      0      0 BMRU
vethwepl  1376  2597811      0      0 0      59822570      0      0      0 BMRU
vethwepl  1376   840660      0      0 0      32653907      0      0      0 BMRU
vethwepl  1376  5961421      0      0 0      37158879      0      0      0 BMRU
vxlan-67 65485 20653410      0      0 0      37914026      0    123      0 BMRU
weave     1376 588539033      0      0 0      473914751      0      0      0 BMRU

kube-node-14 (click to expand)
+ dmesg -T

[Wed May 15 13:00:48 2019] vxlan: Cannot bind port 52583, err=-98
[Wed May 15 13:00:48 2019] weave: port 11(vethwedu) entered blocking state
[Wed May 15 13:00:48 2019] weave: port 11(vethwedu) entered disabled state
[Wed May 15 13:00:48 2019] device vethwedu entered promiscuous mode
[Wed May 15 13:00:48 2019] device vethwedu left promiscuous mode
[Wed May 15 13:00:48 2019] weave: port 11(vethwedu) entered disabled state
[Wed May 15 13:00:49 2019] device vxlan-6784 left promiscuous mode
[Wed May 15 13:00:49 2019] device vxlan-6784 entered promiscuous mode

+ journalctl -u docker.service --no-pager

May 15 13:01:06 kube-node-14 dockerd[816]: time="2019-05-15T09:01:06.476721927-04:00" level=info msg="Container 3ef224fcc29d6a22c3a4465611f26f7f0ee865f69526d06822c457b1100ed11b failed to exit within 30 seconds of signal 15 - using the force"
May 15 13:01:06 kube-node-14 dockerd[816]: time="2019-05-15T09:01:06.478392010-04:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container 3ef224fcc29d6a22c3a4465611f26f7f0ee865f69526d06822c457b1100ed11b: rpc error: code = 2 desc = no such process"
May 15 13:01:10 kube-node-14 dockerd[816]: time="2019-05-15T09:01:10.096820715-04:00" level=warning msg="Unknown healthcheck type 'NONE' (expected 'CMD') in container c990ac1b5853bb55d8d8d2faaf7b0d4c4fe01f2f6150cf39b5cfce3862513271"
May 15 13:48:51 kube-node-14 dockerd[816]: time="2019-05-15T09:48:51.468116688-04:00" level=warning msg="containerd: unable to save c990ac1b5853bb55d8d8d2faaf7b0d4c4fe01f2f6150cf39b5cfce3862513271:08b4727053005f522b9ca0090a2a5752b69ba8e76000ad0ff862120bfae97063 starttime: read /proc/49085/stat: no such process"
May 15 13:35:51 kube-node-14 dockerd[816]: time="2019-05-15T10:35:51.476756301-04:00" level=warning msg="containerd: unable to save c990ac1b5853bb55d8d8d2faaf7b0d4c4fe01f2f6150cf39b5cfce3862513271:0e618d14f0516a3eab32db48c64bfc25c4479d049ace4f6c73231c2791170560 starttime: read /proc/56494/stat: no such process"
May 15 13:39:21 kube-node-14 dockerd[816]: time="2019-05-15T10:39:21.470588197-04:00" level=warning msg="containerd: unable to save c990ac1b5853bb55d8d8d2faaf7b0d4c4fe01f2f6150cf39b5cfce3862513271:100e43e5a099b305f0258a0c0a697579039460e1508d2e7fe137bb269a5d8546 starttime: read /proc/3101/stat: no such process"

+ journalctl -u kubelet --no-pager
no output

+ netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
datapath  1376 79871878      0      0 0          3206      0      0      0 BMRU
docker0   1500        0      0      0 0             0      0      0      0 BMU
eth0      1500 749975275      0  15507 0      454142980      0      0      0 BMRU
lo       65536  1508630      0      0 0       1508630      0      0      0 LRU
vethwe-b  1376 91048210      0      0 0      14180747      0      0      0 BMRU
vethwe-d  1376 14180747      0      0 0      91048210      0      0      0 BMRU
vethwepl  1376   516079      0      0 0      23785113      0      0      0 BMRU
vethwepl  1376 14021048      0      0 0      62848840      0      0      0 BMRU
vethwepl  1376   412910      0      0 0      59353163      0      0      0 BMRU
vethwepl  1376   590589      0      0 0      23502492      0      0      0 BMRU
vethwepl  1376   623139      0      0 0      23535332      0      0      0 BMRU
vethwepl  1376  2350321      0      0 0      51234270      0      0      0 BMRU
vethwepl  1376   148309      0      0 0      80017721      0      0      0 BMRU
vethwepl  1376 33482254      0      0 0      80516577      0      0      0 BMRU
vethwepl  1376 14074779      0      0 0      62895789      0      0      0 BMRU
vethwepl  1376  7263673      0      0 0      87906753      0      0      0 BMRU
vethwepl  1376   502122      0      0 0      23760831      0      0      0 BMRU
vethwepl  1376 18090397      0      0 0      65627227      0      0      0 BMRU
vethwepl  1376   324237      0      0 0      23277438      0      0      0 BMRU
vethwepl  1376  4494573      0      0 0      54818917      0      0      0 BMRU
vethwepl  1376  1838170      0      0 0      50724322      0      0      0 BMRU
vethwepl  1376  4955970      0      0 0      53806085      0      0      0 BMRU
vethwepl  1376  3348358      0      0 0      26013948      0      0      0 BMRU
vethwepl  1376  3968689      0      0 0      62912267      0      0      0 BMRU
vethwepl  1376   627000      0      0 0      49625135      0      0      0 BMRU
vethwepl  1376   851190      0      0 0      23747070      0      0      0 BMRU
vethwepl  1376  1839227      0      0 0      57826231      0      0      0 BMRU
vethwepl  1376 26517516      0      0 0      47179625      0      0      0 BMRU
vxlan-67 65485   999605      0      0 0        774604      0      0      0 BMRU
weave     1376 415821577      0      0 0      318040715      0      0      0 BMRU

kube-node-33 (click to expand)
+ dmesg -T
[Wed May 15 15:51:41 2019] vxlan: Cannot bind port 43457, err=-98
[Wed May 15 15:51:41 2019] weave: port 8(vethwedu) entered blocking state
[Wed May 15 15:51:41 2019] weave: port 8(vethwedu) entered disabled state
[Wed May 15 15:51:41 2019] device vethwedu entered promiscuous mode
[Wed May 15 15:51:41 2019] device vethwedu left promiscuous mode
[Wed May 15 15:51:41 2019] weave: port 8(vethwedu) entered disabled state
[Wed May 15 15:51:42 2019] device vxlan-6784 left promiscuous mode
[Wed May 15 15:51:42 2019] device vxlan-6784 entered promiscuous mode

+ journalctl -u docker.service --no-pager
May 15 15:52:04 kube-node-33 dockerd[810]: time="2019-05-15T11:52:04.118003139-04:00" level=warning msg="Unknown healthcheck type 'NONE' (expected 'CMD') in container b87c5a2141d67a61333106bafa3b7f60cd0ced35ac69f447e3f61fab7879ae41"

+ journalctl -u kubelet --no-pager
no events around that time

+ netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
datapath  1376 88291704      0      0 0         18462      0      0      0 BMRU
docker0   1500        0      0      0 0             0      0      0      0 BMU
eth0      1500 854406813      0  16831 0      534770722      0      0      0 BMRU
lo       65536  1619996      0      0 0       1619996      0      0      0 LRU
vethwe-b  1376 110214059      0      0 0      25781251      0      0      0 BMRU
vethwe-d  1376 25781251      0      0 0      110214059      0      0      0 BMRU
vethwepl  1376   158249      0      0 0      88451636      0      0      0 BMRU
vethwepl  1376 44472275      0      0 0      99445867      0      0      0 BMRU
vethwepl  1376   468002      0      0 0      67812817      0      0      0 BMRU
vethwepl  1376   650481      0      0 0      32329285      0      0      0 BMRU
vethwepl  1376  2189865      0      0 0      33290890      0      0      0 BMRU
vethwepl  1376  2032217      0      0 0      33145280      0      0      0 BMRU
vethwepl  1376  1171962      0      0 0      32461631      0      0      0 BMRU
vethwepl  1376  5387207      0      0 0      64239152      0      0      0 BMRU
vethwepl  1376  3553015      0      0 0      34173583      0      0      0 BMRU
vethwepl  1376  1165906      0      0 0      32459775      0      0      0 BMRU
vethwepl  1376 18783248      0      0 0      76006217      0      0      0 BMRU
vethwepl  1376  1124687      0      0 0      58474252      0      0      0 BMRU
vethwepl  1376  6426612      0      0 0      63855383      0      0      0 BMRU
vethwepl  1376  1760514      0      0 0      66154344      0      0      0 BMRU
vethwepl  1376  2554199      0      0 0      59816682      0      0      0 BMRU
vethwepl  1376  5325204      0      0 0      72652956      0      0      0 BMRU
vethwepl  1376   856305      0      0 0      32729077      0      0      0 BMRU
vethwepl  1376  8107128      0      0 0      96727505      0      0      0 BMRU
vethwepl  1376   708302      0      0 0      32430217      0      0      0 BMRU
vethwepl  1376  1137318      0      0 0      32431289      0      0      0 BMRU
vethwepl  1376   726757      0      0 0      58105452      0      0      0 BMRU
vethwepl  1376   727372      0      0 0      58102171      0      0      0 BMRU
vxlan-67 65485 10032340      0      0 0      15590336      0      0      0 BMRU
weave     1376 457116686      0      0 0      352001375      0      0      0 BMRU

kube-node-39 (click to expand)
+ dmesg -T
nothing for the time period

+ journalctl -u docker.service --no-pager
nothing for the time period

+ journalctl -u kubelet --no-pager
nothing for the time period

+ netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
datapath  1376 79886962      0      0 0         26172      0      0      0 BMRU
docker0   1500        0      0      0 0             0      0      0      0 BMU
eth0      1500 930718635      0  15464 0      570994050      0      0      0 BMRU
lo       65536  1519085      0      0 0       1519085      0      0      0 LRU
vethwe-b  1376 103136660      0      0 0      27462075      0      0      0 BMRU
vethwe-d  1376 27462075      0      0 0      103136660      0      0      0 BMRU
vethwepl  1376   843463      0      0 0      23753229      0      0      0 BMRU
vethwepl  1376  2168752      0      0 0      58103673      0      0      0 BMRU
vethwepl  1376   148319      0      0 0      80037968      0      0      0 BMRU
vethwepl  1376  3891077      0      0 0      26806178      0      0      0 BMRU
vethwepl  1376 12342121      0      0 0      60814339      0      0      0 BMRU
vethwepl  1376  1563639      0      0 0      24346219      0      0      0 BMRU
vethwepl  1376  3367423      0      0 0      26048174      0      0      0 BMRU
vethwepl  1376  3971763      0      0 0      62933622      0      0      0 BMRU
vethwepl  1376  1016891      0      0 0      50205837      0      0      0 BMRU
vethwepl  1376 13888486      0      0 0      62529007      0      0      0 BMRU
vethwepl  1376 18058471      0      0 0      65493654      0      0      0 BMRU
vethwepl  1376  1136010      0      0 0      50138662      0      0      0 BMRU
vethwepl  1376  4966818      0      0 0      53798811      0      0      0 BMRU
vethwepl  1376   369095      0      0 0      23337267      0      0      0 BMRU
vethwepl  1376  5378169      0      0 0      56076274      0      0      0 BMRU
vethwepl  1376   613572      0      0 0      23992982      0      0      0 BMRU
vethwepl  1376   628192      0      0 0      49640642      0      0      0 BMRU
vethwepl  1376   602197      0      0 0      23959575      0      0      0 BMRU
vethwepl  1376  3902053      0      0 0      26806927      0      0      0 BMRU
vethwepl  1376 34178381      0      0 0      80727790      0      0      0 BMRU
vethwepl  1376   412591      0      0 0      59372968      0      0      0 BMRU
vethwepl  1376 26204639      0      0 0      46922074      0      0      0 BMRU
vethwepl  1376   851145      0      0 0      23752469      0      0      0 BMRU
vethwepl  1376  4906902      0      0 0      52762987      0      0      0 BMRU
vethwepl  1376  4853889      0      0 0      85149311      0      0      0 BMRU
vethwepl  1376   627240      0      0 0      49641463      0      0      0 BMRU
vethwepl  1376  4182867      0      0 0      26993210      0      0      0 BMRU
vethwepl  1376  2036563      0      0 0      50886646      0      0      0 BMRU
vxlan-67 65485  8572539      0      0 0       8038184      0      0      0 BMRU
weave     1376 489469478      0      0 0      385308354      0      0      0 BMRU
kube-node-42 (click to expand)
+ dmesg -T
[Wed May 15 12:24:54 2019] vxlan: Cannot bind port 46382, err=-98
[Wed May 15 12:24:54 2019] weave: port 22(vethwedu) entered blocking state
[Wed May 15 12:24:54 2019] weave: port 22(vethwedu) entered disabled state
[Wed May 15 12:24:54 2019] device vethwedu entered promiscuous mode
[Wed May 15 12:24:54 2019] device vethwedu left promiscuous mode
[Wed May 15 12:24:54 2019] weave: port 22(vethwedu) entered disabled state
[Wed May 15 12:24:55 2019] device vxlan-6784 left promiscuous mode
[Wed May 15 12:24:55 2019] device vxlan-6784 entered promiscuous mode

+ journalctl -u docker.service --no-pager
May 15 12:25:16 idc-eco-prod-kube-42 dockerd[799]: time="2019-05-15T08:25:16.145224224-04:00" level=warning msg="Unknown healthcheck type 'NONE' (expected 'CMD') in container 240c01fd46571e7f9da0bed1ad1a185ca15b4e81d0cfff3bd8ed714f8ebdd079"

+ journalctl -u kubelet --no-pager
no logs

+ netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
datapath  1376 88577770      0      0 0          4205      0      0      0 BMRU
docker0   1500        0      0      0 0             0      0      0      0 BMU
eth0      1500 969913704      0  16740 0      589661394      0      0      0 BMRU
lo       65536  1617694      0      0 0       1617694      0      0      0 LRU
vethwe-b  1376 122889041      0      0 0      38111417      0      0      0 BMRU
vethwe-d  1376 38111417      0      0 0      122889041      0      0      0 BMRU
vethwepl  1376 24310095      0      0 0      79952339      0      0      0 BMRU
vethwepl  1376   856677      0      0 0      32990912      0      0      0 BMRU
vethwepl  1376   468731      0      0 0      68095547      0      0      0 BMRU
vethwepl  1376   736316      0      0 0      58395932      0      0      0 BMRU
vethwepl  1376   504592      0      0 0      32163624      0      0      0 BMRU
vethwepl  1376  6469812      0      0 0      64169644      0      0      0 BMRU
vethwepl  1376  6059487      0      0 0      37549172      0      0      0 BMRU
vethwepl  1376  4534436      0      0 0      35797371      0      0      0 BMRU
vethwepl  1376  5462895      0      0 0      64894328      0      0      0 BMRU
vethwepl  1376  7148589      0      0 0      96569069      0      0      0 BMRU
vethwepl  1376 19032073      0      0 0      76530714      0      0      0 BMRU
vethwepl  1376  2333877      0      0 0      66961961      0      0      0 BMRU
vethwepl  1376  2055192      0      0 0      33444896      0      0      0 BMRU
vethwepl  1376   807953      0      0 0      32922493      0      0      0 BMRU
vethwepl  1376  1178350      0      0 0      16230754      0      0      0 BMRU
vethwepl  1376   911963      0      0 0      32524759      0      0      0 BMRU
vethwepl  1376 39159306      0      0 0      67359639      0      0      0 BMRU
vethwepl  1376   158591      0      0 0      88734307      0      0      0 BMRU
vethwepl  1376  4603458      0      0 0      72221675      0      0      0 BMRU
vethwepl  1376 45394100      0      0 0      100466503      0      0      0 BMRU
vethwepl  1376  7228246      0      0 0      64644074      0      0      0 BMRU
vxlan-67 65485  4326431      0      0 0       3457232      0      0      0 BMRU
weave     1376 500219888      0      0 0      396786475      0      0      0 BMRU

@ReillyTevera
Contributor

This might be the same issue as #3619

@murali-reddy
Contributor

@pfcarrier thanks for the logs. I don't see anything particularly suspicious.

For example, it appears kube-node-24 was cut off from the rest of the nodes in the cluster, which is in line with your original description of the condition under which the problem happens. However, from the logs, connections are gracefully re-established to kube-node-24.

snip from weave-net-vjrvd

INFO: 2019/05/15 05:26:44.215702 overlay_switch ->[06:07:7f:a1:ae:96(kube-node-24)] sleeve timed out waiting for UDP heartbeat
DEBU: 2019/05/15 05:26:45.053689 fastdp ->[192.168.62.62:6784|06:07:7f:a1:ae:96(kube-node-24)]: missed Heartbeat from peer, marking fastdp forwarder as un-healthy
DEBU: 2019/05/15 05:26:45.053769 overlay_switch ->[06:07:7f:a1:ae:96(kube-node-24)] Removing fastdp from the list of forwarders
INFO: 2019/05/15 05:26:45.056608 ->[192.168.62.62:6783|06:07:7f:a1:ae:96(kube-node-24)]: connection shutting down due to error: no working forwarders to 06:07:7f:a1:ae:96(kube-node-24)
INFO: 2019/05/15 05:26:45.056740 ->[192.168.62.62:6783|06:07:7f:a1:ae:96(kube-node-24)]: connection deleted
INFO: 2019/05/15 05:28:41.211163 ->[192.168.62.62:57563|06:07:7f:a1:ae:96(kube-node-24)]: connection fully established
INFO: 2019/05/15 07:54:42.037303 ->[192.168.62.62:6783|06:07:7f:a1:ae:96(kube-node-24)]: connection fully established

snip from weave-net-vzcrb

INFO: 2019/05/15 05:26:44.215702 overlay_switch ->[06:07:7f:a1:ae:96(kube-node-24)] sleeve timed out waiting for UDP heartbeat
DEBU: 2019/05/15 05:26:45.053689 fastdp ->[192.168.62.62:6784|06:07:7f:a1:ae:96(kube-node-24)]: missed Heartbeat from peer, marking fastdp forwarder as un-healthy
DEBU: 2019/05/15 05:26:45.053769 overlay_switch ->[06:07:7f:a1:ae:96(kube-node-24)] Removing fastdp from the list of forwarders
INFO: 2019/05/15 05:26:45.056608 ->[192.168.62.62:6783|06:07:7f:a1:ae:96(kube-node-24)]: connection shutting down due to error: no working forwarders to 06:07:7f:a1:ae:96(kube-node-24)
INFO: 2019/05/15 05:26:45.056740 ->[192.168.62.62:6783|06:07:7f:a1:ae:96(kube-node-24)]: connection deleted
INFO: 2019/05/15 05:26:51.633516 Removed unreachable peer 06:07:7f:a1:ae:96(kube-node-24)
INFO: 2019/05/15 05:28:41.069885 ->[192.168.62.62:57563|06:07:7f:a1:ae:96(kube-node-24)]: connection ready; using protocol version 2
INFO: 2019/05/15 05:28:41.211163 ->[192.168.62.62:57563|06:07:7f:a1:ae:96(kube-node-24)]: connection fully established

snip from weave-net-nh5ts

INFO: 2019/05/15 05:26:41.702386 ->[192.168.62.62:52643|06:07:7f:a1:ae:96(kube-node-24)]: connection shutting down due to error: read tcp4 192.168.62.55:6783->192.168.62.62:52643: i/o timeout
INFO: 2019/05/15 05:26:41.702499 ->[192.168.62.62:52643|06:07:7f:a1:ae:96(kube-node-24)]: connection deleted
INFO: 2019/05/15 05:26:51.634263 Removed unreachable peer 06:07:7f:a1:ae:96(kube-node-24)
INFO: 2019/05/15 05:28:41.064029 ->[192.168.62.62:57321|06:07:7f:a1:ae:96(kube-node-24)]: connection ready; using protocol version 2
INFO: 2019/05/15 05:28:41.064197 overlay_switch ->[06:07:7f:a1:ae:96(kube-node-24)] using fastdp
INFO: 2019/05/15 05:28:41.064265 ->[192.168.62.62:57321|06:07:7f:a1:ae:96(kube-node-24)]: connection added (new peer)
DEBU: 2019/05/15 05:28:41.105326 fastdp ->[192.168.62.62:6784|06:07:7f:a1:ae:96(kube-node-24)]: confirmed
DEBU: 2019/05/15 05:28:41.105423 sleeve ->[<nil>|06:07:7f:a1:ae:96(kube-node-24)]: Confirm
DEBU: 2019/05/15 05:28:41.105529 sleeve ->[<nil>|06:07:7f:a1:ae:96(kube-node-24)]: confirmed
DEBU: 2019/05/15 05:28:41.105544 fastdp ->[192.168.62.62:6784|06:07:7f:a1:ae:96(kube-node-24)]: sending Heartbeat to peer
DEBU: 2019/05/15 05:28:41.106732 sleeve ->[<nil>|06:07:7f:a1:ae:96(kube-node-24)]: handleHeartbeat
DEBU: 2019/05/15 05:28:41.106769 sleeve ->[192.168.62.62:6783|06:07:7f:a1:ae:96(kube-node-24)]: sendHeartbeat
DEBU: 2019/05/15 05:28:41.139295 sleeve ->[192.168.62.62:6783|06:07:7f:a1:ae:96(kube-node-24)]: handleHeartbeatAck
DEBU: 2019/05/15 05:28:41.139327 sleeve ->[192.168.62.62:6783|06:07:7f:a1:ae:96(kube-node-24)]: sendFragTest
DEBU: 2019/05/15 05:28:41.139680 sleeve ->[192.168.62.62:6783|06:07:7f:a1:ae:96(kube-node-24)]: sendMTUTest: mtu candidate 1438
INFO: 2019/05/15 05:28:41.139775 overlay_switch ->[06:07:7f:a1:ae:96(kube-node-24)] using sleeve
INFO: 2019/05/15 05:28:41.140105 ->[192.168.62.62:57321|06:07:7f:a1:ae:96(kube-node-24)]: connection fully established

After the connections are re-established I do see the activity of learning the MACs and configuring ODP to forward packets to kube-node-24. So I am not sure what could be happening.

Perhaps when this happens, you can try to trace the pod-to-pod traffic and see where it's getting dropped.
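
For example, assuming you test with ping between two pods, tcpdump at each hop should show where the packets stop; something along these lines (replace <pod-ip> with the destination pod's address):

# on the source node: does the pod's traffic make it onto the weave bridge?
tcpdump -ni weave icmp and host <pod-ip>
# does it leave the node encapsulated (fastdp vxlan on UDP 6784, sleeve on UDP 6783)?
tcpdump -ni eth0 udp port 6784 or udp port 6783
# on the destination node: does it arrive and get decapsulated back onto the bridge?
tcpdump -ni eth0 udp port 6784 or udp port 6783
tcpdump -ni weave icmp and host <pod-ip>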

@WMostert1

I'm seeing the exact same type of issue. Our environments are almost exactly the same. I also lose all inter-pod communication on random VMs in a VMware environment. Thank you for the workaround of adding connectivity checks to the pods' health checks @pfcarrier, great idea. Once this happens again I will post info as well. In my case it takes a few hours, sometimes a day or two, to happen.

@pfcarrier
Author

Thanks for the report @WMostert1. Glad to hear the workaround helped :-)

Digging deeper into the issue, we ended up tracking the placement of each VM on its ESX host in the VMware cluster, and cross-matching that with the times of the events where a weave-net pod restarted due to losing inter-pod communication.

This highlighted that 25% of those events matched the exact moment a vMotion occurred, confirming @murali-reddy's early analysis that something in the VMware environment or network was introducing a glitch.

The remaining 75% of cases is a bit of guesswork, but we believe they are ultimately linked to VMs having a very high (15%+) VMWAIT% (as reported by esxtop). We theorized that this tug of war between VMs fighting for CPU was the reason for the high amount of packet drops, which in turn was a source of trouble for any UDP traffic, which weave happens to use.

My understanding is that in VMware a 4-vCPU VM needs to reserve all four vCPUs before it can use any of them, which can become an issue if the VM is co-located on a host with other VMs that have both a very high number of vCPUs and high CPU usage.

At the time of our observations our Kubernetes cluster was composed of VMs sporting 16 vCPUs each; on the ESX cluster there were also a few whale VMs at 50 vCPUs as well as many behemoths at 24 vCPUs. Given that each ESX host had only 32 real cores, this was creating a lot of contention.

Sadly, by the time we proceeded to right-size the whales and behemoths, the Kubernetes cluster running weave had been decommissioned.

While VMware is not my area of expertise and I cannot confirm 100% that this would have fixed our issue, my hunch would be to recommend the following to VMware users running into the same scenario:

  • Check whether esxtop is reporting a very high VMWAIT% for your Kubernetes nodes.
  • Confirm or rule out whether the issue coincides with vMotion events.
  • Consider lowering the number of vCPUs on your k8s nodes and adding more nodes to your cluster instead; VMware seems happier with many small VMs than with a few gorilla ones.
  • If lowering the vCPU count is not possible, consider using resource pools combined with CPU shares to limit the tug of war.

@pfcarrier
Author

I will close the issue since I have lost the ability to iterate on it, the cluster no longer being in existence.

As a last report, I would like to share that last week we observed on that same ESX cluster an issue where a dozen non-Kubernetes VMs all lost network access for a few minutes at the same time. This happened after the VM right-sizing was complete, so my hypothesis in the comment just above, potentially linking the issue to CPU contention, is likely incorrect or at best was only a factor. Hope it helps.

@debMan

debMan commented Sep 21, 2021

Thanks a lot, @pfcarrier, for the detailed comments.

While VMware is not my area of expertise and I cannot confirm 100% that this would have fixed our issue, my hunch would be to recommend the following to VMware users running into the same scenario:

  • Check whether esxtop is reporting a very high VMWAIT% for your Kubernetes nodes.
  • Confirm or rule out whether the issue coincides with vMotion events.
  • Consider lowering the number of vCPUs on your k8s nodes and adding more nodes to your cluster instead; VMware seems happier with many small VMs than with a few gorilla ones.
  • If lowering the vCPU count is not possible, consider using resource pools combined with CPU shares to limit the tug of war.

We have exactly the same issue on a single VMware ESXi host containing 3 Kubernetes node VMs, with more vCPUs allocated to the VMs than the host has.
I'll try your hypothesis and reply here.
