
RKE2 Container Network Interface (not able to create any running pods anymore) #3425

Closed
smartlocus opened this issue Oct 5, 2022 · 34 comments

@smartlocus commented Oct 5, 2022

Environmental Info:
RKE2 Version:
v1.24.6+rke2r1

Node(s) CPU architecture, OS, and Version:
One node with roles control-plane,etcd,master; 16 CPU cores
Cluster Configuration:
One RKE2 server, running and in Ready status

Describe the bug:
Everything was running well, but after 2 days every pod I create is stuck in ContainerCreating status.

kubectl describe pod ......
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "3ef6f72e24d432e63eeb31792529fca8b6908fdd7423961b345846eaa0fdcea3": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized

Steps To Reproduce:

  • Installed RKE2:
# On rancher1
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE=server sh - 

# start and enable for restarts
systemctl enable rke2-server.service 
systemctl start rke2-server.service

Expected behavior:

Podname    Status
nginx      Running

Actual behavior:

Podname    Status
nginx      ContainerCreating

Every newly created pod is stuck in ContainerCreating status.

Additional context / logs:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "3ef6f72e24d432e63eeb31792529fca8b6908fdd7423961b345846eaa0fdcea3": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
@brandond (Member) commented Oct 5, 2022

Did you deploy anything else to your cluster, or make any modifications to it during this time? Did you literally just let it sit idle for 2 days? Can you attach the complete output of journalctl -u rke2-server --no-pager >rke2.log and everything under /var/log/pods?

@smartlocus (Author) commented Oct 5, 2022

@brandond I only have Rancher, cert-manager, a GitLab agent, and an nginx deployment running in my cluster. All were deployed on the first day I installed RKE2 (two days ago). The nginx deployment was created the day I installed RKE2 and has been running until now. But now, if I try to create or kill any pod, it gets stuck in a ContainerCreating or Terminating state. The only things I did today were creating a new user with admin rights in Rancher and creating a new project.

journalctl -u rke2-server --no-pager >rke2.log is not printing anything out.
[Screenshot (285) attached]

/var/log/pods#
drwxr-xr-x 33 root root   4096 Oct  5 22:27 .
drwxrwxr-x 17 root syslog 4096 Oct  5 00:00 ..
drwxr-xr-x  3 root root   4096 Oct  3 13:45 cattle-fleet-local-system_fleet-agent-59d884fc56-gzc5s_3f9c8534-fc45-4ed8-b229-32c03869880e
drwxr-xr-x  3 root root   4096 Oct  3 13:44 cattle-fleet-system_fleet-controller-58db4bf695-trg9m_d12cc503-5379-4159-bc8e-32432d775331
drwxr-xr-x  3 root root   4096 Oct  3 13:44 cattle-fleet-system_gitjob-8ccfb5499-r6c6d_cf02d803-d116-45b7-a03a-8568368e2c98
drwxr-xr-x  3 root root   4096 Oct  3 13:43 cattle-system_rancher-69595dc9c4-2bc59_22047614-8a0d-46e0-b775-e87b57a44097
drwxr-xr-x  3 root root   4096 Oct  3 13:43 cattle-system_rancher-69595dc9c4-9xh65_6d64d801-520e-4b6f-a242-b75157d1ac27
drwxr-xr-x  3 root root   4096 Oct  3 13:43 cattle-system_rancher-69595dc9c4-sq74j_b8116439-bf42-42d5-b77d-54ae289ee54c
drwxr-xr-x  3 root root   4096 Oct  3 13:45 cattle-system_rancher-webhook-576c5b6859-q6rl2_563664de-4e10-4742-ba91-b22f24e3d2fa
drwxr-xr-x  3 root root   4096 Oct  3 13:42 cert-manager_cert-manager-877fd747c-t24gq_6fcbbe40-dd7e-4bb7-94a9-33e47cca4749
drwxr-xr-x  3 root root   4096 Oct  3 13:42 cert-manager_cert-manager-cainjector-bbdb88874-wdkx4_35c4c7f9-0ff9-43e8-b304-8dd93bd2b170
drwxr-xr-x  3 root root   4096 Oct  3 13:42 cert-manager_cert-manager-webhook-5774d5d8f7-bz5md_aded152d-f18d-4ba8-9314-38021fe47234
drwxr-xr-x  2 root root   4096 Oct  5 20:22 default_awet-6fb767bf55-zshbv_fbcb3288-535f-4c03-ae60-4e0b84d9bea2
drwxr-xr-x  3 root root   4096 Oct  3 15:50 default_my-apche-5d888c68b6-65s9p_f44b4318-a7ad-4b15-a584-76351e37ebc4
drwxr-xr-x  2 root root   4096 Oct  5 19:43 default_my-apche-5d888c68b6-mm7wz_eecbb829-5b79-4e2e-8f02-546941496318
drwxr-xr-x  2 root root   4096 Oct  5 21:35 default_noh-76b6df9659-qnrtl_d9e1ed93-ab26-442c-b55b-d71e1695e0e0
drwxr-xr-x  2 root root   4096 Oct  5 20:58 default_test-764c85dd84-k5489_9ae7404d-37a6-419e-98a6-e48c031f243f
drwxr-xr-x  3 root root   4096 Oct  3 17:21 gitlab-agent_primary-agent-gitlab-agent-6f99c99894-f54nf_2bf53baa-24b4-4eea-ad2f-8bd09f811505
drwxr-xr-x  3 root root   4096 Oct  3 13:01 kube-system_cloud-controller-manager-server3_ff943ee0c21582fe1d1f4345a1af14e9
drwxr-xr-x  3 root root   4096 Oct  3 13:00 kube-system_etcd-server3_c7e427214f1134527952cb171b3ac3cc
drwxr-xr-x  3 root root   4096 Oct  3 13:01 kube-system_kube-apiserver-server3_1f6328c150f2fa5f124fa1cadcbcb510
drwxr-xr-x  3 root root   4096 Oct  3 13:01 kube-system_kube-controller-manager-server3_cbb4392e0a198ac90c40528fd37eb8c8
drwxr-xr-x  3 root root   4096 Oct  3 13:01 kube-system_kube-proxy-server3_a4ace2fcb2a4f79df3b6527b20071a68
drwxr-xr-x  3 root root   4096 Oct  3 13:01 kube-system_kube-scheduler-server3_6d16ed6f54a309dd8f0327f67ac1c250
drwxr-xr-x  6 root root   4096 Oct  3 13:02 kube-system_rke2-canal-7sjmf_6f7cf4b1-54cb-4bcb-97cc-f12acea66078
drwxr-xr-x  3 root root   4096 Oct  3 13:02 kube-system_rke2-coredns-rke2-coredns-76cb76d66-vbjwx_57451da7-e255-4d61-b957-d63ed0d603d0
drwxr-xr-x  3 root root   4096 Oct  3 13:02 kube-system_rke2-coredns-rke2-coredns-autoscaler-58867f8fc5-ln9fg_545b60c0-059a-43b8-ad38-1e76c07645af
drwxr-xr-x  3 root root   4096 Oct  3 13:02 kube-system_rke2-ingress-nginx-controller-hbqg7_04a28e7e-4192-42cd-b426-3676155583d3
drwxr-xr-x  3 root root   4096 Oct  3 13:02 kube-system_rke2-metrics-server-6979d95f95-v5fp9_24de7289-0d33-4ed0-a76a-26a5541d8387
drwxr-xr-x  5 root root   4096 Oct  5 22:27 kube-system_weave-net-9gp6j_be24f045-949e-42e1-b906-80c559a0286c
drwxr-xr-x  3 root root   4096 Oct  3 21:01 ns-gitlab-agent-self-created_my-nginx_925a4e69-6014-4c43-ac2b-99bf348a8a48
drwxr-xr-x  2 root root   4096 Oct  5 19:41 robi_my-apche-5d888c68b6-8k4dh_2e931fc0-a33c-420e-b68c-02d3a7b348da
drwxr-xr-x  2 root root   4096 Oct  5 19:37 robi_nginx-8f458dc5b-ndnxp_388e52b2-539b-4afe-900d-5b432e58f7fb

@brandond (Member) commented Oct 5, 2022

journalctl -u rke2-server --no-pager >rke2.log is not printing anything out.

No, this creates a file that you're supposed to attach to a comment.

drwxr-xr-x 3 root root 4096 Oct 3 13:45 cattle-fleet-local-system_fleet-agent-59d884fc56-gzc5s_3f9c8534-fc45-4ed8-b229-32c03869880e

I didn't want a directory listing; I was asking you to tar/zip them up and attach them to a comment.
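
For example, something like this should produce both artifacts (run as root on the server node; the archive name is arbitrary):

# dump the rke2-server journal to a file
journalctl -u rke2-server --no-pager > rke2.log
# bundle everything under /var/log/pods into one archive to attach
tar czf rke2-pod-logs.tar.gz /var/log/pods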

@brownz11 commented Oct 6, 2022

I can also confirm seeing this issue on a brand new v1.24.6+rke2r1 cluster, and on ones upgraded to 1.24.6. I've managed to reproduce it on CentOS 8 (single node), RHEL 8 (3m/3w, 3m/2w), as well as Ubuntu 20.04.5 LTS (3 all-role nodes).

Repro is as simple as:

  1. Install RKE2
  2. Wait for token expiration (24 hours)
  3. Try to run a pod

My troubleshooting shows that the token for the CNI in /etc/cni/net.d/calico-kubeconfig has expired. Restarting the Canal pods fixes it for 24 hours, until the token expires again.

Calico does have a built-in process to watch this token and update it; however, the container appears not to have the hostPath mount needed to actually update that file.

2022-10-05T16:22:06.130024187-04:00 stdout F 2022-10-05 20:22:06.129 [INFO][51] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2022-10-05T16:22:06.130484848-04:00 stdout F 2022-10-05 20:22:06.130 [ERROR][51] cni-config-monitor/token_watch.go 276: Failed to write CNI plugin kubeconfig file error=open /host/etc/cni/net.d/calico-kubeconfig: no such file or directory

I can also see in the kube-apiserver logs, after trying to create a pod, it logs

1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"

I assume this is the CNI attempting to use the expired token, and getting turned away.
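
A quick way to confirm this on an affected node (a rough check, assuming the default Canal paths and the kubectl bundled with RKE2) is to hit the apiserver with the same kubeconfig the CNI plugin uses:

# an expired service-account token in the CNI kubeconfig comes back as Unauthorized,
# which matches the error the calico plugin reports during pod sandbox setup
KCFG=/etc/cni/net.d/calico-kubeconfig
/var/lib/rancher/rke2/bin/kubectl --kubeconfig "$KCFG" get clusterinformations.crd.projectcalico.org default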

I've attached the output of journalctl -u rke2-server --no-pager >rke2.log and all the pod logs on the node from the CentOS 8 Single Node Cluster. Happy to grab anything else.
rke2_issue3425.zip

Looking at my other clusters running 1.23, they appear to have the same error from the canal pod; however, the on-disk token is good for a year. (That may explain why projectcalico/calico#5712 (comment) fixes the same sort of issue too; this may have always been broken in Calico, but the reduced lifetime of the token has highlighted it.)

(Full Disclosure if I look familiar: I've been working with SUSE Support on this under case #00364320)

@brandond (Member) commented Oct 6, 2022

cc @manuelbuil @rbrtbnfgl

@jzandbergen

Just had the same problem and yolo fixed it with:

kubectl delete pod -l k8s-app=canal

@dkeightley (Contributor)

Thanks @rbrtbnfgl, would this fix be included as a backport to v1.24 (or earlier)? The reason is that v1.25 is not yet supported by Rancher, and may not be for some time.

@kphunter commented Oct 7, 2022

I've run into this as well, first while running the latest stable version of RKE2, and then, after a rebuild, on v1.22.15+rke2r1. Both clusters exhibited an inability to create new pods roughly 24 hours after standing them up, as @brownz11 describes. Is there a new version of calico/canal that has been incorporated into the latest patches/versions of RKE2?

Restarting the nodes temporarily fixes the issue...

@smartlocus (Author)

Adding a second CNI solved the problem for me. That means whenever the Calico CNI dies, Cilium takes over and we have a working CNI again. Even the reaction time of my Rancher GUI has become much faster.
I used Cilium as my second CNI.
Implementation method: apply the Cilium plugin in the kube-system namespace (follow the Cilium documentation).

@rbrtbnfgl (Contributor)

I am running an RKE2 setup to check whether it fixes the issue. Changing the date on the node will reproduce the issue, but it doesn't trigger the token renewal.
I started it yesterday, and from the logs Calico is renewing the token after 5 hours. I'll let the cluster run for a day and check that new pods no longer fail. If this fixes the bug, I'll backport the patch to the 1.24 version.
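
For anyone who wants to reproduce this without waiting a full day, a rough sketch of jumping the node clock forward (NTP has to be disabled first so the change sticks; only do this on a disposable test node):

# skew the clock past the 24h token lifetime; this disturbs everything else on the host
sudo timedatectl set-ntp false
sudo date -s "$(date -d '+25 hours')"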

@brownz11 commented Oct 7, 2022

While the patches bake, we found a simple (if hacky) workaround: schedule a restart of the canal DaemonSet every 12 hours with cron.

Just schedule it on one of the master nodes, and make sure it runs as a user that can read the kubeconfig yaml.

0 */12 * * * KUBECONFIG=/etc/rancher/rke2/rke2.yaml /var/lib/rancher/rke2/bin/kubectl rollout restart ds/rke2-canal -n kube-system

You can also fix the issue on demand for another 24 hours via:
kubectl rollout restart ds/rke2-canal -n kube-system

@linuxpham

Thanks so much!

@smartlocus (Author) commented Oct 7, 2022

@rbrtbnfgl @brandond Is there any SAFE way I can change the CNI plugin to Cilium without messing up my node? I tried to change it before, it messed up my whole cluster, and I had to reinstall it. It is currently using Calico as the CNI, which was installed by default. It would be nice if you could provide me with an example.
Thanks in advance!

@brandond (Member) commented Oct 7, 2022

No, we don't support changing the CNI on a running cluster. There is not really any good way to do that.

@smartlocus (Author) commented Oct 7, 2022

@brandond How about before installing the RKE2 server? Is there a way I can set the CNI plugin to Cilium during the installation?
For example, like this:
curl -sfL https://get.rke2.io | sh - -- set cni=cilium
systemctl enable rke2-server.service
systemctl start rke2-server.service

@brandond (Member) commented Oct 7, 2022

Just create the config file before installing and starting RKE2.

https://docs.rke2.io/install/install_options/install_options/#configuration-file

mkdir -p /etc/rancher/rke2
echo "cni: cilium" >> /etc/rancher/rke2/config.yaml
curl -sfL https://get.rke2.io/ | sh -
systemctl enable --now rke2-server.service
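
Once the node is up, a quick sanity check (a sketch; the exact pod names depend on the chart version, and the test pod name below is arbitrary):

# the cilium pods should be running in kube-system instead of rke2-canal
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get pods -n kube-system | grep -i cilium
# a throwaway pod should reach Running instead of sticking in ContainerCreating
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml run cni-test --image=nginx --restart=Never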

@smartlocus (Author) commented Oct 7, 2022

@brandond
Question: should my config file look like this, or do I need to add other things? I am currently a beginner with this :)

network:
  plugin: cilium

@brandond (Member) commented Oct 7, 2022

No, it should look exactly like the example I showed above.

@smartlocus (Author) commented Oct 7, 2022

@brandond Where is the CNI declared in this case? I cannot see the cni key in this config file. I extracted this config file from the link you sent me above.

write-kubeconfig-mode: "0644"
tls-san:
  - "foo.local"
node-label:
  - "foo=bar"
  - "something=amazing"

@brandond (Member) commented Oct 7, 2022

That's not the example. The example is right below that, and shows cni: cilium being written to the config file.

That's it.

That's all you need to put in there.

@ilbarone87 commented Oct 8, 2022

@brandond this is my config (written before installing) to change the CNI. I have 3 masters + 2 workers, so it also declares an FQDN for HA and the IP for that HA.

tls-san:
- node1
- node1.local.mydomain.com
- cluster.mydomain.com
- YOUR_CLUSTER_IP
disable: rke2-ingress-nginx #this is if you use traefik, if not just delete it
cni:
- cilium

@brandond (Member) commented Oct 8, 2022

You shouldn't need to disable anything or customize the TLS SANs, but you're welcome to do so. Syntactically it looks good.

@jonaz commented Oct 10, 2022

Will this be backported to 1.22? We just upgraded multiple clusters from 1.21 to 1.22, and after 24h they stopped working 😢

@thomashoell commented Oct 10, 2022

Can we get a list of affected RKE2 versions, or at least the latest unaffected one? I need to create a few clusters in the near future and I'd rather not have them broken the next day.

@jonaz commented Oct 10, 2022

This is the commit in Calico that changed the token lifetime from 1 year (if you have the default --service-account-extend-token-expiration flag in kube-apiserver) to 24h: projectcalico/calico@2b3469b
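
To see whether that flag is in play on an RKE2 server, one way (a sketch, assuming the default RKE2 manifest path) is to grep the kube-apiserver static pod manifest:

# list the service-account flags RKE2 rendered into the apiserver static pod;
# if --service-account-extend-token-expiration is absent, the upstream default (true) applies
grep -o -- '--service-account[^" ]*' /var/lib/rancher/rke2/agent/pod-manifests/kube-apiserver.yaml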

The only weird thing is that RKE2 1.22 is not using Calico 3.24, where this was changed?

@rbrtbnfgl (Contributor)

The issue is from Calico 3.23, which was updated in the latest versions.

@smartlocus (Author) commented Oct 10, 2022

@rbrtbnfgl what would be, for now, the most efficient way to work around this problem temporarily? I don't want my containers to break after 24 hours, since I have important containers running that must not go out of service.

@jonaz commented Oct 11, 2022

The issue is from Calico 3.23, which was updated in the latest versions.

Ah yeah, now I found the commit in 3.23 as well: projectcalico/calico@34e7fec

@rbrtbnfgl (Contributor)

#3425 (comment) could be the most efficient solution until the new release is out.

@brandond (Member) commented Oct 11, 2022

I will also note that we had added similar token handling code for our Calico-on-Windows support a while back, as the approach used by upstream was identified as problematic:

@manuelbuil (Contributor) commented Oct 11, 2022

Will this be backported to 1.22? We just upgraded multiple clusters from 1.21 to 1.22, and after 24h they stopped working 😢

Yes. There will be an r2 of 1.22.15 with this fix.

@mstrent commented Oct 13, 2022

We also just hit this after upgrading to v1.22.15+rke2r1 last night. Thank goodness for that workaround.

@JackFish

Updating the DaemonSet kube-system/rke2-canal to the latest official images from Rancher resolved this issue for me.

@ShylajaDevadiga (Contributor)

Validated on master branch using commit id 532aed3

Environment Details

Infrastructure

Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:

Ubuntu 20.04

Cluster Configuration:

3 servers
1 agent

Config.yaml:

cat /etc/rancher/rke2/config.yaml
cni: canal

Steps to reproduce the issue and validate the fix

  1. Copy config.yaml
  2. Install rke2 with cni: canal (default cni)
  3. Deploy a pod
  4. Deploy another pod 24h after the cluster is up

Issue: Pod stuck in ContainerCreating state

ubuntu@ip-172-31-3-160:~$ rke2 -v
rke2 version v1.25.2+rke2r1 (851733a1cefcbe182094eece241b9b75fe6aca5e)
go version go1.19 X:boringcrypto
ubuntu@ip-172-31-3-160:~$ kubectl get pod
NAME     READY   STATUS              RESTARTS   AGE
nginx1   1/1     Running             0          26h
nginx2   0/1     ContainerCreating   0          2m25s
ubuntu@ip-172-31-3-160:~$ kubectl describe  pod nginx2 |tail -1
  Warning  FailedCreatePodSandBox  8s (x3 over 36s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "607baf9aeac2e8fc87ce7190a5552848248f983c41e634fa2140db61c31ceb4e": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized

Validation: Pod was created successfully with the fix

ubuntu@ip-172-31-14-135:~$ rke2 -v
rke2 version v1.25.2-dev+532aed3e (532aed3e42cf975608fba00037a14e559bdb0c25)
go version go1.19 X:boringcrypto
ubuntu@ip-172-31-14-135:~$ kubectl get pods
NAME     READY   STATUS    RESTARTS   AGE
nginx1   1/1     Running   0          26h
nginx2   1/1     Running   0          56s

Validation: After upgrade pod was created successfully with the fix

ubuntu@ip-172-31-3-160:~$ rke2 -v
rke2 version v1.25.2+rke2r1 (851733a1cefcbe182094eece241b9b75fe6aca5e)
ubuntu@ip-172-31-3-160:~$ kubectl get pods
NAME     READY   STATUS              RESTARTS   AGE
nginx1   1/1     Running             0          35h
nginx2   0/1     ContainerCreating   0          8h

ubuntu@ip-172-31-3-160:~$ rke2 -v
rke2 version v1.25.2-dev+532aed3e (532aed3e42cf975608fba00037a14e559bdb0c25)

ubuntu@ip-172-31-3-160:~$ kubectl get pods
NAME     READY   STATUS    RESTARTS   AGE
nginx1   1/1     Running   0          35h
nginx2   1/1     Running   0          8h
nginx3   1/1     Running   0          5m50s
