RKE2 Container Network Interface (not able to create any running pods anymore) #3425
Comments
Did you deploy anything else to your cluster, or make any modifications to it during this time? Did you literally just let it sit idle for 2 days? Can you attach the complete output of |
@brandond I only have Rancher, cert-manager, a GitLab agent, and an nginx deployment running in my cluster. All were deployed to the cluster on the first day I installed RKE2 (2 days ago). The nginx deployment was created on the day I installed RKE2 and is still running now. But now, if I try to create or kill any pod, it gets stuck in a ContainerCreating or Terminating state. The only things I did today were creating a new user with admin rights in Rancher and creating a new project. journalctl -u rke2-server --no-pager > rke2.log is not printing anything out.
|
No, this creates a file that you're supposed to attach to a comment.
I didn't want a directory listing, I was asking you to tar/zip them up and attach them to a comment. |
I can also confirm seeing this issue on a brand new v1.24.6+rke2r1 cluster, and on ones upgraded to 1.24.6. I've managed to reproduce it on CentOS 8 (single node), RHEL 8 (3m/3w, 3m/2w), as well as Ubuntu 20.04.5 LTS (3 all-role nodes). Repro is as simple as:
My troubleshooting shows that, for the CNI token, Calico does have a built-in process to watch this token and update it; however, the container appears not to have the hostPath mount needed to actually update that file on disk (see the sketch after this comment).
I can also see in the kube-apiserver logs, after trying to create a pod, that it logs:
I assume this is the CNI attempting to use the expired token and getting turned away. I've attached the output.
Looking at my other clusters running 1.23, they appear to have the same error from the canal pod; however, the on-disk token is good for a year (which may explain why projectcalico/calico#5712 (comment) fixes the same sort of issue too; this may potentially have always been broken in Calico, but the reduced lifetime of the token has highlighted it).
(Full disclosure if I look familiar: I've been working with SUSE Support on this under case #00364320) |
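To illustrate the kind of mount being described, here is a minimal sketch; it is not taken from the RKE2 canal chart, and the names and paths follow the upstream calico-node DaemonSet convention, so treat them as assumptions:
# Hypothetical DaemonSet excerpt: the calico-node container needs the host's
# CNI config directory mounted so its in-container token watcher can rewrite
# the kubeconfig/token file that the CNI binary reads from disk.
containers:
  - name: calico-node
    volumeMounts:
      - name: cni-net-dir
        mountPath: /host/etc/cni/net.d   # token-bearing kubeconfig lives here
volumes:
  - name: cni-net-dir
    hostPath:
      path: /etc/cni/net.d
Without a mount like this, the token refresh happens only inside the container and the stale file on the host keeps being handed to the CNI binary.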
Just had the same problem and yolo fixed it with:
|
Thanks @rbrtbnfgl, would this fix be included as a backport to v1.24 (or further)? The reason is that v1.25 is currently not supported by Rancher as yet, and may not be for some time. |
I've run into this as well, first while running the latest stable version of RKE2, and then, after a rebuild, on v1.22.15+rke2r1. Both clusters exhibited an inability to create new pods roughly 24 hours after standing them up, as @brownz11 describes. Is there a new version of calico/canal that has been incorporated into the latest patches/versions of RKE2? Restarting the nodes temporarily fixes the issue... |
Adding a second CNI solved the problem. That means whenever the Calico CNI dies, Cilium takes over and we have a working CNI. Even the reaction time of my Rancher GUI has become much faster. |
I am running an RKE2 setup to check if it fixes the issue. Changing the date on the node reproduces the issue, but it doesn't trigger the token renewal. |
While the patches bake, we found a simple workaround, although it feels hacky: schedule a restart of the canal daemonset every 12 hours with cron (a sketch of such a cron entry follows after this comment). Just schedule it on one of the master nodes, and make sure it runs as a user that can read the YAML.
Can also on-demand fix the issue for 24 hours via |
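A minimal sketch of what such a scheduled restart could look like; the DaemonSet name (rke2-canal), the kubeconfig path, and the kubectl path are assumptions based on a default RKE2 canal install:
# Assumed /etc/cron.d entry on one server node (single line, cron does not
# support line continuations): restart the canal DaemonSet every 12 hours so
# its pods pick up a freshly issued service account token.
0 */12 * * * root /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml -n kube-system rollout restart daemonset rke2-canal
The same rollout restart command can also be run by hand for a one-off fix, which is presumably along the lines of the on-demand fix mentioned above.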
Thanks so much! |
@rbrtbnfgl @brandond Is there any SAFE way I can change the CNI plugin to Cilium without messing up my node? I tried to change it before and it messed up my whole cluster and I had to reinstall it again. It is currently using Calico as the CNI, which was installed by default. It would be nice if you could provide me with an example. |
No, we don't support changing the CNI on a running cluster. There is not really any good way to do that. |
@brandond How about before installing the whole RKE2 server? Is there a way I can set the CNI plugin to Cilium during the installation? |
Just create the config file before installing and starting RKE2: https://docs.rke2.io/install/install_options/install_options/#configuration-file
mkdir -p /etc/rancher/rke2
echo "cni: cilium" >> /etc/rancher/rke2/config.yaml
curl -sfL https://get.rke2.io/ | sh -
systemctl enable --now rke2-server.service |
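If useful, a quick way to sanity-check that the cluster actually came up with Cilium; the kubeconfig and kubectl paths below assume a default RKE2 server install:
# Assumed default RKE2 paths; adjust if you relocated the data directory.
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl -n kube-system get pods | grep -i cilium
/var/lib/rancher/rke2/bin/kubectl get nodes -o wide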
@brandond network: |
No, it should look exactly like the example I showed above. |
@brandond Where is the CNI declared here in this case? I cannot see the cni key in this config file. I extracted this config file from the link you sent me above.
write-kubeconfig-mode: "0644"
tls-san:
- "foo.local"
node-label:
- "foo=bar"
- "something=amazing" |
That's not the example. The example is right below that and shows
cni: cilium
That's it. That's all you need to put in there. |
@brandond This is my config from before I installed, to change the CNI. I have 3 masters + 2 workers, so it's also declaring an FQDN for HA and the IP for that HA.
tls-san:
- node1
- node1.local.mydomain.com
- cluster.my domain.com
- YOUR_CLUSTER_IP
disable: rke2-ingress-nginx # this is if you use traefik; if not, just delete it
cni:
- cilium |
You shouldn't need to disable anything or customize the TLS SANs, but you're welcome to do so. Syntactically it looks good. |
Will this be backported to 1.22? We just upgraded multiple clusters from 1.21 to 1.22, and after 24h they stopped working 😢 |
Can we get a list of rke2 versions affected or at least the latest unaffected one? I need to create a few clusters in the near future and I'd rather not have them broken the next day. |
This is the commit in Calico which changed the token lifetime from 1 year (if you have the default flag --service-account-extend-token-expiration in kube-apiserver) to 24h: projectcalico/calico@2b3469b
The only weird thing is that RKE2 1.22 is not using Calico 3.24, where this was changed? |
The issue is from Calico 3.23, which was updated in the latest versions. |
@rbrtbnfgl What would for now be the most efficient way to solve this problem temporarily? I don't want my containers to break after 24 hours, since I have important containers running that must not go down. |
Ah yeah, now I found the commit in 3.23 as well: projectcalico/calico@34e7fec |
#3425 (comment): this could be the efficient solution until the new release is out |
I will also note that we had added similar token handling code for our Calico-on-Windows support a while back, as the approach used by upstream was identified as problematic: |
Yes. There will be an r2 of 1.22.15 with this fix. |
We also just hit this after upgrading to v1.22.15+rke2r1 last night. Thank goodness for that workaround. |
Validated on master branch using commit id 532aed3
Environment Details
Infrastructure
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
Config.yaml:
Steps to reproduce the issue and validate the fix
Issue: Pod stuck in ContainerCreating state
Validation: Pod was created successfully with the fix
Validation: After upgrade pod was created successfully with the fix
|
Environmental Info:
RKE2 Version:
v1.24.6+rke2r1
Node(s) CPU architecture, OS, and Version:
One node with roles control-plane, etcd, master; CPU: 16 cores
Cluster Configuration:
one RKE2 server running and in a Ready status
Describe the bug:
Everything was running well until now, but suddenly, after 2 days, every pod I create is stuck in ContainerCreating status
Steps To Reproduce:
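A minimal sketch of the reproduction, based on the description in this report and elsewhere in the thread; the install command, paths, and pod name are illustrative assumptions:
# Install RKE2 with the default canal CNI (version pinned to the one reported
# above), then let the node run past the ~24h service account token lifetime.
curl -sfL https://get.rke2.io/ | INSTALL_RKE2_VERSION=v1.24.6+rke2r1 sh -
systemctl enable --now rke2-server.service
# ...roughly a day or two later, try to schedule a new pod:
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
/var/lib/rancher/rke2/bin/kubectl run repro-nginx --image=nginx
/var/lib/rancher/rke2/bin/kubectl get pod repro-nginx   # stays in ContainerCreating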
Expected behavior:
Actual behavior:
Every newly created pod is stuck in ContainerCreating status
Additional context / logs: