EKS node on ARM64 is reported ready before its corresponding CSR is signed. #1944
Comments
Small correction: apparently during scale-up events it happens on the amd64 node types too.
How much time are we talking about? We'd probably need to look at your control plane logs to see what happened, but my guess would be that the event queue of the certificate controller is getting backed up when many CSRs are created in a short period of time. If you want to open a case with AWS Support and provide your cluster ARN, we can look into it.
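A rough way to gauge whether approvals are falling behind during a scale-up is to sort the CSRs by creation time and compare their age against their condition (a minimal sketch, nothing EKS-specific assumed):

```shell
# Sort CSRs oldest-first; the CONDITION column shows which are still Pending
# versus Approved,Issued, and AGE shows how long each has been waiting.
kubectl get csr --sort-by=.metadata.creationTimestamp
```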
Yeah, I figured as much. It only happens under load, when nodes are provisioned simultaneously. PS: we resolved the symptom by scaling up less aggressively, but the root cause remains. Under scale-up conditions there is a window when the node is ready and pods are scheduled and running, but you can't do anything with them, because kubectl and other tools (buildx, for example) can't reach them.
For the serving cert (what you're seeing) I think there's an argument either way. If your workloads depend on kubelet's HTTP server (which is not common at all), then it would make sense to delay pod scheduling until that CSR is approved. But in most cases, kubelet can go ahead and start doing useful work (pulling your container images, getting pods running) while it waits for the CSR to be approved. The consensus here was that this shouldn't be part of node readiness by default: kubernetes/kubernetes#73047. I'd support an optional toggle for it if you want to reboot the discussion 👍
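For reference, the requests being discussed are the kubelet serving-cert CSRs (signer kubernetes.io/kubelet-serving), as opposed to the kubelet client-cert requests. A minimal sketch of how to spot one and, if needed, approve it by hand (the CSR name below is a placeholder):

```shell
# Kubelet serving-cert CSRs use the kubernetes.io/kubelet-serving signer; these
# are the ones that can sit Pending while the node already reports Ready.
kubectl get csr | grep kubernetes.io/kubelet-serving

# Approving a stuck request manually should clear the "tls: internal error"
# for that node without waiting for the controller to catch up.
kubectl certificate approve csr-example   # placeholder CSR name
```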
Ah, sounds like it's a non-issue then. I don't want to get into lengthy discussions of what constitutes a ready node; it's way above my pay grade :)
What happened:
When provisioning many nodes at the same time on EKS 1.29, specifically with the ARM64 node image on AL2023 (latest), sometimes a node comes up and reports its status as Ready before its CSR is signed and approved. This results in a situation where pods are running on the node but one cannot run
kubectl logs <POD_NAME>
which returns `tls: internal error`.
The condition goes away by itself after a bit.
The below commands were run simultaneously:
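Roughly, the checks looked like the following (node, pod, and CSR names here are placeholders, not the original output):

```shell
# The node already reports Ready...
kubectl get node ip-10-0-0-1.ec2.internal   # placeholder node name; STATUS is Ready

# ...while its kubelet serving CSR is still Pending...
kubectl get csr | grep kubernetes.io/kubelet-serving | grep -i pending

# ...and reaching the kubelet fails in the meantime.
kubectl logs <POD_NAME>   # returns "tls: internal error"
```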
After a while the CSR is finally approved and we can run kubectl commands against pods on this node:
What you expected to happen:
The node should not report Ready status until we can run kubectl commands against it.
How to reproduce it (as minimally and precisely as possible):
Spin up multiple nodes (20 should be enough) at the same time, and watch their Ready status while simultaneously watching their CSRs. There will be a short time window where the nodes report Ready but their CSRs are still pending; a sketch of such a watch follows.
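A minimal sketch of such a watch (the two-second interval is arbitrary):

```shell
# Watch node readiness and pending CSRs side by side during the scale-up.
watch -n 2 'kubectl get nodes; echo; kubectl get csr | grep -i pending'
```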
Anything else we need to know?:
Environment: