EKS node on ARM64 is reported ready before its corresponding CSR is signed. #1944

Closed
dcherniv opened this issue Aug 31, 2024 · 5 comments

@dcherniv
dcherniv commented Aug 31, 2024

What happened:
When provisioning many nodes at the same time on EKS 1.29, specifically with the ARM64 node image on AL2023 (latest), a node sometimes comes up and reports its status as Ready before its CSR is signed and approved. The pods are then running on the node, but kubectl logs <POD_NAME> fails with a tls internal error.
The condition goes away after a bit by itself.
The below commands were run simultaneously:

dcherniv@lildebbie:~/$ kubectl get nodes
NAME                          STATUS     ROLES    AGE     VERSION
[...]
ip-10-8-12-74.ec2.internal    Ready      <none>   40s     v1.29.6-eks-1552ad0
[...]

dcherniv@lildebbie:~/$ for i in `kubectl get nodes | awk '{print $1}' | grep -v NAME` ; do kubectl get csr | grep $i ; done
[...]
csr-q4dm7   44s     kubernetes.io/kubelet-serving   system:node:ip-10-8-12-74.ec2.internal    <none>              Pending
[...]

After a while the CSR is finally approved and we can run kubectl commands against pods on this node:

dcherniv@lildebbie:~/$ kubectl get csr | grep 12-74
csr-q4dm7   74s     kubernetes.io/kubelet-serving   system:node:ip-10-8-12-74.ec2.internal    <none>              Approved,Issued

What you expected to happen:
The node should not post ready status until we can run kubectl commands against it.

How to reproduce it (as minimally and precisely as possible):
Spin up multiple nodes at the same time (20 should be enough) and watch their Ready status while simultaneously watching their CSRs. There will be a short window in which the nodes report Ready but their CSRs are still Pending.
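
The reproduction steps above can be automated with a small script. The following is a minimal sketch, not part of the original report: it assumes the default column layout of `kubectl get nodes` and `kubectl get csr` as shown in the transcripts above, and the helper names (`parse_rows`, `ready_without_serving_cert`, `kubectl_output`) are hypothetical. It reports nodes that are Ready while a kubelet-serving CSR requested by that node is still Pending.

```python
import subprocess

SERVING_SIGNER = "kubernetes.io/kubelet-serving"
REQUESTOR_PREFIX = "system:node:"


def parse_rows(text):
    """Split kubectl table output into field lists, skipping headers and blank lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] != "NAME":
            rows.append(fields)
    return rows


def ready_nodes(nodes_text):
    """Names of nodes whose STATUS column contains Ready (not NotReady)."""
    return {
        f[0]
        for f in parse_rows(nodes_text)
        if len(f) >= 2 and "Ready" in f[1].split(",")
    }


def ready_without_serving_cert(nodes_text, csr_text):
    """Map each Ready node to the names of its still-Pending kubelet-serving CSRs."""
    ready = ready_nodes(nodes_text)
    result = {}
    for f in parse_rows(csr_text):
        # Assumed columns: NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
        if len(f) < 6 or f[2] != SERVING_SIGNER or f[-1] != "Pending":
            continue
        if not f[3].startswith(REQUESTOR_PREFIX):
            continue
        node = f[3][len(REQUESTOR_PREFIX):]
        if node in ready:
            result.setdefault(node, []).append(f[0])
    return result


def kubectl_output(*args):
    """Run kubectl with the given arguments and return its stdout (requires cluster access)."""
    return subprocess.run(
        ["kubectl", *args], capture_output=True, text=True, check=True
    ).stdout
```

Calling `ready_without_serving_cert(kubectl_output("get", "nodes"), kubectl_output("get", "csr"))` in a loop during a scale-up should surface the window described above. One limitation of this sketch: a node that already holds an issued serving certificate and has a newer Pending CSR would also be flagged.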

Anything else we need to know?:

Environment:

- AWS Region: us-east-1
- Instance Type(s): m6g.2xlarge
- EKS Platform version (use `aws eks describe-cluster --name <name> --query cluster.platformVersion`): eks.10
- Kubernetes version (use `aws eks describe-cluster --name <name> --query cluster.version`): 1.29
- AMI Version: amazon-eks-node-al2023-arm64-standard-1.29-v20240828
@dcherniv
Author

Small correction: apparently it happens on amd64 node types too during a scale-up event.

ip-10-8-13-89.ec2.internal    Ready      <none>   13s     v1.29.6-eks-1552ad0
csr-txknw   21s     kubernetes.io/kubelet-serving   system:node:ip-10-8-13-89.ec2.internal    <none>              Pending

@cartermckinnon
Member

> After a while the CSR is finally approved

How much time are we talking about?

We'd probably need to look at your control plane logs to see what happened, but my guess would be that the event queue of the certificate controller is getting backed up when many CSRs are created in a short period of time. If you want to open a case with AWS Support and provide your cluster ARN, we can look into it.

@dcherniv
Author

dcherniv commented Sep 6, 2024

Yeah, I figured as much. It only happens under load, when many nodes are provisioned simultaneously.
The delay is 20-30 seconds between the node being reported Ready and its CSR being signed.
I can create a case for sure. But there seems to be a fundamental problem here: should a node report itself as ready before its CSR is approved?

PS: we resolved the symptom by scaling up less aggressively. But the root cause remains: during a scale-up there is a window in which a node is ready and pods are scheduled and running, but you can't do anything with them because kubectl and other tools (buildx, for example) can't reach them.
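
For tooling that cannot tolerate that window, one client-side option is to wait until the node's serving CSR shows as issued before talking to its pods. This is a rough sketch only, not something proposed in the thread: the function names are hypothetical, the 120-second timeout and 5-second interval are arbitrary assumptions (the delay reported here was 20-30 seconds), and the CSR check assumes the `kubectl get csr` column layout shown in the transcripts above.

```python
import time

SERVING_SIGNER = "kubernetes.io/kubelet-serving"


def serving_cert_issued(node, csr_text):
    """True if csr_text lists a kubelet-serving CSR for `node` whose condition includes Issued."""
    for line in csr_text.splitlines():
        f = line.split()
        # Assumed columns: NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
        if (
            len(f) >= 6
            and f[2] == SERVING_SIGNER
            and f[3] == "system:node:" + node
            and "Issued" in f[-1].split(",")
        ):
            return True
    return False


def wait_until(check, timeout_s=120.0, interval_s=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_s elapses; return whether it succeeded."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass a `check` that fetches fresh `kubectl get csr` output and feeds it to `serving_cert_issued`. The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.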

@cartermckinnon
Member

> Should node report itself as ready before its CSR is approved?

For the serving cert (what you're seeing) I think there's an argument either way. If your workloads depend on kubelet's HTTP server (which is not common at all), then it would make sense to delay pod scheduling until that CSR is approved. But in most cases, kubelet can go ahead and start doing useful work (pulling your container images, getting pods running) while it waits for the CSR to be approved.

The consensus here was that this shouldn't be part of node readiness by default: kubernetes/kubernetes#73047

I'd support an optional toggle for it if you want to reboot the discussion 👍

@dcherniv
Author

dcherniv commented Sep 9, 2024

Ah, sounds like it's a non-issue then. I don't want to get into lengthy discussions of what constitutes a ready node. It's way above my paygrade :)
Closing this with WONTFIX.

@dcherniv closed this as not planned on Sep 9, 2024