EKS node on ARM64 is reported ready before its corresponding CSR is signed. #1944

dcherniv · 2024-08-31T22:13:50Z

What happened:
When provisioning many nodes at the same time on EKS 1.29, specifically with the ARM64 node image on AL2023 (latest), a node sometimes comes up and reports its status as Ready before its CSR is signed and approved. As a result, pods are running on the node, but `kubectl logs <POD_NAME>` fails with a `tls: internal error`.
The condition goes away after a bit by itself.
The below commands were run simultaneously:

dcherniv@lildebbie:~/$ kubectl get nodes
NAME                          STATUS     ROLES    AGE     VERSION
[...]
ip-10-8-12-74.ec2.internal    Ready      <none>   40s     v1.29.6-eks-1552ad0
[...]

dcherniv@lildebbie:~/$ for i in `kubectl get nodes | awk '{print $1}' | grep -v NAME` ; do kubectl get csr | grep $i ; done
[...]
csr-q4dm7   44s     kubernetes.io/kubelet-serving   system:node:ip-10-8-12-74.ec2.internal    <none>              Pending
[...]

After a while the CSR is finally approved and we can run kubectl commands against pods on this node:

dcherniv@lildebbie:~/$ kubectl get csr | grep 12-74
csr-q4dm7   74s     kubernetes.io/kubelet-serving   system:node:ip-10-8-12-74.ec2.internal    <none>              Approved,Issued

What you expected to happen:
The node should not post ready status until we can run kubectl commands against it.

How to reproduce it (as minimally and precisely as possible):
Spin up multiple nodes (20 should be enough) at the same time, and watch their Ready status while simultaneously watching their CSRs. There will be a short window in which the nodes report Ready but their CSRs are still Pending.
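For anyone trying to catch that window, here is a minimal sketch that cross-references the two listings in one pass instead of grepping per node. The function name `ready_nodes_with_pending_csr` and the file names are made up for illustration, and it assumes the default column layout of `kubectl get nodes` and `kubectl get csr` as seen in the output above. It is not an official tool.

```shell
# Sketch: print "<node> <csr>" for every Ready node whose kubelet-serving
# CSR is still Pending. Assumes default kubectl column order:
#   nodes: NAME STATUS ROLES AGE VERSION
#   csr:   NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
ready_nodes_with_pending_csr() {
  nodes_file=$1   # saved output of: kubectl get nodes --no-headers
  csr_file=$2     # saved output of: kubectl get csr --no-headers
  awk '
    # First file: remember nodes whose STATUS is exactly "Ready".
    NR == FNR { if ($2 == "Ready") ready[$1] = 1; next }
    # Second file: serving CSRs that are still Pending.
    $3 == "kubernetes.io/kubelet-serving" && $NF == "Pending" {
      node = $4
      sub(/^system:node:/, "", node)   # requestor is system:node:<node name>
      if (node in ready) print node, $1
    }
  ' "$nodes_file" "$csr_file"
}

# Usage against a live cluster (snapshot both listings, then compare):
#   kubectl get nodes --no-headers > nodes.txt
#   kubectl get csr --no-headers > csr.txt
#   ready_nodes_with_pending_csr nodes.txt csr.txt
```

Nodes with a compound status such as `Ready,SchedulingDisabled` are not matched by this sketch; loosen the comparison if those matter to you.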

Anything else we need to know?:

Environment:

- AWS Region: us-east-1
- Instance Type(s): m6g.2xlarge
- EKS Platform version (use `aws eks describe-cluster --name <name> --query cluster.platformVersion`): eks.10
- Kubernetes version (use `aws eks describe-cluster --name <name> --query cluster.version`): 1.29
- AMI Version: amazon-eks-node-al2023-arm64-standard-1.29-v20240828


dcherniv · 2024-08-31T22:46:59Z

Small correction: apparently, during a scale-up event, it happens on amd64 node types too.

ip-10-8-13-89.ec2.internal    Ready      <none>   13s     v1.29.6-eks-1552ad0
csr-txknw   21s     kubernetes.io/kubelet-serving   system:node:ip-10-8-13-89.ec2.internal    <none>              Pending

cartermckinnon · 2024-09-06T19:12:31Z

> After a while the CSR is finally approved

How much time are we talking about?

We'd probably need to look at your control plane logs to see what happened, but my guess would be that the event queue of the certificate controller is getting backed up when many CSRs are created in a short period of time. If you want to open a case with AWS Support and provide your cluster ARN, we can look into it.

dcherniv · 2024-09-06T23:15:37Z

Yeah, I figured as much. It only happens under load, when nodes are provisioned simultaneously.
There is a 20-30 second delay between the node being reported Ready and its CSR being signed.
I can create a case for sure. But it seems there is a fundamental problem here. Should node report itself as ready before its CSR is approved?

Oh, PS: we resolved the symptom by scaling up less aggressively. But the root cause remains: during a scale-up there will be a period when the node is ready and pods are scheduled and running, but you can't do anything with them because kubectl and other tools (buildx, for example) can't reach them.
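Another way to live with the window, besides scaling up more slowly, is to retry the failing call until the serving certificate has been issued. The helper below is a generic sketch; `retry_with_delay` is a made-up name, and the attempt count and delay are assumptions sized to the 20-30 second delay reported above, not tested recommendations.

```shell
# Sketch: run a command until it succeeds or the attempts run out.
# Usage: retry_with_delay <max_attempts> <delay_seconds> <command> [args...]
retry_with_delay() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Success: stop retrying immediately.
    "$@" && return 0
    # Give up once the last allowed attempt has failed.
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example with a hypothetical pod name: up to 8 tries, 5 seconds apart,
# which covers roughly 35 seconds of "tls: internal error" responses.
#   retry_with_delay 8 5 kubectl logs my-pod
```

This retries on any failure, not only on the TLS error, so a command that fails for an unrelated reason also consumes all attempts before returning.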

cartermckinnon · 2024-09-09T19:27:21Z

> Should node report itself as ready before its CSR is approved?

For the serving cert (what you're seeing) I think there's an argument either way. If your workloads depend on kubelet's HTTP server (which is not common at all), then it would make sense to delay pod scheduling until that CSR is approved. But in most cases, kubelet can go ahead and start doing useful work (pulling your container images, getting pods running) while it waits for the CSR to be approved.

The consensus here was that this shouldn't be part of node readiness by default: kubernetes/kubernetes#73047

I'd support an optional toggle for it if you want to reboot the discussion 👍

dcherniv · 2024-09-09T20:34:12Z

Ah, sounds like it's a non-issue then. I don't want to get into a lengthy discussion of what constitutes a ready node. It's way above my paygrade :)
Closing this with WONTFIX.

dcherniv mentioned this issue Aug 31, 2024

buildx kubernetes driver sometimes returns ERROR: error dialing backend: remote error: tls: internal error docker/buildx#2668

dcherniv mentioned this issue Sep 4, 2024

Failed to scrape node: remote error: tls: internal error kubernetes-sigs/metrics-server#1480

dcherniv closed this as not planned on Sep 9, 2024
