Some Bottlerocket nodes stuck "NotReady" #3076
Comments
Thanks for the report @shay-ul! That's an odd failure case. Is there anything different about the nodes that get stuck in "NotReady"? Are all of the nodes being launched with the same instance type? What instance type are they? Also, if you're able to get onto any of the nodes, it would be helpful to jump on and look for any offending messages in the journal. Just in case you hadn't seen it, the …
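For reference, a rough sketch of one way to pull the journal from a Bottlerocket node (assuming SSM access and that the admin container is enabled; the instance ID is a placeholder):

```sh
# Open a session to the node (instance ID is a placeholder)
aws ssm start-session --target i-0123456789abcdef0

# From the control container, drop into the admin container,
# then escalate to a root shell on the host
enter-admin-container
sudo sheltie

# Look for kubelet/containerd errors around the time the node joined
journalctl -u kubelet.service --no-pager | tail -n 200
journalctl -u containerd.service --no-pager | tail -n 200
```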
@zmrow The failure is not limited to a specific instance type. In the example above, there's a … I couldn't find more interesting logs in the journal. I can download the full journal and upload it to a support case if that would be of any help.
Thanks! I missed asking yesterday - which version of Bottlerocket are you using? 1.13.4 had a runc issue causing high memory utilization; 1.13.5 was cut, which reverted the runc change and fixed the issue.
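One quick way to check which Bottlerocket version the affected nodes are running (a sketch, not from the original thread):

```sh
# The OS-IMAGE column shows e.g. "Bottlerocket OS 1.13.x (aws-k8s-1.25)"
kubectl get nodes -o wide

# Or read it straight from each node object
kubectl get nodes -o custom-columns=NAME:.metadata.name,OS:.status.nodeInfo.osImage
```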
Update: we are currently investigating possible network/routing/security-group related issues, since we figured out that whenever we have such a NotReady node, it is always provisioned with the same IP address and host name.
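A sketch of how that IP/hostname correlation could be checked across the NotReady nodes (standard Kubernetes node fields, not from the original report):

```sh
# List NotReady nodes along with their internal IPs to spot the repeating address
kubectl get nodes -o wide | awk '$2 == "NotReady"'

# Or print just node name + InternalIP for every node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'
```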
Image I'm using:
v1.25.6-eks-232056e
What I expected to happen:
During peak hours of our environment, Karpenter provisions many Bottlerocket nodes, and deletes them when the workload scales down.
Most of the Bottlerocket nodes reach a "Ready" state quickly. However, every once in a while, a newly provisioned node gets stuck in "NotReady".
What actually happened:
When describing the node, we can see that the main issue is `container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized`. Here is the full description of such a node:

kubectl describe node
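For reference, a hedged sketch of the commands used here (the node name is a placeholder):

```sh
# Full node description, including Conditions and Events
kubectl describe node ip-10-0-0-1.ec2.internal

# Or just the node conditions, which surface the NetworkPluginNotReady reason
kubectl get node ip-10-0-0-1.ec2.internal \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
```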
When exploring further, I found out that the pod that should initialize the VPC CNI plugin (the `aws-node` pod) is stuck "Pending" due to `1 Insufficient memory.`:

kubectl describe pod aws-node
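A sketch of digging into why the scheduler reports insufficient memory for this pod (the pod and node names are placeholders):

```sh
# Scheduler events for the pending pod ("... 1 Insufficient memory.")
kubectl describe pod -n kube-system aws-node-abcde

# Compare the pod's memory request against what the node actually has allocatable
kubectl get pod -n kube-system aws-node-abcde \
  -o jsonpath='{.spec.containers[*].resources.requests.memory}{"\n"}'
kubectl get node ip-10-0-0-1.ec2.internal \
  -o jsonpath='{.status.allocatable.memory}{"\n"}'
```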
As you can see, it's hard to figure out why the default-scheduler reports "Insufficient memory", since there is no indication of such an issue in the node's description. It's important to note that more daemonset pods are stuck in the same state - `kube-proxy` and `ebs-csi-node`. However, `efs-csi-node` and the Datadog agent appear to be in the "Running" state.
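One way to see whether the node is advertising less allocatable memory than the daemonset pods collectively request (a sketch; the node name is a placeholder):

```sh
# Capacity, allocatable, and the requests already accounted for on the node
kubectl describe node ip-10-0-0-1.ec2.internal | grep -A 7 "Allocatable:"
kubectl describe node ip-10-0-0-1.ec2.internal | grep -A 10 "Allocated resources:"

# Which daemonset pods on the node are Pending vs Running
kubectl get pods -A -o wide --field-selector spec.nodeName=ip-10-0-0-1.ec2.internal
```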
How to reproduce the problem:
No idea.
Can you please help me figure out which log/metric I should look for?
Thanks!