What happened: We updated our AMI to a newer version, and nodes now start flapping between Ready and NotReady after a few hours. We also see network issues such as failed liveness probes and DNS resolution problems.
How to reproduce it (as minimally and precisely as possible): I don't have a reproduction recipe. The cluster had no issues before; the only change was the AMI version, after which the issues appeared.
Anything else we need to know?:
Our investigation shows something that looks like a connection leak in the system.
Socket count constantly increases until it breaks the node. See screenshot:
Socket count in the pods is regular, no change from the previous AMI
Restarting the docker service on the host drops the socket count to a normal value, but it immediately starts to grow again.
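For anyone who wants to watch the same symptom on their own nodes, here is a minimal sketch of how the host-level socket count could be sampled. It assumes the usual Linux `/proc/net/sockstat` layout (a `sockets: used N` line followed by per-protocol name/value pairs). The helper names `parse_sockstat` and `sockets_used` are made up for this example and are not part of any tool mentioned in this issue.

```python
from pathlib import Path
import time


def parse_sockstat(text):
    """Parse /proc/net/sockstat-style text into {section: {field: int}}."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or unexpected lines
        section, _, rest = line.partition(":")
        fields = rest.split()
        # Fields come as name/value pairs, e.g. "inuse 7 orphan 0".
        stats[section.strip()] = {
            name: int(value) for name, value in zip(fields[0::2], fields[1::2])
        }
    return stats


def sockets_used(path="/proc/net/sockstat"):
    """Return the 'sockets: used' counter for the current network namespace."""
    return parse_sockstat(Path(path).read_text())["sockets"]["used"]


if __name__ == "__main__":
    # Print one sample per minute; a steadily rising number on the host
    # matches the behaviour described above.
    while True:
        print(time.strftime("%H:%M:%S"), sockets_used())
        time.sleep(60)
```

Run on the host, this reports the host network namespace; run inside a pod, it reports that pod's namespace, which is one way to compare the two counts described above.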
Environment:
AWS Region: us-east-1
Instance Type(s):
EKS Platform version: eks.11
Kubernetes version: 1.21
AMI Version:
Kernel: 5.4.217-126.408.amzn2.x86_64 #1 SMP Fri Oct 14 17:08:46 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Container runtime version: Docker 20.10.17
Release information
BASE_AMI_ID="ami-02dd04850de58599e"
BUILD_TIME="Mon Sep 26 21:55:27 UTC 2022"
BUILD_KERNEL="5.4.209-116.367.amzn2.x86_64"
ARCH="x86_64"
AMI without the issue:
Kernel: 5.4.209-116.363.amzn2.x86_64 #1 SMP Wed Aug 10 21:19:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Release information
BASE_AMI_ID="ami-0f51d4d93d5bee36b"
BUILD_TIME="Wed Aug 24 01:01:12 UTC 2022"
BUILD_KERNEL="5.4.209-116.363.amzn2.x86_64"
ARCH="x86_64"
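The two AMIs above run different kernel builds, so a quick way to tell which one a given node is on is to compare its kernel release string against the two values from this report. The sketch below does only that; the function name `classify_kernel` and the labels it returns are illustrative assumptions, and it makes no claim about any kernel build other than the two listed here.

```python
import platform

# Kernel release strings copied from the environment details in this report.
KERNEL_WITH_ISSUE = "5.4.217-126.408.amzn2.x86_64"
KERNEL_WITHOUT_ISSUE = "5.4.209-116.363.amzn2.x86_64"


def classify_kernel(release):
    """Label a kernel release string relative to the two builds in this report."""
    if release == KERNEL_WITH_ISSUE:
        return "reported-with-issue"
    if release == KERNEL_WITHOUT_ISSUE:
        return "reported-without-issue"
    return "not-covered-by-this-report"


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r`.
    running = platform.release()
    print(running, "->", classify_kernel(running))
```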
There's some more info in #1071, but we've recalled the AMI that included this kernel version. We have a growing number of reports of instability. We're working with Amazon Linux to address the kernel issue, and we've temporarily pinned the kernel in this AMI to the previous, known-stable version (#1072). An AMI release including the pinned kernel is working through our release pipeline now. I'm going to close this issue as a duplicate; we'll add more information to #1071 when we have it. Sorry for the hassle!
Thanks for the answer @cartermckinnon, and sorry for the duplicate! I was aware of #1071, but since I wasn't sure whether it was the same issue, I decided to create a new one.