Socket leak in Docker service #1078

kanor1306 · 2022-11-01T14:01:37Z

What happened: We updated our AMI to a new version, and now nodes start to flap between Ready and NotReady after a few hours. We also see network issues such as failed liveness probes and DNS resolution problems.

How to reproduce it (as minimally and precisely as possible): I don't have a reproduction recipe. The cluster had no issues before; the only change was the AMI version, after which the issues appeared.

Anything else we need to know?:
Our investigation shows something that looks like a connection leak in the system.

The socket count on the host increases constantly until it breaks the node. See screenshot:

The socket count inside the pods is normal, with no change from the previous AMI.

Restarting the Docker service on the host drops the socket count to a normal value, but it immediately starts to grow again.
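For anyone who wants to check whether a node shows the same growth, here is a minimal Python sketch that reads the host-wide socket count from `/proc/net/sockstat`. It is not the tooling used for the screenshot above; the function names `parse_sockstat` and `sample_socket_counts` are made up for illustration, and the sampling interval is an arbitrary choice.

```python
import time
from pathlib import Path


def parse_sockstat(text):
    """Parse /proc/net/sockstat content into {section: {field: value}}."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        section, _, rest = line.partition(":")
        tokens = rest.split()
        # Tokens alternate name, value: "inuse 12 orphan 0 tw 3 ..."
        stats[section.strip()] = {
            name: int(value) for name, value in zip(tokens[::2], tokens[1::2])
        }
    return stats


def sample_socket_counts(samples=5, interval_s=60, path="/proc/net/sockstat"):
    """Return a list of (unix_time, sockets_used) pairs (hypothetical helper)."""
    readings = []
    for i in range(samples):
        stats = parse_sockstat(Path(path).read_text())
        readings.append((time.time(), stats["sockets"]["used"]))
        if i < samples - 1:
            time.sleep(interval_s)
    return readings
```

Running `sample_socket_counts()` on the host (not inside a pod) before and after restarting Docker should show whether the count drops and then climbs again, as described above.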

Environment:

AWS Region: us-east-1
Instance Type(s):
EKS Platform version: eks.11
Kubernetes version: 1.21
AMI Version:
Kernel: 5.4.217-126.408.amzn2.x86_64 #1 SMP Fri Oct 14 17:08:46 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Container runtime version: Docker 20.10.17
Release information

BASE_AMI_ID="ami-02dd04850de58599e"
BUILD_TIME="Mon Sep 26 21:55:27 UTC 2022"
BUILD_KERNEL="5.4.209-116.367.amzn2.x86_64"
ARCH="x86_64"

AMI without the issue:

Kernel: 5.4.209-116.363.amzn2.x86_64 #1 SMP Wed Aug 10 21:19:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Release information

BASE_AMI_ID="ami-0f51d4d93d5bee36b"
BUILD_TIME="Wed Aug 24 01:01:12 UTC 2022"
BUILD_KERNEL="5.4.209-116.363.amzn2.x86_64"
ARCH="x86_64"
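In the details above, the affected node's running kernel (5.4.217-126.408) differs from the `BUILD_KERNEL` recorded in its release information, while the two values match on the AMI without the issue. A rough Python sketch for checking this on a node follows. It takes the release information as text, since where that file lives is not stated here; `parse_release_info` and `kernel_matches_build` are hypothetical helper names.

```python
import platform


def parse_release_info(text):
    """Parse KEY="value" lines (as in the release information) into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            info[key.strip()] = value.strip().strip('"')
    return info


def kernel_matches_build(release_text, running_kernel=None):
    """Return True if the running kernel equals BUILD_KERNEL.

    running_kernel defaults to platform.release(), which reports the
    same string as `uname -r` on Linux.
    """
    running = running_kernel or platform.release()
    return parse_release_info(release_text).get("BUILD_KERNEL") == running
```

A mismatch only says that the kernel package changed after the AMI was built; it does not by itself identify which kernel is at fault.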

cartermckinnon · 2022-11-02T07:22:31Z

There's some more info in #1071, but we've recalled the AMI that included this kernel version. We have a growing number of reports of instability. We're working with Amazon Linux to address the kernel issue, and we've temporarily pinned the kernel in this AMI to the previous, known-stable version (#1072). An AMI release including the pinned kernel is working through our release pipeline now. I'm going to close this issue as a duplicate; we'll add more information to #1071 when we have it. Sorry for the hassle!

kanor1306 · 2022-11-02T09:48:42Z

Thanks for the answer @cartermckinnon, and sorry for the duplicate! I was aware of #1071, but since I wasn't sure whether it was the same issue, I decided to create a new one.

cartermckinnon closed this as not planned on Nov 2, 2022

cartermckinnon added the duplicate label on Nov 2, 2022
