Unresponsive/unreachable Bottlerocket EKS nodes #4075

Open · Pluies opened this issue Jun 26, 2024 · 33 comments
Labels
status/needs-info (Further information is requested), status/needs-triage (Pending triage or re-evaluation), type/bug (Something isn't working)

Comments

@Pluies

Pluies commented Jun 26, 2024

Hey folks,

Coming to you with an odd issue with Bottlerocket EKS nodes becoming unresponsive and unreachable, but still happily running as seen by EC2.

Image I'm using:

ami-089e696e7c541c61b (amazon/bottlerocket-aws-k8s-1.29-aarch64-v1.20.2-536d69d0)

System Info:
  Kernel Version:             6.1.90
  OS Image:                   Bottlerocket OS 1.20.2 (aws-k8s-1.29)
  Operating System:           linux
  Architecture:               arm64
  Container Runtime Version:  containerd://1.6.31+bottlerocket
  Kubelet Version:            v1.29.1-eks-61c0bbb
  Kube-Proxy Version:         v1.29.1-eks-61c0bbb

We use Karpenter as a node provisioner.

What I expected to happen:

Node runs smoothly.

What actually happened:

A large node (m7g.8xlarge) is running a memory-heavy JVM workload from a StatefulSet that takes up most of the machine (request=115Gi out of 128Gi).

Node runs happily for a while (~ 24h to several days), until it suddenly stops reporting status to Kubernetes:

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Tue, 25 Jun 2024 15:18:44 +0200   Tue, 25 Jun 2024 15:20:38 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Tue, 25 Jun 2024 15:18:44 +0200   Tue, 25 Jun 2024 15:20:38 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Tue, 25 Jun 2024 15:18:44 +0200   Tue, 25 Jun 2024 15:20:38 +0200   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Tue, 25 Jun 2024 15:18:44 +0200   Tue, 25 Jun 2024 15:20:38 +0200   NodeStatusUnknown   Kubelet stopped posting node status.

Kubernetes marks that node as NotReady and taints it with node.kubernetes.io/unreachable, then tries to delete the pod running our large workload. That doesn't work (the node is fully unresponsive), so that pod is stuck in status Terminating.

The node is unreachable via kubectl debug node ... (which times out), or via AWS SSM (which complains of "SSM Agent is not online", "Ping status: Connection lost"). The EC2 Serial Console is empty (all black). We do not have SSH enabled on these machines.

However, the node still appears as running from the EC2 console, reachability metrics are green, and we can see CPU / network metrics flowing in.

I've managed to get system logs by taking an EBS snapshot of the Bottlerocket data volume and restoring it to a separate volume for investigation. Unfortunately this was not helpful: logs appear normal until the time the node dies, then suddenly stop. There is no indication that anything in particular (kubelet, containerd, etc.) crashed and brought the instance down, but also, suspiciously, no logs at all from the moment the node went unresponsive.
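
(For anyone needing to do the same, the rough shape of that snapshot-and-inspect workflow is sketched below; every ID, the availability zone, the device name, and the journal path on the data volume are placeholders and may differ in your setup.)

# Snapshot the dead node's data volume and restore it onto a helper instance.
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
  --description "bottlerocket data volume from unresponsive node"
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone eu-west-1b
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 \
  --instance-id i-0123456789abcdef0 --device /dev/sdf

# On the helper instance: mount the restored volume read-only and read the persisted journal.
sudo mount -o ro /dev/nvme1n1p1 /mnt/br-data
sudo journalctl --directory /mnt/br-data/var/log/journal --no-pager | tail -n 200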

How to reproduce the problem:

No clear way to reproduce unfortunately; we've seen this happen sporadically on maybe half a dozen instances over the past few weeks, out of several hundred.

This issue is very annoying: I don't mind having pods crashing and/or getting evicted sometimes, or even kubelet/containerd crashing, but I'd expect it to self-heal eventually. This causes our workloads to get stuck, and we have to manually delete the pod and/or the node to get it back to normal. But even worse, I can't see a way to debug it properly or get to the bottom of it.

Would you have any idea of a better way to debug this?

Thank you!

(Note: I also opened an AWS support ticket to see if there's any AWS-level issue at play here, but this seems to happen only on this specific workload on Bottlerocket nodes, so I suspect something is off here)

Pluies added the status/needs-triage and type/bug labels on Jun 26, 2024
@yeazelm
Contributor

yeazelm commented Jun 26, 2024

Thanks for opening this issue @Pluies. This does look concerning and we would like to get to the bottom of this! I'll look internally for that support ticket to see if we can find out what is happening.

@Pluies
Author

Pluies commented Jun 26, 2024

Okay, I've got good news - after 13 days of being in this state, one node came back to life! Well, kind of. We received a NodeClockNotSynchronising alert, and realised the prometheus node exporter started responding to scrapes again (???).

Apart from that, no change. The node is still showing as NotReady, and is still not reachable via kubectl debug or AWS SSM, etc.

I've taken a new snapshot of the bottlerocket data volume, and I'm seeing new logs in the journal though; attaching it here.
journal-after-rebirth.log

I think this has the meat of it (kernel messages + OOM kills).

There are some more messages later showing kubelet errors; I assume the node is so far gone at this stage that it's unable to refresh its credentials and rejoin the cluster.

@yeazelm
Contributor

yeazelm commented Jun 27, 2024

Those logs do help. It looks like this could be related to EBS availability, since the hung tasks involve the storage driver:

INFO: task jbd2/nvme2n1-8:6604 blocked for more than 122 seconds.
kernel: fbcon: Taking over console
kernel:       Not tainted 6.1.90 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:jbd2/nvme2n1-8  state:D stack:0     pid:6604  ppid:2      flags:0x00000008
kernel: Call trace:
kernel:  __switch_to+0xbc/0xf0
kernel:  __schedule+0x258/0x6e0
kernel:  schedule+0x58/0xf0
kernel:  jbd2_journal_wait_updates+0x88/0x130
kernel:  jbd2_journal_commit_transaction+0x264/0x1bc0
kernel:  kjournald2+0x15c/0x470
kernel:  kthread+0xd0/0xe0

Can you confirm how many instances this happened to and what storage type you are using for your volumes? I'll keep digging in as best I can as well!

@Pluies
Author

Pluies commented Jun 27, 2024

Sure, storage-wise the machine has 4 EBS volumes attached:

  • For Bottlerocket: two standard EBS volumes (50GB for the root volume, 100GB for the data volume)
  • For our workload: two large PVCs provisioned by the AWS EBS CSI driver. Both are gp3, one is formatted as ext4 (2000GB), the other is xfs (500GB)
  • Additionally, the workload mounts an EFS volume and is configured to write a Java heap dump to this volume in case of a Java OOM

To keep everyone in sync, AWS Support seems to think this is due to memory pressure. The large workload has no memory limit, so this is definitely a possibility. We've been working off the assumption that even if it uses too much memory, we would still be protected by the OOMKiller and other limits such as kube-reserved and system-reserved memory; but I realise this assumption might be wrong.

The node exporter metrics also back this up regarding memory pressure:

# HELP node_pressure_cpu_waiting_seconds_total Total time in seconds that processes have waited for CPU time
# TYPE node_pressure_cpu_waiting_seconds_total counter
node_pressure_cpu_waiting_seconds_total 683567.871622
# HELP node_pressure_io_stalled_seconds_total Total time in seconds no process could make progress due to IO congestion
# TYPE node_pressure_io_stalled_seconds_total counter
node_pressure_io_stalled_seconds_total 3823.9046439999997
# HELP node_pressure_io_waiting_seconds_total Total time in seconds that processes have waited due to IO congestion
# TYPE node_pressure_io_waiting_seconds_total counter
node_pressure_io_waiting_seconds_total 18942.701305
# HELP node_pressure_memory_stalled_seconds_total Total time in seconds no process could make progress due to memory congestion
# TYPE node_pressure_memory_stalled_seconds_total counter
node_pressure_memory_stalled_seconds_total 1.141417592027e+06
# HELP node_pressure_memory_waiting_seconds_total Total time in seconds that processes have waited for memory
# TYPE node_pressure_memory_waiting_seconds_total counter
node_pressure_memory_waiting_seconds_total 1.141678870877e+06

The memory-related stall, 1.141678870877e+06 seconds, comes out to just over 13 days (1,141,679 s ÷ 86,400 s/day ≈ 13.2 days), which matches how long the node was dead. This might just be a coincidence if this counter is shared by all processes, but it's interesting nonetheless.

On another node (not dead 😄 ) with the same workload, these metrics come up as:

# HELP node_pressure_cpu_waiting_seconds_total Total time in seconds that processes have waited for CPU time
# TYPE node_pressure_cpu_waiting_seconds_total counter
node_pressure_cpu_waiting_seconds_total 10647.531140000001
# HELP node_pressure_io_stalled_seconds_total Total time in seconds no process could make progress due to IO congestion
# TYPE node_pressure_io_stalled_seconds_total counter
node_pressure_io_stalled_seconds_total 5830.272114
# HELP node_pressure_io_waiting_seconds_total Total time in seconds that processes have waited due to IO congestion
# TYPE node_pressure_io_waiting_seconds_total counter
node_pressure_io_waiting_seconds_total 10842.729424000001
# HELP node_pressure_memory_stalled_seconds_total Total time in seconds no process could make progress due to memory congestion
# TYPE node_pressure_memory_stalled_seconds_total counter
node_pressure_memory_stalled_seconds_total 4560.770407999999
# HELP node_pressure_memory_waiting_seconds_total Total time in seconds that processes have waited for memory
# TYPE node_pressure_memory_waiting_seconds_total counter
node_pressure_memory_waiting_seconds_total 4941.830446999999

Anyway, I'm now adding a memory limit to the large workload, so if this is memory-related it should fix the issue for us. 👍

If this is the case though, is it still potentially a Bottlerocket issue? I'd have expected either a full crash or eventual self-healing, but we ended up in this half-broken state instead. On the other hand, it would also be reasonable for Bottlerocket not to support unbounded workloads.

@yeazelm
Contributor

yeazelm commented Jun 28, 2024

For Bottlerocket: two standard EBS volumes (50GB for the root volume, 100GB for the data volume)

One thing I want to call out here before we lose it in the rest: I don't think you are benefiting from making the root volume 50GB. By default, Bottlerocket won't expand to use anything more than the default size (typically 2GB) of the root volume. Everything that might need variable storage space should end up on the second volume. I'd like to double-check whether you could save on some EBS costs by avoiding the additional 48GB of unused space.

AWS Support seems to think this is due to memory pressure.

I do think the memory pressure is causing some of this problem, but I don't think it should result in the node ending up in this state. The fact that the rest of the host is starved, even the storage layer, feels off at first glance. I'd like to see if the memory limit helps you avoid this, but I still think there is something odd going on here. The OOM killer should be able to recover and protect the rest of the OS from situations like this. Do you have any special configuration for this workload, like setting additional privileges or capabilities?

@Pluies
Author

Pluies commented Jul 1, 2024

For Bottlerocket: two standard EBS volumes (50GB for the root volume, 100GB for the data volume)

One thing I want to call out here before we lose it in the rest: I don't think you are benefiting from making the root volume 50GB. By default, Bottlerocket won't expand to use anything more than the default size (typically 2GB) of the root volume. Everything that might need variable storage space should end up on the second volume. I'd like to double-check whether you could save on some EBS costs by avoiding the additional 48GB of unused space.

Thank you for that! I'll bring it down 👍

AWS Support seems to think this is due to memory pressure.

I do think the memory pressure is causing some of this problem, but I don't think it should result in the node ending up in this state. The fact that the rest of the host is starved, even the storage layer, feels off at first glance. I'd like to see if the memory limit helps you avoid this, but I still think there is something odd going on here. The OOM killer should be able to recover and protect the rest of the OS from situations like this. Do you have any special configuration for this workload, like setting additional privileges or capabilities?

None in particular, here's the podspec:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    dune.com/tagger: dune:subproject=cluster-1
  creationTimestamp: "2024-06-28T14:15:40Z"
  generateName: cluster-1-q5mjg-1-workers-
  labels:
    app.kubernetes.io/created-by: trino-operator
    app.kubernetes.io/instance: cluster-1-q5mjg-1
    app.kubernetes.io/name: trinocluster
    app.kubernetes.io/part-of: trino-operator
    app.kubernetes.io/role: worker
    apps.kubernetes.io/pod-index: "9"
    controller-revision-hash: cluster-1-q5mjg-1-workers-5d877d47f9
    dune.com/clusterset: cluster-1
    statefulset.kubernetes.io/pod-name: cluster-1-q5mjg-1-workers-9
  name: cluster-1-q5mjg-1-workers-9
  namespace: query-engine
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: cluster-1-q5mjg-1-workers
    uid: b8b5ba7c-4457-4216-9aa7-31613391ab38
  resourceVersion: "2297524264"
  uid: f5f3ec89-5cdf-4eca-9bd1-d130bd36b1a2
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dunetech.io/nodegroup
            operator: In
            values:
            - trino
          - key: node.kubernetes.io/instance-type
            operator: In
            values:
            - m7g.8xlarge
          - key: karpenter.sh/capacity-type
            operator: In
            values:
            - on-demand
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - eu-west-1b
        - matchExpressions:
          - key: dunetech.io/nodegroup
            operator: In
            values:
            - trino
          - key: node.kubernetes.io/instance-type
            operator: In
            values:
            - m6g.8xlarge
          - key: karpenter.sh/capacity-type
            operator: In
            values:
            - on-demand
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - eu-west-1b
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/created-by: trino-operator
            app.kubernetes.io/instance: cluster-1-q5mjg-1
            app.kubernetes.io/name: trinocluster
            app.kubernetes.io/part-of: trino-operator
            app.kubernetes.io/role: worker
        topologyKey: kubernetes.io/hostname
  containers:
  - env:
    - name: AWS_STS_REGIONAL_ENDPOINTS
      value: regional
    - name: AWS_DEFAULT_REGION
      value: eu-west-1
    - name: AWS_REGION
      value: eu-west-1
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::aws-account-id-redacted:role/prod_eks_trino
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    image: aws-account-id-redacted.dkr.ecr.eu-west-1.amazonaws.com/trino:2024-06-28T08-59-33-main-57ed4c8
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 20
      httpGet:
        path: /v1/info
        port: http
        scheme: HTTP
      initialDelaySeconds: 20
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 60
    name: trino
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    - containerPort: 9000
      name: metrics
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /v1/info
        port: http
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    resources:
      requests:
        cpu: "30"
        memory: 115Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/trino
      name: config-volume
    - mountPath: /etc/trino/catalog
      name: catalog-volume
    - mountPath: /data/view-overrides
      name: view-overrides
      readOnly: true
    - mountPath: /etc/trino-event-listeners/event-listener.properties
      name: dune-event-listener
      readOnly: true
      subPath: event-listener.properties
    - mountPath: /cache
      name: cache
    - mountPath: /etc/trino-resource-groups/resource-groups.json
      name: resource-groups
      readOnly: true
      subPath: resource-groups.json
    - mountPath: /heapdumps/
      name: heapdumps
    - mountPath: /var/lib/trino/prometheus-exporter
      name: prometheus-exporter-config-volume
    - mountPath: /var/lib/trino/spill
      name: spilltodisk
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-4mxl5
      readOnly: true
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: cluster-1-q5mjg-1-workers-9
  nodeName: ip-10-0-56-162.eu-west-1.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000
    runAsGroup: 1000
    runAsUser: 1000
  serviceAccount: trino
  serviceAccountName: trino
  subdomain: cluster-1-q5mjg-1-workers
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: dunetech.io/nodegroup
    operator: Equal
    value: trino
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
  - name: cache
    persistentVolumeClaim:
      claimName: cache-cluster-1-q5mjg-1-workers-9
  - name: spilltodisk
    persistentVolumeClaim:
      claimName: spilltodisk-cluster-1-q5mjg-1-workers-9
  - configMap:
      defaultMode: 420
      name: cluster-1-q5mjg-1-workers
    name: config-volume
  - name: catalog-volume
    secret:
      defaultMode: 420
      secretName: cluster-1-catalogs-ee5f0
  - configMap:
      defaultMode: 420
      name: trino-view-overrides
    name: view-overrides
  - name: dune-event-listener
    secret:
      defaultMode: 420
      secretName: cluster-1-event-listener-7476a
  - configMap:
      defaultMode: 420
      name: cluster-1-resource-group-a800e
    name: resource-groups
  - name: heapdumps
    persistentVolumeClaim:
      claimName: trino-heapdumps
  - configMap:
      defaultMode: 420
      name: cluster-1-prometheus-exporter
    name: prometheus-exporter-config-volume
  - name: kube-api-access-4mxl5
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

Since we added limits to the podspec above (3 days ago) nodes have been solid. I'll report back here later to confirm whether this fixed everything 👍

I was wondering whether it could be replicated, so I ran a stress-ng pod with request=115Gi and no limit and made it eat up memory... But no, everything behaves well in this case: the pod is OOMKilled once it reaches the threshold as expected, and the node stays healthy.
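
For reference, the reproduction attempt looked roughly like the sketch below (the image is a placeholder and the sizes are illustrative; stress-ng's --vm/--vm-bytes/--vm-keep flags are what drive the memory growth):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: memhog-test
spec:
  restartPolicy: Never
  containers:
  - name: stress
    image: registry.example.com/stress-ng:latest   # placeholder image
    command: ["stress-ng", "--vm", "1", "--vm-bytes", "120g", "--vm-keep"]
    resources:
      requests:
        memory: 115Gi
      # intentionally no limits block, to mirror the original workload
EOF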

ginglis13 added the status/needs-info label on Jul 1, 2024
@yeazelm
Contributor

yeazelm commented Jul 2, 2024

Thanks for the detailed response @Pluies. It is curious that stress-ng didn't reproduce this but your workload does. Can you clarify how you set limits so that, if others run across this issue, they have a possible workaround to try? I'd like to document this, since it seems like the memory limit might be a good way to deal with this.

I'm unsure what could cause this issue, but there are a few things we could try to help home in on the problem. It would be interesting to see whether changing priority: 0 or preemptionPolicy: PreemptLowerPriority changes this behavior. I'm also curious whether cgroups v2 has some nuance here; using an earlier k8s variant (1.25 or earlier) or changing back to cgroups v1 might be a useful experiment. One other experiment we might try is using the AL EKS Optimized AMI to see if it reproduces there as well (since they can share the same kernel but differ in a few minor ways). You have already provided a ton of info, so thank you, and no pressure to try these out; I'm mostly documenting next steps since we still haven't found out why the nodes hang like they do under this memory pressure.

@Pluies
Author

Pluies commented Jul 2, 2024

Thanks for the detailed response @Pluies. It is curious that stress-ng didn't reproduce this but your workload does. Can you clarify how you set limits so that, if others run across this issue, they have a possible workaround to try? I'd like to document this, since it seems like the memory limit might be a good way to deal with this.

Absolutely, our fix was to add the limits block to our resource declaration in the podspec, i.e.:

    resources:
      requests:
        cpu: "30"
        memory: 115Gi
      limits:
        memory: 118Gi

The value of 118Gi was picked to be very close to the full amount of allocatable memory on the node, as retrieved with kubectl describe node. It can also be computed manually by taking the node's memory and subtracting the kube-reserved memory, the system-reserved memory, and the memory eviction threshold.
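
For reference, a quick way to compare capacity and allocatable memory for the node from the podspec below (allocatable is capacity minus kube-reserved, system-reserved, and the eviction threshold):

# Prints the node's total memory followed by its allocatable memory.
kubectl get node ip-10-0-56-162.eu-west-1.compute.internal \
  -o jsonpath='{.status.capacity.memory}{"  "}{.status.allocatable.memory}{"\n"}'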

I'm unsure what could cause this issue, but there are a few things we could try to help home in on the problem. It would be interesting to see whether changing priority: 0 or preemptionPolicy: PreemptLowerPriority changes this behavior. I'm also curious whether cgroups v2 has some nuance here; using an earlier k8s variant (1.25 or earlier) or changing back to cgroups v1 might be a useful experiment. One other experiment we might try is using the AL EKS Optimized AMI to see if it reproduces there as well (since they can share the same kernel but differ in a few minor ways). You have already provided a ton of info, so thank you, and no pressure to try these out; I'm mostly documenting next steps since we still haven't found out why the nodes hang like they do under this memory pressure.

I'm afraid I'm not going to be able to test these here as the issues happen on our prod environments and would directly impact paying customers, but these are good avenues of investigation for whoever wants to pick this up in the future 🙏

Thank you for your help @yeazelm !

@jessebye

jessebye commented Jul 3, 2024

We have noticed this happening in one of our clusters in the last week as well. A few observations:

  • We use Karpenter.
  • The problem nodes have all been running Bottlerocket OS 1.20.3.
  • Initially we did not suspect any high CPU/memory pressure on these nodes (workloads are variable, but it's a dev cluster where most things have very little load). Looking deeper into this, though, I have noticed that many of the pods where this happens have memory pressure and/or SystemOOM events immediately before they crash, so it does appear to be memory related.
  • Most of the nodes that I've checked so far have been Graviton, typically latest gen (e.g. c7g), though we did see this happen on a c5.xlarge too.
  • The node will stay running in a NotReady state indefinitely until someone cleans it up manually. It doesn't seem like kubelet is able to recover (i.e. by restarting).

Example node in bad state:

Name:               ip-10-2-101-118.us-east-2.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=arm64
                    beta.kubernetes.io/instance-type=c7g.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2c
                    k8s.io/cloud-provider-aws=e1616aad6d45db84259aa7890511291f
                    karpenter.k8s.aws/instance-category=c
                    karpenter.k8s.aws/instance-cpu=4
                    karpenter.k8s.aws/instance-cpu-manufacturer=aws
                    karpenter.k8s.aws/instance-encryption-in-transit-supported=true
                    karpenter.k8s.aws/instance-family=c7g
                    karpenter.k8s.aws/instance-generation=7
                    karpenter.k8s.aws/instance-hypervisor=nitro
                    karpenter.k8s.aws/instance-memory=8192
                    karpenter.k8s.aws/instance-network-bandwidth=1876
                    karpenter.k8s.aws/instance-size=xlarge
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/initialized=true
                    karpenter.sh/nodepool=default
                    karpenter.sh/registered=true
                    kubernetes.io/arch=arm64
                    kubernetes.io/hostname=ip-10-2-x-x.us-east-2.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=c7g.xlarge
                    role=application
                    topology.ebs.csi.aws.com/zone=us-east-2c
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2c
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.2.x.x
                    csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-redacted","efs.csi.aws.com":"i-redacted"}
                    karpenter.k8s.aws/ec2nodeclass-hash: 13086204471037975526
                    karpenter.k8s.aws/ec2nodeclass-hash-version: v2
                    karpenter.sh/nodepool-hash: 647773607843705194
                    karpenter.sh/nodepool-hash-version: v2
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 02 Jul 2024 17:46:19 -0700
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-2-x-x.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 03 Jul 2024 11:28:47 -0700
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Wed, 03 Jul 2024 11:24:16 -0700   Wed, 03 Jul 2024 11:25:45 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Wed, 03 Jul 2024 11:24:16 -0700   Wed, 03 Jul 2024 11:25:45 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Wed, 03 Jul 2024 11:24:16 -0700   Wed, 03 Jul 2024 11:25:45 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Wed, 03 Jul 2024 11:24:16 -0700   Wed, 03 Jul 2024 11:25:45 -0700   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
  InternalIP:   10.2.x.x
  InternalDNS:  ip-10-2-x-x.us-east-2.compute.internal
  Hostname:     ip-10-2-x-x.us-east-2.compute.internal
Capacity:
  cpu:                4
  ephemeral-storage:  81854Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  hugepages-32Mi:     0
  hugepages-64Ki:     0
  memory:             7947912Ki
  pods:               58
Allocatable:
  cpu:                3920m
  ephemeral-storage:  76173383962
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  hugepages-32Mi:     0
  hugepages-64Ki:     0
  memory:             6931080Ki
  pods:               58
System Info:
  Machine ID:                 ec2xxxx8bb
  System UUID:                ecxxxxbb
  Boot ID:                    92xxxx16
  Kernel Version:             6.1.92
  OS Image:                   Bottlerocket OS 1.20.3 (aws-k8s-1.29)
  Operating System:           linux
  Architecture:               arm64
  Container Runtime Version:  containerd://1.6.31+bottlerocket
  Kubelet Version:            v1.29.1-eks-61c0bbb
  Kube-Proxy Version:         v1.29.1-eks-61c0bbb
ProviderID:                   aws:///us-east-2c/i-06xxxx
Non-terminated Pods:          (24 in total)
  Namespace                   Name                                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                    ------------  ----------  ---------------  -------------  ---
  aws-controllers             s3-tc-controller-manager-64bddc98cc-fknbs               15m (0%)      1 (25%)     128Mi (1%)       256Mi (3%)     12h
  brave-quilt                 api-redis-master-0                                      200m (5%)     500m (12%)  192Mi (2%)       640Mi (9%)     3h59m
  brave-quilt                 authentication-service-redis-master-0                   200m (5%)     500m (12%)  192Mi (2%)       512Mi (7%)     112m
  brave-quilt                 securities-service-redis-master-0                       200m (5%)     500m (12%)  192Mi (2%)       512Mi (7%)     3h59m
  bright-wrench               dataplatform-airflow-worker-0                           200m (5%)     1 (25%)     1088Mi (16%)     2432Mi (35%)   4h39m
  calm-gender                 securities-service-5cdc94bf8c-2hj2g                     700m (17%)    750m (19%)  1216Mi (17%)     1792Mi (26%)   17h
  exit-karate                 integrations-service-mysql-0                            100m (2%)     500m (12%)  64Mi (0%)        384Mi (5%)     12h
  exit-karate                 trading-api-redis-master-0                              200m (5%)     500m (12%)  128Mi (1%)       448Mi (6%)     12h
  infra-test                  securities-service-6c44fc7fc-qlgmg                      700m (17%)    750m (19%)  1216Mi (17%)     1792Mi (26%)   12h
  jonesgang                   integrations-service-mysql-0                            100m (2%)     500m (12%)  64Mi (0%)        384Mi (5%)     12h
  kube-system                 aws-node-jhbcv                                          50m (1%)      0 (0%)      0 (0%)           0 (0%)         17h
  kube-system                 ebs-csi-node-w5cn2                                      30m (0%)      0 (0%)      120Mi (1%)       768Mi (11%)    17h
  kube-system                 efs-csi-node-thtt9                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
  kube-system                 eks-pod-identity-agent-bz5xp                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
  kube-system                 kube-proxy-wm5rg                                        100m (2%)     0 (0%)      0 (0%)           0 (0%)         17h
  kube-system                 node-local-dns-z5dnh                                    25m (0%)      0 (0%)      5Mi (0%)         0 (0%)         17h
  legal-horsea                integrations-service-mysql-0                            100m (2%)     500m (12%)  64Mi (0%)        384Mi (5%)     28m
  mlaskowski-cc2              shared-zookeeper-2                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         12h
  monitoring                  aws-for-fluent-bit-52jf5                                50m (1%)      0 (0%)      50Mi (0%)        250Mi (3%)     17h
  monitoring                  kube-prometheus-stack-prometheus-node-exporter-876f5    0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
  quirky-laundry              shared-zookeeper-0                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         5h41m
  sstrozier                   securities-service-57f9db5bfb-wwgr2                     700m (17%)    750m (19%)  1216Mi (17%)     1792Mi (26%)   17h
  worst-druid                 authentication-service-redis-master-0                   200m (5%)     500m (12%)  192Mi (2%)       512Mi (7%)     4h39m
  worst-druid                 shared-zookeeper-1                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         28m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                3870m (98%)   8250m (210%)
  memory             6127Mi (90%)  12858Mi (189%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
  hugepages-32Mi     0 (0%)        0 (0%)
  hugepages-64Ki     0 (0%)        0 (0%)
Events:
  Type    Reason                   Age                    From             Message
  ----    ------                   ----                   ----             -------
  Normal  DisruptionBlocked        31m (x7 over 17h)      karpenter        Cannot disrupt Node: Nominated for a pending pod
  Normal  Unconsolidatable         25m                    karpenter        Can't remove without creating 2 candidates
  Normal  DisruptionBlocked        19m                    karpenter        Cannot disrupt Node: PDB "infra-test/securities-service" prevents pod evictions
  Normal  DisruptionBlocked        15m (x2 over 12h)      karpenter        Cannot disrupt Node: PDB "mlaskowski-cc2/shared-zookeeper" prevents pod evictions
  Normal  DisruptionBlocked        13m (x3 over 28m)      karpenter        Cannot disrupt Node: PDB "worst-druid/shared-zookeeper" prevents pod evictions
  Normal  NodeNotReady             10m                    kubelet          Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeNotReady
  Normal  NodeHasNoDiskPressure    10m (x3 over 17h)      kubelet          Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientMemory  10m (x3 over 17h)      kubelet          Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal  NodeHasSufficientPID     10m (x3 over 17h)      kubelet          Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal  NodeReady                10m (x2 over 17h)      kubelet          Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeReady
  Normal  DisruptionBlocked        9m28s (x3 over 17m)    karpenter        Cannot disrupt Node: PDB "sstrozier/securities-service" prevents pod evictions
  Normal  NodeNotReady             8m55s (x2 over 14m)    node-controller  Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeNotReady
  Normal  DisruptionBlocked        5m18s (x2 over 7m23s)  karpenter        Cannot disrupt Node: PDB "calm-gender/securities-service" prevents pod evictions

@igorbdl

igorbdl commented Jul 4, 2024

We've also seen these symptoms happening today on our cluster for the first time, also running Bottlerocket OS 1.20.3. Based on the container_memory_working_set_bytes analysis for the node before it went out, it seems there were major step-ups in container memory usage for the kubernetes-nodes-cadvisor job before the node finally ran out of memory.

@monirul
Contributor

monirul commented Jul 9, 2024

@jessebye , and @igorbdl ,

Thank you for bringing this issue to our attention. To ensure we’re not dealing with multiple issues stemming from the same underlying problem, could you please provide additional details about your test environment? Specifically, it would be helpful to know the versions of EKS and Bottlerocket where your setup was functioning correctly, as well as the versions where you’re encountering issues.
Additionally, have you attempted the solution suggested in this discussion thread involving setting a resource limit? If so, did it resolve your issue? Also, does your workload involve a memory-intensive JVM process?

@jessebye

jessebye commented Jul 15, 2024

@monirul We saw the issue with EKS 1.29 and Bottlerocket 1.20.3. We've only recently switched to using Bottlerocket, so I don't have a lot of data about other versions.

Regarding the resource limit, yes we found that setting a memory limit for the pod resolved the issue. In our case, we determined it was an Argo CD pod that was causing the issue, so not JVM (Argo is Golang).

@diranged

diranged commented Jul 17, 2024

We are seeing this as well. We opened up an AWS support case (@bcressey you can ping me privately for the ticket number) and uploaded node logs to that case. Rolling back to Bottlerocket 1.19.5 / 1.29 solved the issue. I want to add that we not only saw containerd_shim deadlocking, but also significant performance degradation of our workloads on the nodes.

(spoke too soon.. rolling back to 1.19.5/1.30 did not fix the issue.. still troubleshooting)

@diranged

Update: Our issue stemmed not from the Bottlerocket version, but from hammering the local root volume on very old EC2 instances. Our compute management partner mistakenly launched r5b.24xl instances to satisfy a workload, and that workload was writing extremely high amounts of I/O to the local ephemeral storage in our pods.

The interesting thing here is that only these r5b instances exhibited the problems with kubelet becoming non-responsive and containerd_shim reporting a deadlock. Other instances were ... slow ... but they worked and didn't fundamentally crash.

@ginglis13
Contributor

I've been able to reproduce the failure mode shared by both @diranged and @Pluies, and I've written up those reproduction steps in this gist: https://gist.github.com/ginglis13/bf37a32d0f2a70b4ac5d9b9e5a960278

tl;dr: overwhelm a large node (like m7g.8xlarge or r5b.24xlarge) with memory pressure and high disk I/O, and you'll start seeing dmesg logs like:

INFO: task FOO blocked for more than X seconds.

Important to note, I've run the same reproduction on the latest Amazon Linux 2 EKS 1.30 Optimized AMI (ami-0cfd96d646e5535a8) and observed the same failure mode and node state.

I'm going to start bisecting the problem by running this repro across a few Bottlerocket variants to see if there are changes that landed in a specific Bottlerocket version that are causing this, especially anything in the kernels / container runtimes.

I've found a few blog posts on this specific error message from the kernel, which suggest that the settings vm.dirty_ratio and vm.dirty_background_ratio could be tuned to be more aggressive in flushing data to disk in order to avoid this problem. Bottlerocket already sets more aggressive values for these settings than their defaults:

bash-5.1# sysctl -n vm.dirty_background_ratio
10
bash-5.1# sysctl -n vm.dirty_ratio
20

I'm working to run the repro with these settings set to even more aggressive values to see if that could be another workaround, besides the pod resource limits suggested by @Pluies.
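
For anyone who wants to experiment along these lines on a single node, a rough sketch from the Bottlerocket admin container is below; the values are illustrative rather than recommendations, and persistent configuration would go through Bottlerocket's kernel sysctl settings rather than sysctl -w.

# From the admin container, enter the host's root shell; the commands that
# follow are typed in the shell that sheltie opens.
sudo sheltie
# Inspect current writeback thresholds (10 / 20 as shown above), then lower
# them temporarily (lost on reboot).
sysctl -n vm.dirty_background_ratio vm.dirty_ratio
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10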

@jessebye @igorbdl Any more specific logs or details you could share about your respective encounters with this issue would be much appreciated in tracking this down. I'm particularly interested in:

  • Do you see "INFO: task FOO blocked for more than X seconds." in dmesg on your nodes too? I want to make sure you're running into the same class of problem (which it sounds like is the case from what you've shared so far, but this log message is the current most common denominator)
  • Are your workloads I/O intensive?
  • Specifically for @igorbdl , what was the last known version of Bottlerocket that was working for you?

@igorbdl

igorbdl commented Jul 18, 2024

Hey @ginglis13 👋 Great news that you were able to replicate

I've since reviewed the job scheduling and memory limits to avoid this problem, so I haven't seen it happening anymore, and I didn't collect node logs at the time; unfortunately I won't be able to help with more details on our occurrences. I'm sorry about that.

It seems we started having this issue after we upgraded from 1.20.2 to 1.20.3, but this sounds odd as other users have reported it with 1.20.2, so I could be wrong on this one.

We probably had node-level system OOMs on our end for quite a while, but the nodes would simply be replaced when it happened. The symptom that led me to reach out on this issue was that the affected nodes would not be flushed by EKS: they would remain in the cluster in NotReady status indefinitely, and I'd have to manually delete the EC2 instance and the cluster node to clear things up. This issue also only seemed to happen on nodes that had pods with bound PVCs.

@larvacea
Member

The kernel message INFO: task FOO blocked for more than XXX seconds means:

  • The task (process) is in state D, also known as uninterruptible sleep. Some types of iowait will put a process into this state.
    • Some nominally uninterruptible processes can be killed, as long as the signal is fatal and not intercepted by a user-space signal handler. The hung task detector ignores these processes.
  • The task has not had a context switch in the last XXX seconds.

Since something that hangs for two minutes will quite likely hang forever, the kernel can optionally panic and reboot when the hung task detector fires. If this is a desirable outcome, see these sysctl options:

  • kernel.hung_task_panic=1: the hung task detector will trigger a panic.
  • kernel.panic=1: a panic will trigger reboot after one second.

If your applications are more robust when a node reboots than when a node hangs with unresponsive storage, this may help.
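
As a concrete sketch, applying the options above from a root shell on the node would look like this (how best to persist them across reboots is left aside here):

# Panic when the hung task detector fires, then reboot one second after the panic.
sysctl -w kernel.hung_task_panic=1
sysctl -w kernel.panic=1
# For context: how long a D-state task must be blocked before the detector fires.
cat /proc/sys/kernel/hung_task_timeout_secs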

@larvacea
Member

If one wants additional logging:

  • Setting the kernel parameter kernel.hung_task_all_cpu_backtrace=1 will add a backtrace-per-CPU to the logging from the hung task detector.
  • Setting kernel.hung_task_panic=1 calls debug_show_all_locks() before the call to panic().

@JacobHenner

Hello all:

We're also affected by this issue. We've recently adopted Bottlerocket in a subset of clusters, and have seen at least 5 instances of this failure.

The nodes it has affected have been running:

  • v1.28.10-eks-890c2ac
  • Bottlerocket OS 1.21.1 (aws-k8s-1.28)
  • Kernel 6.1.102
  • containerd://1.7.20+bottlerocket

Since we've only recently adopted Bottlerocket, we don't have data to compare the frequency of this issue between different versions of Bottlerocket. However, I don't yet have reason to believe this is a Bottlerocket-specific problem. We've seen node lockups with similar symptoms for a few years now (at least since 2021), through many versions of k8s and the kernel, across instance types, across architectures, on AL2 nodes. I recall that we've seen this affect AL2023 too, but I am less certain of the details there. For AL2, we previously opened support case 171149566301465, but it did not progress.

For both AL 2 and Bottlerocket, the manifestation has been similar:

  • There is memory pressure on the node
  • Cache memory is reclaimed
  • Cache memory hits some critically low value (which varies)
  • EBS disk IO skyrockets, the node locks up, and the control plane begins reporting NotReady

In most cases the node does not recover and must be terminated. In a few cases the node does recover (in the sense that it is marked Ready and becomes responsive again), but when this happens the node is often broken in some variable and hard-to-discover/hard-to-diagnose way - for example, kubelet and the container runtime will lose sync, a subset of network connections will fail, or containerized application processes will be in some unexpected state.

My theory is that the OOMKiller is essentially not kicking in fast enough when there's memory pressure, and that the extreme reclaim of cache memory leads to cache misses that force lots of disk IO to hit EBS. The EBS IOPS limit is hit, and then the system stalls on iowait. Eventually the condition improves (if/when the OOMKiller kicks in), but the extended stall and potentially subsequent system OOMKills break kubelet and other services, leading to node failure.
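
For what it's worth, the kernel's pressure stall information makes this kind of stall visible directly on the node while it is developing; a simple way to watch it from a shell on the host is:

# "some" = time at least one task was stalled; "full" = time no task could make
# progress. These are the counters behind the node_pressure_* metrics quoted
# earlier in this thread.
watch -n 5 'cat /proc/pressure/memory /proc/pressure/io'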

Here are some metric visualizations of the problem, this one for a Bottlerocket instance. The AL 2 instance visualizations look similar, although they only have a single EBS volume.

[image: metric visualizations for an affected Bottlerocket instance]

Let me know if there are other metrics I can provide or experiments I can run to help get to the bottom of this issue. It'd be such a relief to not have to deal with the consequences of this recurring issue.

@James-Quigley

Chiming in to say I've also experienced this same thing: Bottlerocket with EKS (1.28) and Karpenter (latest). A node starts to experience memory pressure, then enters a NotReady state and stays there indefinitely until the underlying EC2 instance is terminated.

I've been able to repro with test pods running stress, but not consistently. Trying to figure out a way to consistently reproduce the issue.

@geddah

geddah commented Sep 23, 2024

I am facing the same issue here. The setup I have is:

EC2 instance type: t3.medium
OS Image: Bottlerocket OS 1.19.1 (aws-k8s-1.29)
Kernel Version: 6.1.72
Container Runtime: containerd://1.6.28+bottlerocket

The problem workload is Prometheus, so this is not an issue that affects only JVM workloads. The behavior is very similar to what @JacobHenner has described. In my case, if I give the node "enough" time (many hours), it comes back up without the need for a reboot.

@jmt-lab
Contributor

jmt-lab commented Sep 27, 2024

Hi there, for information and context, have you tried the above suggestion of adjusting the resource specifications in your Kubernetes pods? On a t3.medium you will need to restrict memory, ideally to below the instance's 4GB of memory, with something like:

resources:
   requests:
      memory: 3Gi
   limits:
      memory: 3.4Gi

You can play around with different values here. By doing this, you have Kubernetes restrict memory usage, which helps reduce the impact of this issue. We are still looking into the exact cause to see what we can do on our side to help address it.

@JacobHenner

Hi there, for information and context, have you tried the above suggestion of adjusting the resource specifications in your Kubernetes pods? On a t3.medium you will need to restrict memory, ideally to below the instance's 4GB of memory, with something like:

resources:
   requests:
      memory: 3Gi
   limits:
      memory: 3.4Gi

You can play around with different values here. By doing this, you have Kubernetes restrict memory usage, which helps reduce the impact of this issue. We are still looking into the exact cause to see what we can do on our side to help address it.

Unfortunately it's not this simple for us. Our clusters host a variety of workloads, and these workloads are managed by a variety of teams. Application owners are told to set their memory requests and limits appropriately, but for a range of reasons this doesn't always happen. In the case where an application is exceeding its memory request and the node encounters memory pressure, we'd expect that application to be OOMKilled or node-pressure evicted because it is misconfigured. We would not expect other properly-configured applications to suffer because memory pressure has caused the entire node to lock up.

So we wonder: why aren't pods being killed quickly enough under node memory pressure to prevent this issue from occurring? How can we rely on the Kubernetes requests/limits model if the potential consequences of exceeding a request aren't limited to the offending container, but instead extend to the entire node the container is running on? Team multitenancy seems untenable under these conditions.

@geddah

geddah commented Sep 30, 2024

Thanks @jmt-lab for the response. Even though configuring limits reduces the number of crashes, they still happen as our workloads grow in number and capacity. I will keep following this thread for the real fix. Let me know if you need any more context that might help your team with the fix. All the best!

@ginglis13
Contributor

@JacobHenner @geddah @James-Quigley thank you all for the additional data as we work through a root cause for this issue. One commonality between your problematic variants (and those of previous commenters on this issue) is that the version of containerd in the variants reported here has been susceptible to containerd/containerd#10589. The implication of this issue, as I understand it, is that OOM-killed containers may still appear to containerd as running; this seems to be the inverse of our problem since processes in this bug report are getting OOM-killed, but the fact that this is a bug with the OOM killer makes me suspicious nonetheless.

Bottlerocket v1.23.0 bumped containerd to v1.7.22 with a patch for this bug via bottlerocket-core-kit v2.5.0, which includes bottlerocket-os/bottlerocket-core-kit#143. Would you be willing to upgrade your Bottlerocket nodes to v1.23.0+ and report back with any additional findings?

@James-Quigley

I can confirm that this happened to me this morning on a node running Bottlerocket OS 1.23.0 (aws-k8s-1.28)
[screenshot from 2024-10-01: node/pod memory metrics at the time of the incident]

Here's a screenshot of the memory metrics for the node/pods at that time.

@remiflament

remiflament commented Oct 10, 2024

Same experience of EC2 unresponsiveness.

config:

  • latest bottlerocket 1.24.1
  • flavor: ecs-2 x86_64
  • only using t*.medium burstable instances (dev environment)
  • memory soft limit on ecs task
  • ebs gp2 100 IOPS

I never got the issue in my production environment, which has the same config except that the EC2 types are c*/m* xlarge.

For now, my test is to move my EBS config to gp3 for the 3000 IOPS; I'll see whether I continue to see the problem.

@ytsssun
Contributor

ytsssun commented Oct 14, 2024

Repro

I was able to reproduce the behavior by running a custom program (eatmem) that steadily consumes all the host memory (400MB per second). The eatmem program: https://gist.github.com/ytsssun/b264c49cdfb8959b6d50b07c9cf4cb32.

I ran a process to monitor the memory usage:

while true; do
  free -m | grep Mem | while read -r _ total used _; do
    percent=$((used * 100 / total))
    printf "Memory Usage: %s / %s (%d%%)\n" "${used}M" "${total}M" "$percent"
  done
  sleep 1
done

The node became unresponsive after memory consumption hit 100%, and it never came back. kubectl showed the node status as NotReady, stuck in that state.

Attempt for mitigation

I tried deploying a userspace oom-killer as a DaemonSet in my cluster. It simply triggers the kernel OOM killer manually, by running echo f > /host_proc/sysrq-trigger when it detects that available memory is below a certain threshold.

After that, I ran the eatmem program on a healthy node; it froze for about 8 minutes and then came back. I had to adjust the MIN_MEMORY (to, say, 1000MB) and the interval for memory usage detection (to 1s) for my use case. You can see my adjusted oom-killer.

I recommend giving this lightweight oom-killer a try to see if it helps with the issue. The oom-killer tries to kill processes based on the oom_score_adj computed for them; the process with the highest oom_score_adj gets killed first.
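
For readers who don't want to open the gist, the idea is roughly the loop below. This is a hypothetical sketch rather than the script linked above; it assumes the DaemonSet pod runs privileged, mounts the host's /proc at /host_proc, and that sysrq is enabled on the host.

#!/usr/bin/env bash
# Hypothetical watchdog: if MemAvailable drops below a threshold, ask the
# kernel to run its OOM killer via the sysrq trigger.
MIN_MEMORY_KB=$((1000 * 1024))   # e.g. 1000MB, as adjusted above
INTERVAL_SEC=1

while true; do
  avail_kb=$(awk '/MemAvailable/ {print $2}' /host_proc/meminfo)
  if [ "$avail_kb" -lt "$MIN_MEMORY_KB" ]; then
    echo "MemAvailable ${avail_kb}kB below threshold; invoking kernel OOM killer"
    echo f > /host_proc/sysrq-trigger
  fi
  sleep "$INTERVAL_SEC"
done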

oom-killer log analysis

Below are more details by analyzing the dmesg log.

  1. Sysrq manual trigger invoked at timestamp 239.457726.
[  239.457726] sysrq: Manual OOM execution
  2. Initially, the oom-killer killed a process with an oom_score_adj of 1000.
[  239.458525] Out of memory: Killed process 5773 (mpi-operator) total-vm:1871132kB, anon-rss:11272kB, file-rss:36820kB, shmem-rss:0kB, UID:0 pgtables:288kB oom_score_adj:1000
  3. Subsequent oom-killing events killed a few other processes with high oom_score_adj values.
[  526.448344] Out of memory: Killed process 6475 (free) total-vm:23112kB, anon-rss:128kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:1000
[  542.664490] Out of memory: Killed process 6473 (sh) total-vm:4508kB, anon-rss:108kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:1000
[  542.850447] Out of memory: Killed process 2899 (sh) total-vm:4508kB, anon-rss:104kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:1000
[  542.970388] Out of memory: Killed process 5611 (coredns) total-vm:1283516kB, anon-rss:9900kB, file-rss:0kB, shmem-rss:0kB, UID:65532 pgtables:216kB oom_score_adj:996
[  542.971686] Out of memory: Killed process 5644 (coredns) total-vm:1283516kB, anon-rss:9900kB, file-rss:0kB, shmem-rss:0kB, UID:65532 pgtables:216kB oom_score_adj:996
[  545.663240] Out of memory: Killed process 2357 (amzn-guardduty-) total-vm:929236kB, anon-rss:35652kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:248kB oom_score_adj:984
  4. The oom-killer found the eatmem process and killed it.
[  719.286731] Out of memory: Killed process 5978 (eatmem) total-vm:17801320kB, anon-rss:15257072kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:29996kB oom_score_adj:1
  5. The system shows recovery:
[  731.291144] IPv6: ADDRCONF(NETDEV_CHANGE): eni7e37f784474: link becomes ready
[  731.291178] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  734.740979] pci 0000:00:06.0: [1d0f:ec20] type 00 class 0x020000
[  734.741076] pci 0000:00:06.0: reg 0x10: [mem 0x00000000-0x00003fff]
[  734.741148] pci 0000:00:06.0: reg 0x18: [mem 0x00000000-0x000fffff pref]
[  734.741323] pci 0000:00:06.0: enabling Extended Tags
[  734.742062] pci 0000:00:06.0: BAR 2: assigned [mem 0xc1600000-0xc16fffff pref]
[  734.742085] pci 0000:00:06.0: BAR 0: assigned [mem 0xc1514000-0xc1517fff]
[  734.742146] ena 0000:00:06.0: enabling device (0000 -> 0002)
[  734.749973] ena 0000:00:06.0: ENA device version: 0.10
[  734.749977] ena 0000:00:06.0: ENA controller version: 0.0.1 implementation version 1
[  734.849623] ena 0000:00:06.0: ENA Large LLQ is disabled
[  734.861392] ena 0000:00:06.0: Elastic Network Adapter (ENA) found at mem c1514000, mac addr 02:82:87:02:9b:6d
[  734.926653] ena 0000:00:06.0 eth1: Local page cache is disabled for less than 16 channels

Time between SysRq trigger and initial recovery
731.291144 - 239.457726 = 491.833418 seconds ≈ 8 minutes 11 seconds

@bcressey
Contributor

bcressey commented Oct 14, 2024

@ytsssun for testing purposes the "lightweight oom-killer" might be OK, but it has some notable defects:

  1. it will invoke the OOM killer every second if the system is below 1 GiB free memory; this could happen constantly on nodes that are otherwise not seeing a runaway process consuming memory
  2. it seems like the script itself may be getting OOM-killed; the kill of free is suspicious

I don't think this makes sense as the basis for a production solution. I'd look at running something like nohang or earlyoom in a daemonset instead.
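
For illustration only, a DaemonSet for something like earlyoom could be shaped as below; the image is a placeholder, the -m threshold is an arbitrary example, and this is untested.

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: earlyoom
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: earlyoom
  template:
    metadata:
      labels:
        app: earlyoom
    spec:
      hostPID: true                                   # let earlyoom see and signal host processes
      containers:
      - name: earlyoom
        image: registry.example.com/earlyoom:latest   # placeholder image
        args: ["-m", "4", "-r", "0"]                  # act when available memory drops below ~4%
        securityContext:
          privileged: true
        resources:
          requests:
            cpu: 10m
            memory: 32Mi
          limits:
            memory: 64Mi
EOF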

@ytsssun
Contributor

ytsssun commented Oct 14, 2024

@bcressey is right, this simple oom-killer is for testing purposes. I am evaluating nohang and will provide my updates once done.

@ytsssun
Contributor

ytsssun commented Dec 4, 2024

Sorry for the delay. I finally got the chance to try nohang and deploy it as a DaemonSet on my nodes. It worked pretty nicely for my setup and kept the node responsive.

Here is my setup.

My full steps to verify the nohang behavior:

Steps:

1. Run the memory-intensive pods:

See the pod specs: https://gist.github.com/ytsssun/f5d8d4a6d4926588914e755577452c88. It tries to launch a number of pods that eat up the host memory. Tune the env below:

        env:
        - name: TARGET_MB
          value: "100000"
        - name: INTERVAL_SEC
          value: "1"
        - name: STEP_MB
          value: "2000"

The node will then run out of memory and become unresponsive:

Memory Usage: 9480M / 127403M (7%)
Memory Usage: 11762M / 127403M (9%)
Memory Usage: 20420M / 127403M (16%)
Memory Usage: 28783M / 127403M (22%)
Memory Usage: 29928M / 127403M (23%)
Memory Usage: 37879M / 127403M (29%)
Memory Usage: 46548M / 127403M (36%)
Memory Usage: 49605M / 127403M (38%)
Memory Usage: 55230M / 127403M (43%)
Memory Usage: 63873M / 127403M (50%)
Memory Usage: 69669M / 127403M (54%)
Memory Usage: 72416M / 127403M (56%)
Memory Usage: 81026M / 127403M (63%)
Memory Usage: 92386M / 127403M (72%)
Memory Usage: 101601M / 127403M (79%)
Memory Usage: 120439M / 127403M (94%)
Memory Usage: 124730M / 127403M (97%)
Memory Usage: 121418M / 127403M (95%)
Memory Usage: 126732M / 127403M (99%)
20:22:32 [ERROR] Failed to read from WebSocket: IO error: Connection reset by peer (os error 104)
20:22:32 [WARN] Didn't receive a return code or signal; unsure what happened
[ssm-user@control]$ exit

The node is stuck in NotReady status:

 Normal   NodeNotReady  16s                   node-controller  Node ip-192-168-189-126.us-west-2.compute.internal status is now: NodeNotReady

2. On a fresh node, deploy the nohang setup and verify that the nohang DaemonSet pod is running on the target node:

kube-system        nohang-dp9vb                          1/1     Running   0          12s

Now run the same workload as mentioned in step 1.

The node stayed responsive even while approaching the memory limit. You can see memory usage fluctuate as nohang kicked in and killed processes, keeping the node responsive:

Memory Usage: 124347M / 127403M (97%)
Memory Usage: 119371M / 127403M (93%)
Memory Usage: 126992M / 127403M (99%)
Memory Usage: 93590M / 127403M (73%)
Memory Usage: 98357M / 127403M (77%)
Memory Usage: 108241M / 127403M (84%)
Memory Usage: 117202M / 127403M (91%)
Memory Usage: 123582M / 127403M (97%)
Memory Usage: 126118M / 127403M (98%)
Memory Usage: 103571M / 127403M (81%)
Memory Usage: 119292M / 127403M (93%)
Memory Usage: 127384M / 127403M (99%)
Memory Usage: 96542M / 127403M (75%)

Logs in nohang pod:

>>=== STARTING implement_corrective_action() ====>>
Memory status that requires corrective actions:
  MemAvailable [4026 MiB, 3.2 %] <= soft_threshold_min_mem [6370 MiB, 5.0 %]
  SwapFree [0 MiB, 0.0 %] <= soft_threshold_min_swap [0 MiB, - %]
Found 121 tasks with non-zero oom_score (except init and self) in 9ms
TOP-15 tasks by badness:
  Name                PID badness
  --------------- ------- -------
  mem-eater       1607829    1438
  mem-eater       1607962    1437
  mem-eater       1608264    1400
  mem-eater       1608402    1399
  mem-eater       1608542    1382
  mem-eater       1608679    1379
  mem-eater       1608681    1379
  mem-eater       1608954    1356
  mem-eater       1609069    1354
  amzn-guardduty-    5139    1332
  coredns            7415    1332
  coredns            7530    1332
  sleep              7633    1332
  containerd-shim    4178     667
  containerd-shim    4179     667
TOP printed in 0ms; process with highest badness:
  PID: 1607829, name: mem-eater, badness: 1438
Recheck memory levels...
Memory status that requires corrective actions:
  MemAvailable [4094 MiB, 3.2 %] <= soft_threshold_min_mem [6370 MiB, 5.0 %]
  SwapFree [0 MiB, 0.0 %] <= soft_threshold_min_swap [0 MiB, - %]
Victim status (found in 97ms):
  PID:       1607829, name: mem-eater, state: S (sleeping), EUID: 0, SID: 1607829 (mem-eater), lifetime: 17.7s
  badness:   1438, oom_score:  1438, oom_score_adj:  1000
  Vm, MiB:   Size: 21379, RSS: 20023 (Anon: 20023, File: 0, Shmem: 0), Swap: 0
  cgroup_v1: /
  cgroup_v2: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod96b00689_df91_4874_88aa_d9f50391aa3c.slice/cri-containerd-d28dc0ac8925ebfca24868ab10c5fb64669dab17a7ba2b515d4336c52fc409bf.scope
  ancestry:  PID 1601608 (containerd-shim) <= PID 1 (systemd)
  exe realpath: /app/mem-eater
  cwd realpath: /app
Memory info, MiB:
  total=127403, used=121572, free=5633, available=4907, shared=5, buffers=0, cache=198,
  swap_total=0, swap_used=0, swap_free=0
Memory pressure (system-wide):
  some avg10=8.19 avg60=10.82 avg300=6.32
  full avg10=8.16 avg60=10.73 avg300=6.27
Implementing a corrective action:
  Sending SIGTERM to the victim
OK; total response time: 109ms
What happened in the last 1min 11s:
  [ OK ] Sending SIGKILL to mem-eater: 24
  [ OK ] Sending SIGTERM to mem-eater: 14
The victim died in 0.01s
Memory status after implementing a corrective action:
  MemAvailable: 5106.5 MiB, SwapFree: 0.0 MiB
<<=== FINISHING implement_corrective_action() ===<<

Feel free to give this a try.

@cyrus-mc
Copy link

We too have been experiencing this issue.

Details

  • Bottlerocket version 1.26.2
  • EKS version 1.3.0
  • Running Karpenter as cluster autoscaler

This initially started on memory-intensive jobs used for build pipelines. Outside of the required DaemonSets, only a single workload pod was scheduled on these nodes: the GitHub runner.

The node would move to NotReady and the pods would get stuck in Terminating. Manual deletion of the node was required to recover (I think the longest we kept a node around before cleaning up was a day or so).

As noted by others, we worked around this by setting a memory limit so that the job was killed before it could exhaust the node's memory. That held for a while, until the problem started cropping up in areas unrelated to the build nodes/pods. The first such case was our Prometheus server: Prometheus memory consumption grows as services and nodes increase, and one of our clusters scales quite heavily, so a scale-up in nodes caused Prometheus memory consumption to spike, which put the node under memory pressure and killed it.
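
For reference, the guardrail described above is just an ordinary container memory limit, sized below the node's allocatable memory so the kernel kills the container's processes rather than starving the whole node. The values here are illustrative:

# Illustrative values: pick a limit comfortably below the node's allocatable memory.
resources:
  requests:
    memory: 8Gi
  limits:
    memory: 12Gi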

EC2 metrics:

[Screenshot: EC2 instance metrics, 2024-12-19 9:36 AM]

You can see that all network stats drop to 0 at the time the node goes offline. CPU seems to stay flat, although we have also seen cases where the CPU shows a steady 100%. Prometheus metrics for this node stop reporting at the time of failure.

I have attempted to reproduce this (not as thoroughly as others on this thread) by creating a simple Go process that requests large chunks of memory. What I see is what I would expect: when we consume 100% (or close to it) of the node's memory, the node temporarily goes offline, the OOM killer is invoked, the process is killed, and the node recovers. This recovery takes 10-20 seconds, sometimes less.

One thing I did note (and I may just be interpreting this incorrectly): I tried to trigger the hard eviction threshold by consuming just enough memory to get memory.available < 100Mi (the kubelet's default hard eviction threshold).

bash-5.1# free -m
               total        used        free      shared  buff/cache   available
Mem:            1862        1807          63           2         132          55
Swap:              0           0           0

This stat line is a bit odd to me: available is usually greater than free, since available also counts memory that can be reclaimed by evicting cache, so seeing free (memory that is not used at all) come out higher than available doesn't make sense. That aside, the box in this case has < 100Mi of available memory. I would expect that to trigger the MemoryPressure condition and the kubelet to kick in and try to free memory, but I don't see that at all. The node just continues to function as-is with no state change. It isn't until I consume more memory that the OOM killer eventually kicks in.
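
For context, the default thresholds mentioned above correspond to a kubelet configuration roughly like the following. This is generic KubeletConfiguration syntax for illustration; on Bottlerocket the equivalent knobs are exposed through its settings API (e.g. settings.kubernetes.eviction-hard) rather than a config file you edit by hand.

# Kubelet's default hard eviction thresholds on Linux; raising memory.available
# gives the kubelet more headroom to evict pods before the node itself runs out.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
  nodefs.inodesFree: "5%"

Note also that the kubelet computes memory.available from cgroup memory stats rather than from /proc/meminfo, so its view can differ from what free reports.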

I have been a user of Bottlerocket for years and I do not recall running into this situation previously. It seems to have only surfaced within the last 9 months or so.

@petewilcock
Copy link

We're seeing this on nodes running Kafka, which have intermittent high network spikes that seem to break the kubelet connection. My observation is that this mostly happens on smaller nodes, e.g. t3.small, where I'm guessing some kind of network saturation causes part of the networking stack to crash and never recover.

My fix for now is to limit which nodes these workloads are scheduled on, favouring those with higher network capacity to hopefully avoid this issue.
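
One way to express that restriction at the pod level is node affinity on the well-known instance-type label. This is only a sketch, and the instance types listed are examples rather than a recommendation from this thread:

# Sketch: keep network-heavy pods off small/burstable instance types.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: NotIn
          values: ["t3.small", "t3.medium", "t3.large"]

If Karpenter is provisioning the nodes (as elsewhere in this thread), the same constraint can also be expressed in the NodePool requirements instead.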
