Unresponsive/unreachable Bottlerocket EKS nodes #4075
Comments
Thanks for opening this issue @Pluies. This does look concerning and we would like to get to the bottom of this! I'll look internally for that support ticket to see if we can find out what is happening. |
Okay, I've got good news: after 13 days of being in this state, one node came back to life! Well, kind of. We received a […], but apart from that, no change. The node is still showing as NotReady and unreachable, not reachable via […]. I've taken a new snapshot of the Bottlerocket data volume, and I'm seeing new logs in the journal though; attaching it here. I think this has the meat of it (kernel messages + OOM kills). There are some more messages later showing kubelet errors; I assume the node is so far gone at this stage that it's unable to refresh its credentials and rejoin the cluster. |
Those logs do help. It looks like this could be an EBS availability issue, since the hung tasks seem to involve the storage driver:
Can you confirm how many instances this happened to and what storage type you are using for your volumes? I'll keep digging in as best I can as well! |
Sure, storage-wise the machine has 4 EBS volumes attached:
To keep everyone in sync: AWS Support seems to think this is due to memory pressure. The large workload has no memory limit, so this is definitely a possibility. We've been working off the assumption that even if they do use too much memory, we would still be protected by the OOM killer and other limits such as kube-reserved and system-reserved memory, but I realise this assumption might be wrong. The metrics also point to memory pressure:
The memory-related stall of 1.141678870877e+06 seconds corresponds to just over 13 days, which matches pretty well how long the node was dead for. This might just be a coincidence, if this counter is shared by all processes, but it's interesting nonetheless. On another node (not dead 😄) with the same workload, these metrics come up as:
Anyway, I'm now adding a memory limit to the large workload, so if this is memory-related it should fix the issue for us. 👍 If this is the case though, is it still potentially a Bottlerocket issue? I'd have expected either a full crash or eventual self-healing, but we ended up in this half-broken state instead. On the other hand, it would also be reasonable for Bottlerocket to not support unbounded workloads. |
One thing I want to call out here just before we lose it in the rest: I don't think you are benefiting from making the root volume 50GB. By default, Bottlerocket won't expand to use anything more than the default size (typically 2GB) of the root volume. Everything that might need variable storage space should end up on the second volume. I'd like to double check if you could save on some EBS costs by avoiding the additional 48GB of unused space.
I do think the memory pressure is causing some of this problem, but I don't think it should result in the node ending up in this state. The fact that the rest of the host is starved, even the storage layer, feels off at first glance. I'd like to see if the memory limit helps you avoid this, but I still think there is something odd going on here. The OOM killer should be able to recover and protect the rest of the OS from situations like this. Do you have any special configuration for this workload, like setting additional privileges or capabilities? |
Thank you for that! I'll bring it down 👍
None in particular, here's the podspec:

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    dune.com/tagger: dune:subproject=cluster-1
  creationTimestamp: "2024-06-28T14:15:40Z"
  generateName: cluster-1-q5mjg-1-workers-
  labels:
    app.kubernetes.io/created-by: trino-operator
    app.kubernetes.io/instance: cluster-1-q5mjg-1
    app.kubernetes.io/name: trinocluster
    app.kubernetes.io/part-of: trino-operator
    app.kubernetes.io/role: worker
    apps.kubernetes.io/pod-index: "9"
    controller-revision-hash: cluster-1-q5mjg-1-workers-5d877d47f9
    dune.com/clusterset: cluster-1
    statefulset.kubernetes.io/pod-name: cluster-1-q5mjg-1-workers-9
  name: cluster-1-q5mjg-1-workers-9
  namespace: query-engine
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: cluster-1-q5mjg-1-workers
    uid: b8b5ba7c-4457-4216-9aa7-31613391ab38
  resourceVersion: "2297524264"
  uid: f5f3ec89-5cdf-4eca-9bd1-d130bd36b1a2
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: dunetech.io/nodegroup
            operator: In
            values:
            - trino
          - key: node.kubernetes.io/instance-type
            operator: In
            values:
            - m7g.8xlarge
          - key: karpenter.sh/capacity-type
            operator: In
            values:
            - on-demand
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - eu-west-1b
        - matchExpressions:
          - key: dunetech.io/nodegroup
            operator: In
            values:
            - trino
          - key: node.kubernetes.io/instance-type
            operator: In
            values:
            - m6g.8xlarge
          - key: karpenter.sh/capacity-type
            operator: In
            values:
            - on-demand
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - eu-west-1b
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/created-by: trino-operator
            app.kubernetes.io/instance: cluster-1-q5mjg-1
            app.kubernetes.io/name: trinocluster
            app.kubernetes.io/part-of: trino-operator
            app.kubernetes.io/role: worker
        topologyKey: kubernetes.io/hostname
  containers:
  - env:
    - name: AWS_STS_REGIONAL_ENDPOINTS
      value: regional
    - name: AWS_DEFAULT_REGION
      value: eu-west-1
    - name: AWS_REGION
      value: eu-west-1
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::aws-account-id-redacted:role/prod_eks_trino
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    image: aws-account-id-redacted.dkr.ecr.eu-west-1.amazonaws.com/trino:2024-06-28T08-59-33-main-57ed4c8
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 20
      httpGet:
        path: /v1/info
        port: http
        scheme: HTTP
      initialDelaySeconds: 20
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 60
    name: trino
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    - containerPort: 9000
      name: metrics
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /v1/info
        port: http
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    resources:
      requests:
        cpu: "30"
        memory: 115Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/trino
      name: config-volume
    - mountPath: /etc/trino/catalog
      name: catalog-volume
    - mountPath: /data/view-overrides
      name: view-overrides
      readOnly: true
    - mountPath: /etc/trino-event-listeners/event-listener.properties
      name: dune-event-listener
      readOnly: true
      subPath: event-listener.properties
    - mountPath: /cache
      name: cache
    - mountPath: /etc/trino-resource-groups/resource-groups.json
      name: resource-groups
      readOnly: true
      subPath: resource-groups.json
    - mountPath: /heapdumps/
      name: heapdumps
    - mountPath: /var/lib/trino/prometheus-exporter
      name: prometheus-exporter-config-volume
    - mountPath: /var/lib/trino/spill
      name: spilltodisk
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-4mxl5
      readOnly: true
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: cluster-1-q5mjg-1-workers-9
  nodeName: ip-10-0-56-162.eu-west-1.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1000
    runAsGroup: 1000
    runAsUser: 1000
  serviceAccount: trino
  serviceAccountName: trino
  subdomain: cluster-1-q5mjg-1-workers
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: dunetech.io/nodegroup
    operator: Equal
    value: trino
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
  - name: cache
    persistentVolumeClaim:
      claimName: cache-cluster-1-q5mjg-1-workers-9
  - name: spilltodisk
    persistentVolumeClaim:
      claimName: spilltodisk-cluster-1-q5mjg-1-workers-9
  - configMap:
      defaultMode: 420
      name: cluster-1-q5mjg-1-workers
    name: config-volume
  - name: catalog-volume
    secret:
      defaultMode: 420
      secretName: cluster-1-catalogs-ee5f0
  - configMap:
      defaultMode: 420
      name: trino-view-overrides
    name: view-overrides
  - name: dune-event-listener
    secret:
      defaultMode: 420
      secretName: cluster-1-event-listener-7476a
  - configMap:
      defaultMode: 420
      name: cluster-1-resource-group-a800e
    name: resource-groups
  - name: heapdumps
    persistentVolumeClaim:
      claimName: trino-heapdumps
  - configMap:
      defaultMode: 420
      name: cluster-1-prometheus-exporter
    name: prometheus-exporter-config-volume
  - name: kube-api-access-4mxl5
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
```

Since we added limits to the podspec above (3 days ago), nodes have been solid. I'll report back here later to confirm whether this fixed everything 👍 I was wondering whether it could be replicated, so I ran a […] |
Thanks for the detailed response @Pluies. It is curious that […]. I'm unsure what could cause this issue, but there are a few things we could try to help home in on what might be the problem. It might be interesting to see if changing the […] |
Absolutely, our fix was to add the limits block to our resource declaration in the podspec, i.e.:

```yaml
resources:
  requests:
    cpu: "30"
    memory: 115Gi
  limits:
    memory: 118Gi
```

The value of 118Gi was picked so as to be very close to the full amount of allocatable memory on the node, as retrieved with […]
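For reference, one way to look up a node's allocatable memory is via kubectl; a sketch (not necessarily the exact command used above, and <node-name> is a placeholder):

```sh
# Prints the allocatable memory the kubelet reports for the node (often a value in Ki).
kubectl get node <node-name> -o jsonpath='{.status.allocatable.memory}'
```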
I'm afraid I'm not going to be able to test these here as the issues happen on our prod environments and would directly impact paying customers, but these are good avenues of investigation for whoever wants to pick this up in the future 🙏 Thank you for your help @yeazelm ! |
We have noticed this happening in one of our clusters in the last week as well. A few observations:
Example node in bad state:
|
We've also seen these symptoms happening today on our cluster for the first time. Also running |
Thank you for bringing this issue to our attention. To ensure we’re not dealing with multiple issues stemming from the same underlying problem, could you please provide additional details about your test environment? Specifically, it would be helpful to know the versions of EKS and Bottlerocket where your setup was functioning correctly, as well as the versions where you’re encountering issues. |
@monirul We saw the issue with EKS 1.29 and Bottlerocket 1.20.3. We've only recently switched to using Bottlerocket, so I don't have a lot of data about other versions. Regarding the resource limit, yes we found that setting a memory limit for the pod resolved the issue. In our case, we determined it was an Argo CD pod that was causing the issue, so not JVM (Argo is Golang). |
We are seeing this as well. We opened up an AWS support case (@bcressey you can ping me privately for the ticket number) and uploaded node logs to that case. (spoke too soon.. rolling back to 1.19.5/1.30 did not fix the issue.. still troubleshooting) |
Update: our issue stemmed not from the Bottlerocket version, but from hammering the local root volume on very old EC2 instances. Our compute management partner launched […]. The interesting thing here is that only these […] |
I've been able to reproduce the failure mode shared by both @diranged and @Pluies, and I've written up those reproduction steps in this gist: https://gist.github.com/ginglis13/bf37a32d0f2a70b4ac5d9b9e5a960278 tl;dr: overwhelm a large node (like […]).
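In spirit, the reproduction amounts to scheduling an unbounded memory hog on the node. A generic sketch of such a pod (illustrative, not the gist's exact spec; polinux/stress is just one commonly used stress image):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mem-hog
spec:
  restartPolicy: Never
  containers:
  - name: stress
    image: polinux/stress      # assumed stress image; any memory hog works
    command: ["stress"]
    # Allocate far more memory than the node can spare; no resources.limits,
    # so the container is free to push the host into an OOM situation.
    args: ["--vm", "1", "--vm-bytes", "120G", "--vm-hang", "0"]
```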
Important to note, I've run the same reproduction on the latest Amazon Linux 2 EKS 1.30 Optimized AMI (ami-0cfd96d646e5535a8) and observed the same failure mode and node state. I'm going to start bisecting the problem by running this repro across a few Bottlerocket variants to see if there are changes that landed in a specific Bottlerocket version that are causing this, especially something in the kernels / container runtimes. I've found a few blog posts on this specific error message from the kernel:
These posts suggest the settings […]
I'm working to run the repro with these settings set to even more aggressive values to see if that could be another workaround, other than the pod resource request limits suggested by @Pluies. @jessebye @igorbdl Any more specific logs or details you could share about your respective encounters with this issue would be much appreciated in tracking this down. I'm particularly interested in: […]
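For anyone able to gather data from an affected node, Bottlerocket ships a log collector that can be run from the admin container; a sketch, assuming the admin container is enabled:

```sh
# From the Bottlerocket admin container:
sudo sheltie   # drop into a root shell in the host namespaces
logdog         # collects logs into a support bundle under /var/log/support/
```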
|
Hey @ginglis13 👋 Great news that you were able to replicate it. I've reviewed the job scheduling and memory limits to avoid this problem, so I haven't seen this happening anymore, and I hadn't collected node logs at the time, so unfortunately I won't be able to help with more details on our occurrences. I'm sorry about that. It seems we started having this issue after we upgraded from […]. We had probably been hitting node-level system OOMs for quite a while, but the nodes would simply get replaced when this happened. The symptom that led me to reach out in this issue was that the nodes would not be flushed by EKS and the ones affected would remain in the cluster in […] |
The kernel message […] comes from the hung task detector.
Since something that hangs for two minutes will quite likely hang forever, the kernel can optionally panic and reboot when the hung task detector fires. If this is a desirable outcome, see these sysctl options:
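For illustration (the values here are examples, not recommendations), the hung-task knobs in this family look like the following; on Bottlerocket they would be applied through the settings.kernel.sysctl API rather than by running sysctl in a shell:

```sh
sysctl -w kernel.hung_task_timeout_secs=120  # report tasks blocked for more than 120 seconds
sysctl -w kernel.hung_task_panic=1           # panic when the hung task detector fires
sysctl -w kernel.panic=10                    # reboot automatically 10 seconds after a panic
```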
If your applications are more robust when a node reboots than when a node hangs with unresponsive storage, this may help. |
If one wants additional logging:
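As an example of the kind of knobs meant here (an assumption on my part, not a quoted list), kernel-side visibility can be increased with settings such as:

```sh
sysctl -w kernel.hung_task_warnings=-1  # keep reporting hung tasks instead of stopping after the default 10 warnings
sysctl -w vm.oom_dump_tasks=1           # dump the task list to the kernel log on every OOM kill
```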
|
Hello all: We're also affected by this issue. We've recently adopted Bottlerocket in a subset of clusters, and have seen at least 5 instances of this failure. The nodes it has affected have been […]
Since we've only recently adopted Bottlerocket, we don't have data to compare the frequency of this issue between different versions of Bottlerocket. However, I don't yet have reason to believe this is a Bottlerocket-specific problem. We've seen node lockups with similar symptoms for a few years now (at least since 2021), through many versions of k8s and the kernel, across instance types, across architectures, on AL2 nodes. I recall that we've seen this affect AL2023 too, but I am less certain of the details for AL2023. For AL2, we previously opened support case […]. For both AL2 and Bottlerocket, the manifestation has been similar:
In most cases the node does not recover and must be terminated. In a few cases the node does recover (in the sense that it is marked Ready and becomes responsive again), but when this happens the node is often broken in some variable and hard-to-discover/hard-to-diagnose way - for example, kubelet and the container runtime will lose sync, a subset of network connections will fail, or containerized application processes will be in some unexpected state. My theory is that the OOMKiller is essentially not kicking in fast enough when there's memory pressure, and that the extreme reclaim of cache memory leads to cache misses that force lots of disk IO to hit EBS. The EBS IOPS limit is hit, and then the system stalls on iowait. Eventually the condition improves (if/when the OOMKiller kicks in), but the extended stall and potentially subsequent system OOMKills break kubelet and other services, leading to node failure. Here are some metric visualizations of the problem, this one for a Bottlerocket instance. The AL2 instance visualizations look similar, although they only have a single EBS volume. Let me know if there are other metrics I can provide or experiments I can run to help get to the bottom of this issue. It'd be such a relief to not have to deal with the consequences of this recurring issue. |
Chiming in to say I've also experienced this same thing. Bottlerocket with EKS (1.28) and Karpenter (latest). The node starts to experience memory pressure, then enters a NotReady state and stays there indefinitely until the underlying EC2 instance is terminated. I've been able to repro with test pods running […] |
I am facing the same issue here. The setup I have is: EC2 instance type t3.medium. The problem workload is Prometheus - so this is not an issue with only JVM. The behavior is very similar to what @JacobHenner has described. In my case, if I give "enough" time (many hours) to the node, it comes back up without the need for a reboot. |
Hi there, for information and context: have you tried the above suggestion of adjusting the resource specifications in your Kubernetes pods? On a t3.medium you will need to restrict memory to ideally below the 4GB the instance has, with something like the limits shown below:
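A minimal sketch of the kind of limits block meant here (the numbers are illustrative; the point is to leave headroom below the instance's 4GB for the kubelet and the OS):

```yaml
resources:
  requests:
    memory: 2Gi
  limits:
    memory: 3Gi
```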
You can play around with different values here. By doing this you can have Kubernetes restrict memory usage to help reduce the impact of this issue. We are still trying to look into the exact cause of this issue to see what we can do on our side to help address it. |
Unfortunately it's not this simple for us. Our clusters host a variety of workloads, and these workloads are managed by a variety of teams. Application owners are told to set their memory requests and limits appropriately, but for a range of reasons this doesn't always happen. In the case where an application is exceeding its memory request and the node encounters memory pressure, we'd expect that application to be OOMKilled or node-pressure evicted because it is misconfigured. We would not expect other properly-configured applications to suffer because memory pressure has caused the entire node to lock up. So we wonder - why aren't pods being killed quickly enough under node memory pressure conditions to prevent this issue from occurring? How can we rely on the Kubernetes requests/limits model if the potential consequences of exceeding a request aren't limited to the offending container, but extend to the entire node the container is running on? Team multitenancy seems untenable under these conditions. |
Thanks @jmt-lab for the response. Even though configuring limits reduces the number of crashes, they still happen as our workloads grow in number and capacity. I will keep following this thread for the real fix. Let me know if you need any more context that might help your team with the fix. All the best! |
@JacobHenner @geddah @James-Quigley thank you all for the additional data as we work through a root cause for this issue. One commonality between your problematic variants (and those of previous commenters on this issue) is that the version of containerd in the variants reported here has been susceptible to containerd/containerd#10589. The implication of this issue, as I understand it, is that OOM-killed containers may still appear to containerd as running; this seems to be the inverse of our problem since processes in this bug report are getting OOM-killed, but the fact that this is a bug with the OOM killer makes me suspicious nonetheless. Bottlerocket v1.23.0 bumped containerd to v1.7.22 with a patch for this bug via bottlerocket-core-kit v2.5.0, which includes bottlerocket-os/bottlerocket-core-kit#143. Would you be willing to upgrade your Bottlerocket nodes to v1.23.0+ and report back with any additional findings? |
Same experience of EC2 unresponsiveness. Config: […]
I never got the issue on my production environment - same config, except the EC2 types are […]. For now my current test is to move my EBS config to gp3 for the 3000 IOPS. I will see if I continue to see the problem. |
Repro
I was able to reproduce the behavior by running a custom program ([…]). I ran a process to monitor the memory usage:

```sh
while true; do free -m | grep Mem | while read -r _ total used _; do percent=$((used * 100 / total)); printf "Memory Usage: %s / %s (%d%%)\n" "${used}M" "${total}M" "$percent"; done; sleep 1; done
```

The node became unresponsive after the memory consumed hit 100%, and it never came back.
Attempt for mitigation
I tried deploying a userspace oom-killer as a DaemonSet for my cluster. It simply triggers the oom-killer manually by […]. After that, I ran the […]. I recommend giving this lightweight oom-killer a try to see if that would help with the issue. The oom-killer would try to kill processes based on the […]
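For context, a tool like this typically works by asking the kernel to run the OOM killer on demand through the SysRq interface. A rough sketch of the idea (an assumption about the mechanism, not the exact DaemonSet used above):

```sh
# Requires the sysrq interface to allow 'f' (see kernel.sysrq).
THRESHOLD_MB=500
while true; do
  avail=$(free -m | awk '/^Mem:/ {print $7}')   # the "available" column
  if [ "$avail" -lt "$THRESHOLD_MB" ]; then
    echo f > /proc/sysrq-trigger                # ask the kernel to invoke the OOM killer once
  fi
  sleep 5
done
```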
|
@ytsssun for testing purposes the "lightweight oom-killer" might be OK, but it has some notable defects:
I don't think this makes sense as the basis for a production solution. I'd look at running something like nohang or earlyoom in a daemonset instead. |
@bcressey is right, this simple oom-killer is for testing purposes only. I am evaluating […]
Sorry for the delay. I finally got the chance to try the […]. Here is my setup. My full steps to verify the […]:
Steps:
1. Run the memory-intensive pods:
See the pod specs - https://gist.github.com/ytsssun/f5d8d4a6d4926588914e755577452c88. It tries to launch a number of pods that eat up the host memory. Tune the below env […].
The node will go OOM and become unresponsive.
Node stuck at […]:
2. On a fresh node, run the nohang setup and verify that the nohang DaemonSet is running on the target node:
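A hedged example of how that check might look (the namespace, label, and node name are placeholders rather than values from the setup above):

```sh
kubectl get pods -n kube-system -l app=nohang \
  --field-selector spec.nodeName=<target-node> -o wide
```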
Now run the same workload as mentioned in step 1. The node stayed responsive even when approaching the memory limits. You can see the memory usage fluctuate because the […]
Logs in […]:
Feel free to give this a try. |
We too have been experiencing this issue. Details:
This initially started on memory-intensive jobs used for build pipelines. Outside of the required […]. The node would move to a NotReady status and the pods would get stuck in a Terminating status. Manual deletion of the node was required to resolve it (I think the longest we kept a node around before cleaning up was a day or so). As noted by others, we resolved this by setting a memory limit such that the job was killed prior to exhausting all the node's memory resources. This got us resolution for a while, until we started to see this crop up in other areas not related to nodes/pods specifically used for builds. The first such case was our Prometheus server. Prometheus memory consumption grows as services and nodes increase. One of our clusters scales quite heavily, and this scale-up in nodes caused the memory consumption of Prometheus to spike, which caused the node to experience memory pressure and die. EC2 metrics: you can see that all network stats drop to 0 at the time the node goes offline. CPU seems to stay flat, although we have also seen cases where the CPU shows a steady 100%. Prometheus metrics for this node stop reporting at the time of failure. I have attempted to reproduce (not as thoroughly as others on this thread have) by creating a simple Go process that requests large chunks of memory. What I see happen is what I would expect to happen: when we consume 100% (or close to that) of the node memory, it temporarily goes offline, the OOM killer is invoked, the process is killed, and the node recovers. This recovery happens in the span of 10-20 seconds, sometimes less. One thing I did note, and I may just be interpreting this incorrectly, is that I tried to trigger a hard eviction threshold by consuming just enough memory to have […]
This stat line is a bit odd to me, as free is usually less than available, since available is what can be made available to processes by evicting cache. So how that can be greater than free - which is memory that is currently not used at all - doesn't make sense. That being said, the box in the above case has < 100Mi of available memory. I would expect that to trigger a NodeMemoryPressure state and the kubelet to kick in and try to free memory. But I don't see that at all. The node just continues to function as is, with no state change. It isn't until I consume more memory that the OOM killer eventually kicks off. I have been a user of Bottlerocket for years and I do not recall running into this situation previously. It seems to have only surfaced within the last 9 months or so. |
We're seeing this on nodes running Kafka, which have intermittent high network spikes that seem to break the kubelet connection. My observation is that this seems to be happening on smaller nodes, e.g. t3.small, where I'm guessing there's some kind of network saturation happening which causes some aspect of networking to crash and never recover. My fix for now is to limit which nodes these workloads are scheduled on, favouring those with higher network capacity to hopefully avoid this issue. |
Hey folks,
Coming to you with an odd issue with Bottlerocket EKS nodes becoming unresponsive and unreachable, but still happily running as seen by EC2.
Image I'm using: ami-089e696e7c541c61b (amazon/bottlerocket-aws-k8s-1.29-aarch64-v1.20.2-536d69d0)
We use Karpenter as a node provisioner.
What I expected to happen:
Node runs smoothly.
What actually happened:
A large node (m7g.8xlarge) is running a memory-heavy JVM workload from a StatefulSet that takes up most of the machine (request=115Gi out of 128Gi).
The node runs happily for a while (~24h to several days), until it suddenly stops reporting status to Kubernetes:
Kubernetes marks that node as NotReady and taints it with node.kubernetes.io/unreachable, then tries to delete the pod running our large workload. That doesn't work (the node is fully unresponsive), so that pod is stuck in status Terminating.
The node is unreachable via kubectl debug node ... (which times out), or via AWS SSM (which complains of "SSM Agent is not online", "Ping status: Connection lost"). The EC2 Serial Console is empty (all black). We do not have SSH enabled on these machines.
However, the node still appears as running from the EC2 console, reachability metrics are green, and we can see CPU / network metrics flowing in.
I've managed to get system logs by taking an EBS snapshot of the Bottlerocket data volume and restoring it to a separate volume for investigation. This was not helpful unfortunately: logs appear normal until the time the node dies, then suddenly stop. There is no indication that anything in particular (kubelet, containerd, etc) crashed and brought the instance down, but also, suspiciously, no logs at all from the moment the node went unresponsive.
How to reproduce the problem:
No clear way to reproduce unfortunately; we've seen this happen sporadically on maybe half a dozen instances over the past few weeks, out of several hundred
This issue is very annoying: I don't mind having pods crashing and/or getting evicted sometimes, or even kubelet/containerd crashing, but I'd expect it to self-heal eventually. This causes our workloads to get stuck, and we have to manually delete the pod and/or the node to get it back to normal. But even worse, I can't see a way to debug it properly or get to the bottom of it.
Would you have any idea of a better way to debug this?
Thank you!
(Note: I also opened an AWS support ticket to see if there's any AWS-level issue at play here, but this seems to happen only on this specific workload on Bottlerocket nodes, so I suspect something is off here)