Node is "NotReady" and waiting at "Terminating" for hours #1573

Open
ibalat opened this issue Aug 14, 2024 · 24 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@ibalat

ibalat commented Aug 14, 2024

Description

Observed Behavior:

  • The node is in "NotReady" status, but the EC2 instance still exists in the AWS EC2 instance list with status "Running" and status checks "Passed"
  • The node status reason is "NodeStatusUnknown", with the message "Kubelet stopped posting node status"
  • Pods are stuck in "Terminating"
  • Karpenter has logs related to the pods stuck in "Terminating", like:

{"level":"INFO","time":"2024-08-14T12:13:23.794Z","logger":"controller","message":"pod xxxx has a preferred Anti-Affinity which can prevent consolidation","commit":"490ef94","controller":"provisioner"}

  • the related EC2 instance's last logged messages were:
[  423.353932] [  21815]  1001 21815  1314351    45882   770048        0          1000 java
[  423.361183] [  22145] 65532 22145   475493    12653   364544        0          1000 controller
[  423.368709] [  22199]  1001 22199   914462    84514   987136        0          1000 java
[  423.376073] [  33276]     0 33276   295992      601   188416        0          -998 runc
[  423.383344] [  33288]     0 33288     3094       12    45056        0          -998 exe
[  423.390531] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod1344160e_dca0_4e9d_be15_ea0b63efb5b2.slice/cri-containerd-496edffa072b6d7835989a0dfbce3c30711a32903c757baf4fcd460c9479f3a8.scope,task=java,pid=22199,uid=1001
[  423.412634] Out of memory: Killed process 22199 (java) total-vm:3657848kB, anon-rss:338056kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:964kB oom_score_adj:1000
[  425.563371] oom_reaper: reaped process 22199 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        2024-08-14T13:38:15+00:00

[screenshots]

Expected Behavior:

  • Karpenter should remove the node if it is NotReady and provision a new node

Reproduction Steps (Please include YAML):
I don't have any idea; it occurs periodically.

Versions:

  • Chart Version: 0.37.0
  • Kubernetes Version (kubectl version): 1.30
@ibalat ibalat added the kind/bug Categorizes issue or PR as related to a bug. label Aug 14, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Aug 14, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ibalat
Author

ibalat commented Aug 14, 2024

OMG, 6 hours later the pods are still in "Terminating" status and the node is still "NotReady".

[screenshot]

BTW, the instance is an m5.large. And I got new instance console (stdout) logs:

[ 8080.945657] xfs filesystem being remounted at /var/lib/kubelet/pods/92d74e36-0bbb-40bb-9d92-d9daa4994369/volume-subpaths/config/mysql/1 supports timestamps until 2038 (0x7fffffff)
[ 8080.956982] xfs filesystem being remounted at /var/lib/kubelet/pods/92d74e36-0bbb-40bb-9d92-d9daa4994369/volume-subpaths/template-sql/mysql/2 supports timestamps until 2038 (0x7fffffff)
[ 8080.970168] xfs filesystem being remounted at /var/lib/kubelet/pods/92d74e36-0bbb-40bb-9d92-d9daa4994369/volume-subpaths/template-sql/mysql/3 supports timestamps until 2038 (0x7fffffff)
[ 8080.981712] xfs filesystem being remounted at /var/lib/kubelet/pods/92d74e36-0bbb-40bb-9d92-d9daa4994369/volume-subpaths/etl-sql/mysql/4 supports timestamps until 2038 (0x7fffffff)
[ 8080.993163] xfs filesystem being remounted at /var/lib/kubelet/pods/92d74e36-0bbb-40bb-9d92-d9daa4994369/volume-subpaths/prefera-sql/mysql/5 supports timestamps until 2038 (0x7fffffff)
[ 8112.949302] pci 0000:00:1d.0: [1d0f:8061] type 00 class 0x010802
[ 8112.952794] pci 0000:00:1d.0: reg 0x10: [mem 0x00000000-0x00003fff]
[ 8112.956559] pci 0000:00:1d.0: enabling Extended Tags
[ 8112.960301] pci 0000:00:1d.0: BAR 0: assigned [mem 0xc0114000-0xc0117fff]
[ 8112.964132] nvme nvme3: pci function 0000:00:1d.0
[ 8112.967238] nvme 0000:00:1d.0: enabling device (0000 -> 0002)
[ 8112.972352] PCI Interrupt Link [LNKA] enabled at IRQ 11
[ 8112.980317] nvme nvme3: 2/0/0 default/read/poll queues
[ 8113.229053] pci 0000:00:1c.0: [1d0f:8061] type 00 class 0x010802
[ 8113.232693] pci 0000:00:1c.0: reg 0x10: [mem 0x00000000-0x00003fff]
[ 8113.236424] pci 0000:00:1c.0: enabling Extended Tags
[ 8113.240326] pci 0000:00:1c.0: BAR 0: assigned [mem 0xc0118000-0xc011bfff]
[ 8113.244141] nvme nvme4: pci function 0000:00:1c.0
[ 8113.247190] nvme 0000:00:1c.0: enabling device (0000 -> 0002)
[ 8113.256918] nvme nvme4: 2/0/0 default/read/poll queues
[ 8113.573770] EXT4-fs (nvme3n1): mounted filesystem with ordered data mode. Opts: (null)
[ 8114.159309] IPv6: ADDRCONF(NETDEV_CHANGE): enia89b8c83c9a: link becomes ready
[ 8114.163261] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 8114.319775] xfs filesystem being remounted at /var/lib/kubelet/pods/177f12fc-a42d-464f-bdf0-ad1f53080f8b/volume-subpaths/scripts/kafka/2 supports timestamps until 2038 (0x7fffffff)
[ 8114.734723] EXT4-fs (nvme4n1): mounted filesystem with ordered data mode. Opts: (null)
[ 8114.972359] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 8115.074504] xfs filesystem being remounted at /var/lib/kubelet/pods/ad839521-93a0-4010-8b3f-0980d2375063/volume-subpaths/scripts/kafka/2 supports timestamps until 2038 (0x7fffffff)
[ 8119.176023] pci 0000:00:1b.0: [1d0f:8061] type 00 class 0x010802
[ 8119.179548] pci 0000:00:1b.0: reg 0x10: [mem 0x00000000-0x00003fff]
[ 8119.183285] pci 0000:00:1b.0: enabling Extended Tags
[ 8119.187010] pci 0000:00:1b.0: BAR 0: assigned [mem 0xc011c000-0xc011ffff]
[ 8119.190838] nvme nvme5: pci function 0000:00:1b.0
[ 8119.193879] nvme 0000:00:1b.0: enabling device (0000 -> 0002)
[ 8119.203356] nvme nvme5: 2/0/0 default/read/poll queues
[ 8120.146390] EXT4-fs (nvme5n1): mounted filesystem with ordered data mode. Opts: (null)
[ 8120.658980] IPv6: ADDRCONF(NETDEV_CHANGE): eni61bfec53e4d: link becomes ready
[ 8120.662926] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 8121.038972] xfs filesystem being remounted at /var/lib/kubelet/pods/9d3cd637-f153-4200-a302-04b9e60a273c/volume-subpaths/scripts/kafka/2 supports timestamps until 2038 (0x7fffffff)
[ 8299.855030] systemd-journald[537510]: File /var/log/journal/ec23eae178c2480d1224169d16678fc2/system.journal corrupted or uncleanly shut down, renaming and replacing.

@sftim

sftim commented Aug 15, 2024

If you're willing to try Karpenter 1.0 (newly released), you might see better behavior or diagnostics. I'd give it a go, honestly.

@ibalat
Author

ibalat commented Aug 15, 2024

@sftim thanks for the suggestion, I'll try it. But why doesn't Karpenter or K8s intervene in this situation? 18h have passed and they are still stuck at NotReady and Terminating. Is there any parameter to force-terminate NotReady nodes? The ttlAfterNotRegistered parameter is deprecated, and my consolidateAfter: 5m config does not cover this situation :/

[screenshot]

@jigisha620
Contributor

Hi @ibalat,
From the information that you have shared, it seems like the node registered but never got initialized. Karpenter handles registration failures by waiting 15 minutes to check whether the node registers; if it doesn't, we go ahead and delete the nodeClaim. But we still have an open issue for nodes that Karpenter never initializes at all, which should be captured by #750, where we are hoping to start by introducing a static TTL for initialization to kill off nodes that never go Ready on startup. Can you describe the nodeClaim for this node and share it? Can you also share the logs from the time this happened so that we can confirm that's the issue?

@ibalat
Author

ibalat commented Aug 15, 2024

Hi @jigisha620, actually, the nodes had initialized: they become "Ready", pods get scheduled onto them, and only after a while (~30-60 mins later) does the node go "NotReady". So they work properly for a while. I tried upgrading to v1.0.0, but the same problem still occurs. I am sharing my nodeclass, nodepool and nodeclaim configs. BTW, do you know why the pods are still stuck in "Terminating" status? Can K8s or Karpenter force delete them after a while? Is there any config for that?

Also, I found some new events that may be related to this issue; their repeat count is very high:

[screenshot]
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: main
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: "KarpenterNodeRole"
  subnetSelectorTerms:
    %{~ for subnet in eks_dev_v1_subnet_ids ~}
    - id: "${subnet}"
    %{~ endfor ~}
  securityGroupSelectorTerms:
    - name: "*dev-v1-node*"
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: main-green
spec:
  template:
    metadata:
      labels:
        node-group-name: main-green
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: main
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: [ "r5", "m5", "c6i" ]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      terminationGracePeriod: 5m
      expireAfter: 720h # 30 * 24h = 720h | periodically recycle nodes due to security concerns
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  annotations:
    karpenter.k8s.aws/ec2nodeclass-hash: "17843341971500854913"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
  creationTimestamp: "2024-08-15T10:59:10Z"
  finalizers:
  - karpenter.k8s.aws/termination
  generation: 1
  name: main
  resourceVersion: "525655958"
  uid: 742b9052-735a-4078-b2d3-bbfe0cf883e3
spec:
  amiSelectorTerms:
  - alias: al2023@latest
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 1
    httpTokens: required
  role: KarpenterNodeRole
  securityGroupSelectorTerms:
  - name: '*dev-v1-node*'
  subnetSelectorTerms:
  - id: subnet-xx
  - id: subnet-xx
  - id: subnet-xx
status:
  amis:
  - id: ami-0d43f736643876936
    name: amazon-eks-node-al2023-arm64-standard-1.30-v20240807
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - arm64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: DoesNotExist
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: DoesNotExist
  - id: ami-0d694ee9037e1f937
    name: amazon-eks-node-al2023-x86_64-standard-1.30-v20240807
    requirements:
    - key: kubernetes.io/arch
      operator: In
      values:
      - amd64
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: DoesNotExist
    - key: karpenter.k8s.aws/instance-accelerator-count
      operator: DoesNotExist
  conditions:
  - lastTransitionTime: "2024-08-15T10:59:11Z"
    message: ""
    reason: AMIsReady
    status: "True"
    type: AMIsReady
  - lastTransitionTime: "2024-08-15T10:59:11Z"
    message: ""
    reason: InstanceProfileReady
    status: "True"
    type: InstanceProfileReady
  - lastTransitionTime: "2024-08-15T10:59:11Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-08-15T10:59:11Z"
    message: ""
    reason: SecurityGroupsReady
    status: "True"
    type: SecurityGroupsReady
  - lastTransitionTime: "2024-08-15T10:59:11Z"
    message: ""
    reason: SubnetsReady
    status: "True"
    type: SubnetsReady
  instanceProfile: dev-v1_xx
  securityGroups:
  - id: sg-xx
    name: dev-v1-xx
  - id: sg-xx
    name: dev-v1-xx
  subnets:
  - id: subnet-xx
    zone: eu-west-1c
    zoneID: euw1-az2
  - id: subnet-xx
    zone: eu-west-1a
    zoneID: euw1-az3
  - id: subnet-xx
    zone: eu-west-1b
    zoneID: euw1-az1
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "14203437024067510703"
    karpenter.sh/nodepool-hash-version: v3
  creationTimestamp: "2024-08-15T10:55:03Z"
  generation: 1
  name: main-green
  resourceVersion: "525888522"
  uid: 5866c52d-bb13-479f-b034-822128ebc8f1
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 5m
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: 1000
  template:
    metadata:
      labels:
        node-group-name: main-green
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: main
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - c
        - m
        - r
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values:
        - r5
        - m5
        - c6i
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values:
        - "2"
      terminationGracePeriod: 5m
status:
  conditions:
  - lastTransitionTime: "2024-08-15T10:59:11Z"
    message: ""
    reason: NodeClassReady
    status: "True"
    type: NodeClassReady
  - lastTransitionTime: "2024-08-15T10:59:11Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-08-15T10:55:03Z"
    message: ""
    reason: ValidationSucceeded
    status: "True"
    type: ValidationSucceeded
  resources:
    cpu: "294"
    ephemeral-storage: 417873520Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 695806732Ki
    nodes: "20"
    pods: "2425"
---
apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
  annotations:
    compatibility.karpenter.k8s.aws/cluster-name-tagged: "true"
    compatibility.karpenter.k8s.aws/kubelet-drift-hash: "15379597991425564585"
    karpenter.k8s.aws/ec2nodeclass-hash: "17843341971500854913"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
    karpenter.k8s.aws/tagged: "true"
    karpenter.sh/nodepool-hash: "14203437024067510703"
    karpenter.sh/nodepool-hash-version: v3
  creationTimestamp: "2024-08-15T12:05:33Z"
  finalizers:
  - karpenter.sh/termination
  generateName: main-green-
  generation: 1
  labels:
    karpenter.k8s.aws/instance-category: c
    karpenter.k8s.aws/instance-cpu: "32"
    karpenter.k8s.aws/instance-cpu-manufacturer: intel
    karpenter.k8s.aws/instance-ebs-bandwidth: "10000"
    karpenter.k8s.aws/instance-encryption-in-transit-supported: "true"
    karpenter.k8s.aws/instance-family: c6i
    karpenter.k8s.aws/instance-generation: "6"
    karpenter.k8s.aws/instance-hypervisor: nitro
    karpenter.k8s.aws/instance-memory: "65536"
    karpenter.k8s.aws/instance-network-bandwidth: "12500"
    karpenter.k8s.aws/instance-size: 8xlarge
    karpenter.sh/capacity-type: spot
    karpenter.sh/nodepool: main-green
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    node-group-name: main-green
    node.kubernetes.io/instance-type: c6i.8xlarge
    topology.k8s.aws/zone-id: euw1-az1
    topology.kubernetes.io/region: eu-west-1
    topology.kubernetes.io/zone: eu-west-1b
  name: main-green-7rncx
  ownerReferences:
  - apiVersion: karpenter.sh/v1
    blockOwnerDeletion: true
    kind: NodePool
    name: main-green
    uid: 5866c52d-bb13-479f-b034-822128ebc8f1
  resourceVersion: "525859504"
  uid: bd1aea84-18be-4d42-9c17-3936137c89a5
spec:
  expireAfter: 720h
  nodeClassRef:
    group: karpenter.k8s.aws
    kind: EC2NodeClass
    name: main
  requirements:
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - c6i.12xlarge
    - c6i.16xlarge
    - c6i.24xlarge
    - c6i.32xlarge
    - c6i.8xlarge
    - c6i.metal
    - m5.12xlarge
    - m5.16xlarge
    - m5.24xlarge
    - m5.4xlarge
    - m5.8xlarge
    - m5.metal
    - r5.12xlarge
    - r5.16xlarge
    - r5.24xlarge
    - r5.4xlarge
    - r5.8xlarge
    - r5.metal
  - key: node-group-name
    operator: In
    values:
    - main-green
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values:
    - "2"
  - key: karpenter.sh/nodepool
    operator: In
    values:
    - main-green
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values:
    - c
    - m
    - r
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values:
    - c6i
    - m5
    - r5
  resources:
    requests:
      cpu: 4280m
      memory: 36152Mi
      pods: "67"
  terminationGracePeriod: 5m0s
status:
  allocatable:
    cpu: 31850m
    ephemeral-storage: 17Gi
    memory: 57691Mi
    pods: "234"
    vpc.amazonaws.com/pod-eni: "84"
  capacity:
    cpu: "32"
    ephemeral-storage: 20Gi
    memory: 60620Mi
    pods: "234"
    vpc.amazonaws.com/pod-eni: "84"
  conditions:
  - lastTransitionTime: "2024-08-15T12:15:35Z"
    message: ""
    reason: ConsistentStateFound
    status: "True"
    type: ConsistentStateFound
  - lastTransitionTime: "2024-08-15T15:46:53Z"
    message: ""
    reason: Consolidatable
    status: "True"
    type: Consolidatable
  - lastTransitionTime: "2024-08-15T12:06:14Z"
    message: ""
    reason: Initialized
    status: "True"
    type: Initialized
  - lastTransitionTime: "2024-08-15T12:05:35Z"
    message: ""
    reason: Launched
    status: "True"
    type: Launched
  - lastTransitionTime: "2024-08-15T12:06:14Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-08-15T12:06:04Z"
    message: ""
    reason: Registered
    status: "True"
    type: Registered
  imageID: ami-0d694ee9037e1f937
  lastPodEventTime: "2024-08-15T15:41:53Z"
  nodeName: ip-10-xx-xx-xx.eu-west-1.compute.internal
  providerID: aws:///eu-west-1b/i-xxxxxx

@jigisha620
Contributor

I think that the snippet you have shared with "No allowed disruptions for disruption reason" is not the problem here. The nodes that you have were already in the NotReady state, so they will not be considered for allowed disruptions. Can you share the Karpenter controller logs from the same time?

@ibalat
Author

ibalat commented Aug 16, 2024

Sure. Between 05:58:24 and 06:09:12, 3 nodes became NotReady and I watched them live, but there is no related log :( You can see all the logs between these times:

{"level":"INFO","time":"2024-08-16T05:58:24.287Z","logger":"controller","message":"created nodeclaim",
{"level":"INFO","time":"2024-08-16T05:58:26.268Z","logger":"controller","message":"launched nodeclaim",
{"level":"INFO","time":"2024-08-16T05:58:54.219Z","logger":"controller","message":"pod(s) have a preferred Anti-Affinity which can prevent consolidation",
{"level":"INFO","time":"2024-08-16T05:58:54.360Z","logger":"controller","message":"found provisionable pod(s)",
{"level":"INFO","time":"2024-08-16T05:58:54.360Z","logger":"controller","message":"computed new nodeclaim(s) to fit pod(s)",
{"level":"INFO","time":"2024-08-16T05:58:54.360Z","logger":"controller","message":"computed 1 unready node(s) will fit 1 pod(s)",
{"level":"INFO","time":"2024-08-16T05:58:54.376Z","logger":"controller","message":"created nodeclaim",
{"level":"INFO","time":"2024-08-16T05:58:56.599Z","logger":"controller","message":"deleted node",
{"level":"INFO","time":"2024-08-16T05:58:56.870Z","logger":"controller","message":"launched nodeclaim",
{"level":"INFO","time":"2024-08-16T05:58:56.902Z","logger":"controller","message":"deleted nodeclaim",
{"level":"INFO","time":"2024-08-16T05:59:19.838Z","logger":"controller","message":"registered nodeclaim",
{"level":"INFO","time":"2024-08-16T05:59:20.169Z","logger":"controller","message":"registered nodeclaim",
{"level":"INFO","time":"2024-08-16T05:59:24.803Z","logger":"controller","message":"pod(s) have a preferred Anti-Affinity which can prevent consolidation",
{"level":"INFO","time":"2024-08-16T05:59:37.493Z","logger":"controller","message":"initialized nodeclaim",
{"level":"INFO","time":"2024-08-16T05:59:38.378Z","logger":"controller","message":"initialized nodeclaim",
{"level":"INFO","time":"2024-08-16T05:59:49.497Z","logger":"controller","message":"deleted node",
{"level":"INFO","time":"2024-08-16T05:59:49.706Z","logger":"controller","message":"deleted nodeclaim",
{"level":"INFO","time":"2024-08-16T06:08:45.766Z","logger":"controller","message":"found provisionable pod(s)",
{"level":"INFO","time":"2024-08-16T06:08:45.766Z","logger":"controller","message":"computed new nodeclaim(s) to fit pod(s)",
{"level":"INFO","time":"2024-08-16T06:08:45.777Z","logger":"controller","message":"created nodeclaim",
{"level":"INFO","time":"2024-08-16T06:08:48.176Z","logger":"controller","message":"launched nodeclaim",
{"level":"INFO","time":"2024-08-16T06:09:12.703Z","logger":"controller","message":"registered nodeclaim",

@ibalat
Author

ibalat commented Aug 16, 2024

New update: the node that cannot be deleted (even though terminationGracePeriod: 5m has long since passed) shows some events; maybe they can help:

[screenshot]

The node's nodeclaim has the events below:

[screenshot]

The pods on the node are stuck in the "Terminating" state and don't show any events or logs in describe.

After I deleted the nodeclaim manually, the node was deleted (but well past the grace period).

@jigisha620
Contributor

TerminationGracePeriod would not work if delete has not been called against the nodeClaim. In your case the node went to the NotReady state but nothing initiated its deletion. I was able to reproduce something similar on my end, where my node becomes NotReady because the kubelet stopped posting node status. However, the pods got rescheduled onto a different node. That makes me wonder whether the pods you are running have some pre-stop hook that's preventing them from terminating?
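
For anyone checking their own workloads, this is the kind of pod-level setting worth ruling out. A minimal, hypothetical sketch (name and image are placeholders, not taken from this issue) of a preStop hook plus a long terminationGracePeriodSeconds, either of which keeps a pod in "Terminating" far longer than expected; and when the kubelet is unreachable, the deletion cannot be confirmed at all until the Node object is removed or the pod is force-deleted:

apiVersion: v1
kind: Pod
metadata:
  name: example-app                      # hypothetical name, for illustration only
spec:
  terminationGracePeriodSeconds: 3600    # an unusually long grace period delays a forced kill
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:1.36   # placeholder image
      command: ["sh", "-c", "sleep 1000000"]
      lifecycle:
        preStop:
          exec:
            # a slow preStop hook keeps the pod in "Terminating" until it finishes
            # or the grace period expires
            command: ["sh", "-c", "sleep 3600"]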

@ibalat
Author

ibalat commented Aug 19, 2024

No prestop hook, finalizer or anything else. They are just stuck, as in the screenshots.

@jigisha620
Contributor

This is not necessarily an issue with Karpenter. To investigate further, we will have to take a look at the kubelet logs to know why the pods remained stuck at Terminating. Since you are using an EKS AMI, you can run a script on your worker node at /etc/eks called log-collector-script, which would help us get the kubelet logs. If you have AWS premium support you can open a ticket to investigate those logs, or you can send them over and I can try looking into them.

@ibalat
Author

ibalat commented Aug 28, 2024

When it happens, I can't log in to the EC2 instance; it doesn't respond. But I could get the stdout (console output); it's below.

[  423.390531] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod1344160e_dca0_4e9d_be15_ea0b63efb5b2.slice/cri-containerd-496edffa072b6d7835989a0dfbce3c30711a32903c757baf4fcd460c9479f3a8.scope,task=java,pid=22199,uid=1001
[  423.412634] Out of memory: Killed process 22199 (java) total-vm:3657848kB, anon-rss:338056kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:964kB oom_score_adj:1000
[  425.563371] oom_reaper: reaped process 22199 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        2024-08-14T13:38:15+00:00

@suraj2410

We see this far too often as well.

@JacobHenner

@ibalat @suraj2410

What do the disk IOPS, disk idle time, and memory metrics look like for the affected hosts? Could this be the problem described in bottlerocket-os/bottlerocket#4075 (comment)? (applicable to Bottlerocket, but also observed with AL2).

@ibalat
Author

ibalat commented Sep 1, 2024

I had removed Karpenter and reinstalled Cluster Autoscaler, but I can test it again this week. After the test, I will share the results with you.

@dcherniv

This is a common thing with most Kubernetes providers/autoscalers. For some reason the general position of k8s (as a whole) is to not touch nodes that are stuck like this. It is essentially a philosophical dilemma:
"Do we want to keep the stuck nodes for troubleshooting, or do we want to force-terminate them?"
And there are good arguments for both points. In my humble opinion, nodes that stop posting status for longer than a certain threshold should be force-terminated.
If you are running Kubernetes at scale and your apps and nodes are properly HA, you don't really care what happens to any given node. That was the k8s promise after all: cattle, not pets.
I, personally, have no interest in troubleshooting solar flares, flipped memory bits, and the reasons why an OOMKill or kernel.pid_max exhaustion sends the kubelet into a weird state, provided my other nodes are healthy.
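
For context, stock Kubernetes already applies a threshold like this at the pod level: the DefaultTolerationSeconds admission plugin injects tolerations similar to the snippet below into pods, so taint-based eviction marks pods on a NotReady/unreachable node for deletion after roughly 5 minutes. The catch is that, with the kubelet gone, nothing can confirm that deletion, which is why the pods here sit in "Terminating" until the Node (or NodeClaim) object itself is removed. The values shown are the well-known defaults, not taken from this cluster:

tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300   # evict ~5 minutes after the node goes NotReady
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300   # evict ~5 minutes after the node becomes unreachable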

@paalkr

paalkr commented Dec 2, 2024

We also experience this quite frequently, but only with nodes that Karpenter has scheduled for disruption. If I totally disallow Karpenter from replacing nodes by setting the node budget to 0, the issue does not occur at all.

apiVersion: karpenter.sh/v1
kind: NodePool
...
spec:
  disruption:
    budgets:
    - nodes: "0"
...

If I let Karpenter disrupt nodes, then we see the issue reappearing very frequently.

Karpenter version 1.0.2
EKS version 1.29

@GnatorX

GnatorX commented Jan 21, 2025

@ibalat Have you attempted cutting a ticket with AWS to investigate how the node got partitioned from the control plane?

@engedaam
Contributor

/assign @garvinp-stripe

@k8s-ci-robot
Contributor

@engedaam: GitHub didn't allow me to assign the following users: garvinp-stripe.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @garvinp-stripe

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@engedaam
Contributor

/assign @GnatorX

@ibalat
Author

ibalat commented Jan 23, 2025

Have you attempted cutting a ticket with AWS to investigate how the node got partitioned from the control plane?

No, I had removed Karpenter because of these issues and couldn't try it again.

@GnatorX

GnatorX commented Jan 24, 2025

Is this issue only showing up with Karpenter?

What is weird to me is that this is the normal way autoscalers handle partitioned nodes, as mentioned by @dcherniv in #1573 (comment).

Official docs:
https://kubernetes.io/docs/concepts/architecture/nodes/#node-controller.

However, I wonder if you have something configured differently on your nodes between Karpenter and cluster-autoscaler for node termination:
https://karpenter.sh/docs/concepts/disruption/#termination-controller
Are you running anything important to the node's connectivity that isn't tolerating Karpenter's taints?
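
As a concrete thing to check: node-critical agents (CNI, kube-proxy, monitoring/log daemons) are usually deployed as DaemonSets that tolerate every taint, so they keep running while Karpenter taints and drains a node. A generic sketch under that assumption follows; the name and image are placeholders, and the blanket toleration is the common pattern rather than anything specific to this cluster:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-critical-agent            # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-critical-agent
  template:
    metadata:
      labels:
        app: node-critical-agent
    spec:
      tolerations:
        - operator: Exists             # tolerate all taints, including Karpenter's disruption taint
      containers:
        - name: agent
          image: public.ecr.aws/docker/library/busybox:1.36   # placeholder image
          command: ["sh", "-c", "sleep 1000000"]

If an agent like this only tolerates a narrow set of taints, it can be evicted or left unscheduled during disruption, which would line up with the node losing connectivity right around the time Karpenter acts on it.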
