Kubernetes Toil jobs can get apparently stuck with all or almost all of their memory used #2858

Closed
adamnovak opened this issue Nov 14, 2019 · 0 comments · Fixed by #2895

adamnovak commented Nov 14, 2019

Here's what k9s says about a particular pod (one of many):

NAME ↑                                                        READY      STATUS         RS      CPU       MEM          IP                NODE         QOS      AGE 
root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68-7qzng       1/1        Running        0       948m      32763Mi      10.244.5.126      k1.kube      GA       17h

And here are two of the affected pods described:

Name:               root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-330-szf2t
Namespace:          vg
Priority:           0
PriorityClassName:  <none>
Node:               k1.kube/172.31.53.234
Start Time:         Wed, 13 Nov 2019 16:58:50 -0800
Labels:             controller-uid=277bd9e1-9ca5-4c35-9449-9467dfee39de
                    job-name=root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-330
Annotations:        <none>
Status:             Running
IP:                 10.244.5.141
Controlled By:      Job/root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-330
Containers:
  runner-container:
    Container ID:  docker://5b70030a0e720e491cbce0d41d5cfdaaffdff0b544e60385bfd30b68f02aa0f7
    Image:         quay.io/adamnovak/toil:3.21.0a1-e98a7bde0788a2d6f4e55da71ad6e072f8f6a821
    Image ID:      docker-pullable://quay.io/adamnovak/toil@sha256:734c456697a7059f6945b2c7948692937b5e18ddf3b8caf84313aa733e7fed17
    Port:          <none>
    Host Port:     <none>
    Command:
      _toil_kubernetes_executor
      gAJ9cQAoVQtlbnZpcm9ubWVudHEBfXECKFUXVE9JTF9SVF9MT0dHSU5HX0FERFJFU1NxA1USMTAuMjQ0LjEuMTU2OjM4MTkxcQRVFVRPSUxfUlRfTE9HR0lOR19MRVZFTHEFVQRJTkZPcQZ1VQdjb21tYW5kcQdYdwAAAF90b2lsX3dvcmtlciBydW5fY29uc3RydWN0X3JlZ2lvbl9ncmFwaCBhd3M6dXMtd2VzdC0yOmFkYW1ub3Zhay1idWlsZGdpcmFmZmVncmFwaHMgM2M0ZjExNmItMmM2NC00NzdiLWE3ZjktMDk2NTBhOTQ5YjcwcQhVCnVzZXJTY3JpcHRxCWNkaWxsLmRpbGwKX2NyZWF0ZV9uYW1lZHR1cGxlCnEKVRJWaXJ0dWFsRW52UmVzb3VyY2VxCyhVBG5hbWVxDFUIcGF0aEhhc2hxDVUDdXJscQ5VC2NvbnRlbnRIYXNocQ90cRBVDXRvaWwucmVzb3VyY2VxEYdxElJxEyhVDXNpdGUtcGFja2FnZXNxFFUgNTcyMTJmOTAzYTM4ZDJlZTY5ZWMxYTA3MmVhOWVjNjdxFWNmdXR1cmUudHlwZXMubmV3c3RyCm5ld3N0cgpxFliWAAAAaHR0cHM6Ly9hZGFtbm92YWstYnVpbGRnaXJhZmZlZ3JhcGhzLS1maWxlcy5zMy11cy13ZXN0LTIuYW1hem9uYXdzLmNvbS8xOWU4OGM3YS1kNjkyLTVjNTktOTA0NS1iNjM0OTE3M2JmNjA/dmVyc2lvbklkPVJjRTFiZkY4Y1NYbXlGN2JWam1wNHVKTVFSV1pQUk1KcReFcRiBcRl9cRpiVSA5NzhiZTJhZWU2YzVmMDRjNzFkMzU2MDAwYTdjNmRhY3EbdHEcgXEddS4=
    State:          Running
      Started:      Wed, 13 Nov 2019 16:58:51 -0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                1
      ephemeral-storage:  34359738368
      memory:             34359738368
    Requests:
      cpu:                1
      ephemeral-storage:  34359738368
      memory:             34359738368
    Environment:          <none>
    Mounts:
      /root/.aws from s3-credentials (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-dq8zg (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  s3-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  shared-s3-credentials
    Optional:    false
  default-token-dq8zg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-dq8zg
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
Vyoma:vg anovak$ kubectl describe pod root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68-7qzng
Name:               root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68-7qzng
Namespace:          vg
Priority:           0
PriorityClassName:  <none>
Node:               k1.kube/172.31.53.234
Start Time:         Wed, 13 Nov 2019 16:55:58 -0800
Labels:             controller-uid=06c54793-c85d-4444-84f3-9edb08e02601
                    job-name=root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68
Annotations:        <none>
Status:             Running
IP:                 10.244.5.126
Controlled By:      Job/root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68
Containers:
  runner-container:
    Container ID:  docker://07ba5e432267a254fa35ee146ca403aa1faa32b8cd566e10be9d9c70cabdde38
    Image:         quay.io/adamnovak/toil:3.21.0a1-e98a7bde0788a2d6f4e55da71ad6e072f8f6a821
    Image ID:      docker-pullable://quay.io/adamnovak/toil@sha256:734c456697a7059f6945b2c7948692937b5e18ddf3b8caf84313aa733e7fed17
    Port:          <none>
    Host Port:     <none>
    Command:
      _toil_kubernetes_executor
      gAJ9cQAoVQtlbnZpcm9ubWVudHEBfXECKFUXVE9JTF9SVF9MT0dHSU5HX0FERFJFU1NxA1USMTAuMjQ0LjEuMTU2OjM4MTkxcQRVFVRPSUxfUlRfTE9HR0lOR19MRVZFTHEFVQRJTkZPcQZ1VQdjb21tYW5kcQdYdwAAAF90b2lsX3dvcmtlciBydW5fY29uc3RydWN0X3JlZ2lvbl9ncmFwaCBhd3M6dXMtd2VzdC0yOmFkYW1ub3Zhay1idWlsZGdpcmFmZmVncmFwaHMgNWM1NmVmZjctMmQ3YS00N2Q4LWFkOTMtZDU3M2IyNDJiMmNmcQhVCnVzZXJTY3JpcHRxCWNkaWxsLmRpbGwKX2NyZWF0ZV9uYW1lZHR1cGxlCnEKVRJWaXJ0dWFsRW52UmVzb3VyY2VxCyhVBG5hbWVxDFUIcGF0aEhhc2hxDVUDdXJscQ5VC2NvbnRlbnRIYXNocQ90cRBVDXRvaWwucmVzb3VyY2VxEYdxElJxEyhVDXNpdGUtcGFja2FnZXNxFFUgNTcyMTJmOTAzYTM4ZDJlZTY5ZWMxYTA3MmVhOWVjNjdxFWNmdXR1cmUudHlwZXMubmV3c3RyCm5ld3N0cgpxFliWAAAAaHR0cHM6Ly9hZGFtbm92YWstYnVpbGRnaXJhZmZlZ3JhcGhzLS1maWxlcy5zMy11cy13ZXN0LTIuYW1hem9uYXdzLmNvbS8xOWU4OGM3YS1kNjkyLTVjNTktOTA0NS1iNjM0OTE3M2JmNjA/dmVyc2lvbklkPVJjRTFiZkY4Y1NYbXlGN2JWam1wNHVKTVFSV1pQUk1KcReFcRiBcRl9cRpiVSA5NzhiZTJhZWU2YzVmMDRjNzFkMzU2MDAwYTdjNmRhY3EbdHEcgXEddS4=
    State:          Running
      Started:      Wed, 13 Nov 2019 16:55:59 -0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                1
      ephemeral-storage:  34359738368
      memory:             34359738368
    Requests:
      cpu:                1
      ephemeral-storage:  34359738368
      memory:             34359738368
    Environment:          <none>
    Mounts:
      /root/.aws from s3-credentials (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-dq8zg (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  s3-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  shared-s3-credentials
    Optional:    false
  default-token-dq8zg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-dq8zg
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

As far as I can tell the pod is stuck, and I suspect it is not doing any useful work. I can't exec anything in the pod:

$ kubectl exec -it root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-330-szf2t -- /bin/bash
OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "read init-p: connection reset by peer": unknown
command terminated with exit code 126

(Sometimes I get a broken pipe error instead.)

I think the inability to shell into the container is caused by opencontainers/runc#1914: the container is so close to its memory limit that there isn't enough headroom left to start the shell process. Apparently it isn't over the limit enough to get OOM-killed, but I can't imagine it is doing its work properly with memory usage pegged at the limit.
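
For what it's worth, here is a rough, untested sketch (not existing Toil code) of how the "pegged at the limit" state could be confirmed from outside the pod using the official Python kubernetes client, assuming metrics-server is running in the cluster; the namespace, names, and the 99% threshold are just illustrative:

```python
# Untested sketch: flag pods whose reported memory usage is at (or very near)
# their memory limit. Assumes metrics-server is installed; not existing Toil code.
from kubernetes import client, config

BINARY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def to_bytes(quantity: str) -> int:
    """Parse a Kubernetes memory quantity like '32763Mi' or '34359738368'."""
    for suffix, factor in BINARY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)

config.load_kube_config()
core = client.CoreV1Api()
metrics = client.CustomObjectsApi()

namespace = "vg"

# Current memory usage per pod, summed over containers, from metrics.k8s.io.
usage_by_pod = {
    item["metadata"]["name"]: sum(to_bytes(c["usage"]["memory"]) for c in item["containers"])
    for item in metrics.list_namespaced_custom_object(
        "metrics.k8s.io", "v1beta1", namespace, "pods")["items"]
}

for pod in core.list_namespaced_pod(namespace).items:
    # Declared memory limit per pod, summed over containers that set one.
    limit = sum(
        to_bytes(c.resources.limits["memory"])
        for c in pod.spec.containers
        if c.resources and c.resources.limits and "memory" in c.resources.limits
    )
    used = usage_by_pod.get(pod.metadata.name, 0)
    if limit and used >= 0.99 * limit:
        print(f"{pod.metadata.name}: {used}/{limit} bytes; probably wedged at its memory limit")
```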

Toil should either rig the pods to fail when the memory limit is hit like this (somehow; maybe with an exec-based liveness check?), or it should detect these pods as stuck, just as it does for pods stuck in ImagePullBackOff, and clean them up itself.
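
For the liveness-check idea, here is a minimal, untested sketch (again, not existing Toil code; V1Probe/V1ExecAction are real kubernetes-client types, but the probe command, timings, and resource numbers are placeholders) of attaching an exec-based liveness probe to the executor container:

```python
# Untested sketch of an exec-based liveness probe on the executor container.
# The probe command, timings, and resource numbers are illustrative only.
from kubernetes import client

liveness = client.V1Probe(
    _exec=client.V1ExecAction(command=["/bin/true"]),  # placeholder probe command
    initial_delay_seconds=60,   # give the worker time to start up
    period_seconds=60,
    timeout_seconds=10,
    failure_threshold=3,        # ~3 minutes wedged before the kubelet acts
)

container = client.V1Container(
    name="runner-container",
    image="quay.io/adamnovak/toil:3.21.0a1-e98a7bde0788a2d6f4e55da71ad6e072f8f6a821",
    command=["_toil_kubernetes_executor", "<pickled job info>"],  # as in the describe output above
    liveness_probe=liveness,
    resources=client.V1ResourceRequirements(
        requests={"cpu": "1", "memory": "34359738368", "ephemeral-storage": "34359738368"},
        limits={"cpu": "1", "memory": "34359738368", "ephemeral-storage": "34359738368"},
    ),
)
```

The catch is that the probe itself has to fork a process inside the container, so under memory pressure it may fail for exactly the same reason kubectl exec does. In this situation that is arguably what we want (the kubelet kills the wedged container), but it could also kill a healthy, memory-heavy job spuriously, so the detection-and-cleanup approach may be the safer of the two.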

┆Issue is synchronized with this Jira Task
┆Issue Number: TOIL-454
