Kubernetes Toil jobs can get apparently stuck with all or almost all of their memory used #2858

Closed
adamnovak opened this issue Nov 14, 2019 · 0 comments · Fixed by #2895

adamnovak commented Nov 14, 2019

Here's what k9s says about a particular pod (one of many):

NAME ↑                                                        READY      STATUS         RS      CPU       MEM          IP                NODE         QOS      AGE 
root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68-7qzng       1/1        Running        0       948m      32763Mi      10.244.5.126      k1.kube      GA       17h

And here are two of the affected pods described:

Name:               root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-330-szf2t
Namespace:          vg
Priority:           0
PriorityClassName:  <none>
Node:               k1.kube/172.31.53.234
Start Time:         Wed, 13 Nov 2019 16:58:50 -0800
Labels:             controller-uid=277bd9e1-9ca5-4c35-9449-9467dfee39de
                    job-name=root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-330
Annotations:        <none>
Status:             Running
IP:                 10.244.5.141
Controlled By:      Job/root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-330
Containers:
  runner-container:
    Container ID:  docker://5b70030a0e720e491cbce0d41d5cfdaaffdff0b544e60385bfd30b68f02aa0f7
    Image:         quay.io/adamnovak/toil:3.21.0a1-e98a7bde0788a2d6f4e55da71ad6e072f8f6a821
    Image ID:      docker-pullable://quay.io/adamnovak/toil@sha256:734c456697a7059f6945b2c7948692937b5e18ddf3b8caf84313aa733e7fed17
    Port:          <none>
    Host Port:     <none>
    Command:
      _toil_kubernetes_executor
      gAJ9cQAoVQtlbnZpcm9ubWVudHEBfXECKFUXVE9JTF9SVF9MT0dHSU5HX0FERFJFU1NxA1USMTAuMjQ0LjEuMTU2OjM4MTkxcQRVFVRPSUxfUlRfTE9HR0lOR19MRVZFTHEFVQRJTkZPcQZ1VQdjb21tYW5kcQdYdwAAAF90b2lsX3dvcmtlciBydW5fY29uc3RydWN0X3JlZ2lvbl9ncmFwaCBhd3M6dXMtd2VzdC0yOmFkYW1ub3Zhay1idWlsZGdpcmFmZmVncmFwaHMgM2M0ZjExNmItMmM2NC00NzdiLWE3ZjktMDk2NTBhOTQ5YjcwcQhVCnVzZXJTY3JpcHRxCWNkaWxsLmRpbGwKX2NyZWF0ZV9uYW1lZHR1cGxlCnEKVRJWaXJ0dWFsRW52UmVzb3VyY2VxCyhVBG5hbWVxDFUIcGF0aEhhc2hxDVUDdXJscQ5VC2NvbnRlbnRIYXNocQ90cRBVDXRvaWwucmVzb3VyY2VxEYdxElJxEyhVDXNpdGUtcGFja2FnZXNxFFUgNTcyMTJmOTAzYTM4ZDJlZTY5ZWMxYTA3MmVhOWVjNjdxFWNmdXR1cmUudHlwZXMubmV3c3RyCm5ld3N0cgpxFliWAAAAaHR0cHM6Ly9hZGFtbm92YWstYnVpbGRnaXJhZmZlZ3JhcGhzLS1maWxlcy5zMy11cy13ZXN0LTIuYW1hem9uYXdzLmNvbS8xOWU4OGM3YS1kNjkyLTVjNTktOTA0NS1iNjM0OTE3M2JmNjA/dmVyc2lvbklkPVJjRTFiZkY4Y1NYbXlGN2JWam1wNHVKTVFSV1pQUk1KcReFcRiBcRl9cRpiVSA5NzhiZTJhZWU2YzVmMDRjNzFkMzU2MDAwYTdjNmRhY3EbdHEcgXEddS4=
    State:          Running
      Started:      Wed, 13 Nov 2019 16:58:51 -0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                1
      ephemeral-storage:  34359738368
      memory:             34359738368
    Requests:
      cpu:                1
      ephemeral-storage:  34359738368
      memory:             34359738368
    Environment:          <none>
    Mounts:
      /root/.aws from s3-credentials (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-dq8zg (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  s3-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  shared-s3-credentials
    Optional:    false
  default-token-dq8zg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-dq8zg
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
Vyoma:vg anovak$ kubectl describe pod root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68-7qzng
Name:               root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68-7qzng
Namespace:          vg
Priority:           0
PriorityClassName:  <none>
Node:               k1.kube/172.31.53.234
Start Time:         Wed, 13 Nov 2019 16:55:58 -0800
Labels:             controller-uid=06c54793-c85d-4444-84f3-9edb08e02601
                    job-name=root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68
Annotations:        <none>
Status:             Running
IP:                 10.244.5.126
Controlled By:      Job/root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68
Containers:
  runner-container:
    Container ID:  docker://07ba5e432267a254fa35ee146ca403aa1faa32b8cd566e10be9d9c70cabdde38
    Image:         quay.io/adamnovak/toil:3.21.0a1-e98a7bde0788a2d6f4e55da71ad6e072f8f6a821
    Image ID:      docker-pullable://quay.io/adamnovak/toil@sha256:734c456697a7059f6945b2c7948692937b5e18ddf3b8caf84313aa733e7fed17
    Port:          <none>
    Host Port:     <none>
    Command:
      _toil_kubernetes_executor
      gAJ9cQAoVQtlbnZpcm9ubWVudHEBfXECKFUXVE9JTF9SVF9MT0dHSU5HX0FERFJFU1NxA1USMTAuMjQ0LjEuMTU2OjM4MTkxcQRVFVRPSUxfUlRfTE9HR0lOR19MRVZFTHEFVQRJTkZPcQZ1VQdjb21tYW5kcQdYdwAAAF90b2lsX3dvcmtlciBydW5fY29uc3RydWN0X3JlZ2lvbl9ncmFwaCBhd3M6dXMtd2VzdC0yOmFkYW1ub3Zhay1idWlsZGdpcmFmZmVncmFwaHMgNWM1NmVmZjctMmQ3YS00N2Q4LWFkOTMtZDU3M2IyNDJiMmNmcQhVCnVzZXJTY3JpcHRxCWNkaWxsLmRpbGwKX2NyZWF0ZV9uYW1lZHR1cGxlCnEKVRJWaXJ0dWFsRW52UmVzb3VyY2VxCyhVBG5hbWVxDFUIcGF0aEhhc2hxDVUDdXJscQ5VC2NvbnRlbnRIYXNocQ90cRBVDXRvaWwucmVzb3VyY2VxEYdxElJxEyhVDXNpdGUtcGFja2FnZXNxFFUgNTcyMTJmOTAzYTM4ZDJlZTY5ZWMxYTA3MmVhOWVjNjdxFWNmdXR1cmUudHlwZXMubmV3c3RyCm5ld3N0cgpxFliWAAAAaHR0cHM6Ly9hZGFtbm92YWstYnVpbGRnaXJhZmZlZ3JhcGhzLS1maWxlcy5zMy11cy13ZXN0LTIuYW1hem9uYXdzLmNvbS8xOWU4OGM3YS1kNjkyLTVjNTktOTA0NS1iNjM0OTE3M2JmNjA/dmVyc2lvbklkPVJjRTFiZkY4Y1NYbXlGN2JWam1wNHVKTVFSV1pQUk1KcReFcRiBcRl9cRpiVSA5NzhiZTJhZWU2YzVmMDRjNzFkMzU2MDAwYTdjNmRhY3EbdHEcgXEddS4=
    State:          Running
      Started:      Wed, 13 Nov 2019 16:55:59 -0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                1
      ephemeral-storage:  34359738368
      memory:             34359738368
    Requests:
      cpu:                1
      ephemeral-storage:  34359738368
      memory:             34359738368
    Environment:          <none>
    Mounts:
      /root/.aws from s3-credentials (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-dq8zg (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  s3-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  shared-s3-credentials
    Optional:    false
  default-token-dq8zg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-dq8zg
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

As far as I can tell the pod is stuck, and I suspect it is not doing any useful work. I can't exec anything in the pod:

$ kubectl exec -it root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-330-szf2t -- /bin/bash
OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "read init-p: connection reset by peer": unknown
command terminated with exit code 126

(Sometimes I get a broken pipe error instead.)

I think the inability to shell into the container is caused by opencontainers/runc#1914: the container is so close to its memory limit that there isn't enough headroom left to start the shell process. Apparently it isn't over the limit enough to get OOM-killed, but I can't imagine it is doing its work properly with memory usage pegged at the limit.
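
For what it's worth, here is a rough, untested sketch (not existing Toil code) of how the "pegged at the limit" state could be confirmed from outside the pod using the official Python kubernetes client, assuming metrics-server is running in the cluster; the namespace, names, and the 99% threshold are just illustrative:

```python
# Untested sketch: flag pods whose reported memory usage is at (or very near)
# their memory limit. Assumes metrics-server is installed; not existing Toil code.
from kubernetes import client, config

BINARY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def to_bytes(quantity: str) -> int:
    """Parse a Kubernetes memory quantity like '32763Mi' or '34359738368'."""
    for suffix, factor in BINARY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)

config.load_kube_config()
core = client.CoreV1Api()
metrics = client.CustomObjectsApi()

namespace = "vg"

# Current memory usage per pod, summed over containers, from metrics.k8s.io.
usage_by_pod = {
    item["metadata"]["name"]: sum(to_bytes(c["usage"]["memory"]) for c in item["containers"])
    for item in metrics.list_namespaced_custom_object(
        "metrics.k8s.io", "v1beta1", namespace, "pods")["items"]
}

for pod in core.list_namespaced_pod(namespace).items:
    # Declared memory limit per pod, summed over containers that set one.
    limit = sum(
        to_bytes(c.resources.limits["memory"])
        for c in pod.spec.containers
        if c.resources and c.resources.limits and "memory" in c.resources.limits
    )
    used = usage_by_pod.get(pod.metadata.name, 0)
    if limit and used >= 0.99 * limit:
        print(f"{pod.metadata.name}: {used}/{limit} bytes; probably wedged at its memory limit")
```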

Toil should either rig the pods to fail when the memory limit is hit like this (somehow; maybe with an exec-based liveness check?), or it should detect these pods as stuck, just as it does for pods stuck in ImagePullBackOff, and clean them up itself.
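
For the liveness-check idea, here is a minimal, untested sketch (again, not existing Toil code; V1Probe/V1ExecAction are real kubernetes-client types, but the probe command, timings, and resource numbers are placeholders) of attaching an exec-based liveness probe to the executor container:

```python
# Untested sketch of an exec-based liveness probe on the executor container.
# The probe command, timings, and resource numbers are illustrative only.
from kubernetes import client

liveness = client.V1Probe(
    _exec=client.V1ExecAction(command=["/bin/true"]),  # placeholder probe command
    initial_delay_seconds=60,   # give the worker time to start up
    period_seconds=60,
    timeout_seconds=10,
    failure_threshold=3,        # ~3 minutes wedged before the kubelet acts
)

container = client.V1Container(
    name="runner-container",
    image="quay.io/adamnovak/toil:3.21.0a1-e98a7bde0788a2d6f4e55da71ad6e072f8f6a821",
    command=["_toil_kubernetes_executor", "<pickled job info>"],  # as in the describe output above
    liveness_probe=liveness,
    resources=client.V1ResourceRequirements(
        requests={"cpu": "1", "memory": "34359738368", "ephemeral-storage": "34359738368"},
        limits={"cpu": "1", "memory": "34359738368", "ephemeral-storage": "34359738368"},
    ),
)
```

The catch is that the probe itself has to fork a process inside the container, so under memory pressure it may fail for exactly the same reason kubectl exec does. In this situation that is arguably what we want (the kubelet kills the wedged container), but it could also kill a healthy, memory-heavy job spuriously, so the detection-and-cleanup approach may be the safer of the two.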

┆Issue is synchronized with this Jira Task
┆Issue Number: TOIL-454
