Here's what k9s says about a particular pod (one of many):
NAME ↑ READY STATUS RS CPU MEM IP NODE QOS AGE
root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68-7qzng 1/1 Running 0 948m 32763Mi 10.244.5.126 k1.kube GA 17h
And here is the pod described:
Name: root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-330-szf2t
Namespace: vg
Priority: 0
PriorityClassName: <none>
Node: k1.kube/172.31.53.234
Start Time: Wed, 13 Nov 2019 16:58:50 -0800
Labels: controller-uid=277bd9e1-9ca5-4c35-9449-9467dfee39de
job-name=root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-330
Annotations: <none>
Status: Running
IP: 10.244.5.141
Controlled By: Job/root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-330
Containers:
runner-container:
Container ID: docker://5b70030a0e720e491cbce0d41d5cfdaaffdff0b544e60385bfd30b68f02aa0f7
Image: quay.io/adamnovak/toil:3.21.0a1-e98a7bde0788a2d6f4e55da71ad6e072f8f6a821
Image ID: docker-pullable://quay.io/adamnovak/toil@sha256:734c456697a7059f6945b2c7948692937b5e18ddf3b8caf84313aa733e7fed17
Port: <none>
Host Port: <none>
Command:
_toil_kubernetes_executor
gAJ9cQAoVQtlbnZpcm9ubWVudHEBfXECKFUXVE9JTF9SVF9MT0dHSU5HX0FERFJFU1NxA1USMTAuMjQ0LjEuMTU2OjM4MTkxcQRVFVRPSUxfUlRfTE9HR0lOR19MRVZFTHEFVQRJTkZPcQZ1VQdjb21tYW5kcQdYdwAAAF90b2lsX3dvcmtlciBydW5fY29uc3RydWN0X3JlZ2lvbl9ncmFwaCBhd3M6dXMtd2VzdC0yOmFkYW1ub3Zhay1idWlsZGdpcmFmZmVncmFwaHMgM2M0ZjExNmItMmM2NC00NzdiLWE3ZjktMDk2NTBhOTQ5YjcwcQhVCnVzZXJTY3JpcHRxCWNkaWxsLmRpbGwKX2NyZWF0ZV9uYW1lZHR1cGxlCnEKVRJWaXJ0dWFsRW52UmVzb3VyY2VxCyhVBG5hbWVxDFUIcGF0aEhhc2hxDVUDdXJscQ5VC2NvbnRlbnRIYXNocQ90cRBVDXRvaWwucmVzb3VyY2VxEYdxElJxEyhVDXNpdGUtcGFja2FnZXNxFFUgNTcyMTJmOTAzYTM4ZDJlZTY5ZWMxYTA3MmVhOWVjNjdxFWNmdXR1cmUudHlwZXMubmV3c3RyCm5ld3N0cgpxFliWAAAAaHR0cHM6Ly9hZGFtbm92YWstYnVpbGRnaXJhZmZlZ3JhcGhzLS1maWxlcy5zMy11cy13ZXN0LTIuYW1hem9uYXdzLmNvbS8xOWU4OGM3YS1kNjkyLTVjNTktOTA0NS1iNjM0OTE3M2JmNjA/dmVyc2lvbklkPVJjRTFiZkY4Y1NYbXlGN2JWam1wNHVKTVFSV1pQUk1KcReFcRiBcRl9cRpiVSA5NzhiZTJhZWU2YzVmMDRjNzFkMzU2MDAwYTdjNmRhY3EbdHEcgXEddS4=
State: Running
Started: Wed, 13 Nov 2019 16:58:51 -0800
Ready: True
Restart Count: 0
Limits:
cpu: 1
ephemeral-storage: 34359738368
memory: 34359738368
Requests:
cpu: 1
ephemeral-storage: 34359738368
memory: 34359738368
Environment: <none>
Mounts:
/root/.aws from s3-credentials (rw)
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-dq8zg (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
s3-credentials:
Type: Secret (a volume populated by a Secret)
SecretName: shared-s3-credentials
Optional: false
default-token-dq8zg:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-dq8zg
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
Vyoma:vg anovak$ kubectl describe pod root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68-7qzng
Name: root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68-7qzng
Namespace: vg
Priority: 0
PriorityClassName: <none>
Node: k1.kube/172.31.53.234
Start Time: Wed, 13 Nov 2019 16:55:58 -0800
Labels: controller-uid=06c54793-c85d-4444-84f3-9edb08e02601
job-name=root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68
Annotations: <none>
Status: Running
IP: 10.244.5.126
Controlled By: Job/root-toil-a3413b4c-cb9c-4f21-90b9-dc39a75409a6-68
Containers:
runner-container:
Container ID: docker://07ba5e432267a254fa35ee146ca403aa1faa32b8cd566e10be9d9c70cabdde38
Image: quay.io/adamnovak/toil:3.21.0a1-e98a7bde0788a2d6f4e55da71ad6e072f8f6a821
Image ID: docker-pullable://quay.io/adamnovak/toil@sha256:734c456697a7059f6945b2c7948692937b5e18ddf3b8caf84313aa733e7fed17
Port: <none>
Host Port: <none>
Command:
_toil_kubernetes_executor
gAJ9cQAoVQtlbnZpcm9ubWVudHEBfXECKFUXVE9JTF9SVF9MT0dHSU5HX0FERFJFU1NxA1USMTAuMjQ0LjEuMTU2OjM4MTkxcQRVFVRPSUxfUlRfTE9HR0lOR19MRVZFTHEFVQRJTkZPcQZ1VQdjb21tYW5kcQdYdwAAAF90b2lsX3dvcmtlciBydW5fY29uc3RydWN0X3JlZ2lvbl9ncmFwaCBhd3M6dXMtd2VzdC0yOmFkYW1ub3Zhay1idWlsZGdpcmFmZmVncmFwaHMgNWM1NmVmZjctMmQ3YS00N2Q4LWFkOTMtZDU3M2IyNDJiMmNmcQhVCnVzZXJTY3JpcHRxCWNkaWxsLmRpbGwKX2NyZWF0ZV9uYW1lZHR1cGxlCnEKVRJWaXJ0dWFsRW52UmVzb3VyY2VxCyhVBG5hbWVxDFUIcGF0aEhhc2hxDVUDdXJscQ5VC2NvbnRlbnRIYXNocQ90cRBVDXRvaWwucmVzb3VyY2VxEYdxElJxEyhVDXNpdGUtcGFja2FnZXNxFFUgNTcyMTJmOTAzYTM4ZDJlZTY5ZWMxYTA3MmVhOWVjNjdxFWNmdXR1cmUudHlwZXMubmV3c3RyCm5ld3N0cgpxFliWAAAAaHR0cHM6Ly9hZGFtbm92YWstYnVpbGRnaXJhZmZlZ3JhcGhzLS1maWxlcy5zMy11cy13ZXN0LTIuYW1hem9uYXdzLmNvbS8xOWU4OGM3YS1kNjkyLTVjNTktOTA0NS1iNjM0OTE3M2JmNjA/dmVyc2lvbklkPVJjRTFiZkY4Y1NYbXlGN2JWam1wNHVKTVFSV1pQUk1KcReFcRiBcRl9cRpiVSA5NzhiZTJhZWU2YzVmMDRjNzFkMzU2MDAwYTdjNmRhY3EbdHEcgXEddS4=
State: Running
Started: Wed, 13 Nov 2019 16:55:59 -0800
Ready: True
Restart Count: 0
Limits:
cpu: 1
ephemeral-storage: 34359738368
memory: 34359738368
Requests:
cpu: 1
ephemeral-storage: 34359738368
memory: 34359738368
Environment: <none>
Mounts:
/root/.aws from s3-credentials (rw)
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-dq8zg (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
s3-credentials:
Type: Secret (a volume populated by a Secret)
SecretName: shared-s3-credentials
Optional: false
default-token-dq8zg:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-dq8zg
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
As far as I can tell the pod is stuck, and I suspect it is not doing any useful work. I can't exec anything in the pod (attempts can also fail with broken pipe):
I think the inability to shell into the container is caused by opencontainers/runc#1914: the container's memory limit is so nearly full that there is no headroom left to start the shell process. Apparently it is not full enough to get the pod OOM-killed, but I can't imagine the job is doing its work properly if the memory limit is pegged.
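For a rough sense of how pegged the limit is, the numbers above can be checked directly (assuming k9s reports MEM in MiB):

```python
# Memory limit from the pod spec, in bytes (34359738368 = 32 GiB).
limit_bytes = 34359738368
limit_mib = limit_bytes // 2**20   # 32768 MiB

# Memory usage reported by k9s for this pod, in MiB.
usage_mib = 32763

headroom_mib = limit_mib - usage_mib
print(limit_mib, headroom_mib)  # 32768 5
```

Only about 5 MiB of headroom remains inside the cgroup, which is consistent with fork/exec of a shell failing while the container itself survives without being OOM-killed.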
Toil should either rig the pods to fail if the memory limit is hit like this (somehow; maybe with an exec-based liveness check?), or it should catch these pods as stuck just like it does for pods that get stuck in ImagePullBackoff and clean them up itself.
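A minimal sketch of the first option, assuming Toil builds its container specs as plain dicts before submitting the Job (the probe command and thresholds here are illustrative, not from the source): an exec-based liveness probe must itself fork inside the container's memory cgroup, so when the limit is pegged the probe should start failing and the kubelet would eventually kill the container, unlike an httpGet or tcpSocket probe.

```python
# Hypothetical helper: attach an exec-based liveness probe to a container
# spec dict. The probe process forks inside the container's memory cgroup,
# so it should fail exactly in the stuck state described above.
def with_liveness_probe(container_spec):
    """Return a copy of a container spec dict with an exec liveness probe added."""
    probed = dict(container_spec)
    probed["livenessProbe"] = {
        "exec": {"command": ["/bin/true"]},  # minimal command that still requires a fork
        "periodSeconds": 30,
        "failureThreshold": 3,
        "timeoutSeconds": 10,
    }
    return probed

container = {"name": "runner-container",
             "image": "quay.io/adamnovak/toil:3.21.0a1"}
print(with_liveness_probe(container)["livenessProbe"]["failureThreshold"])  # 3
```

After three consecutive probe failures the kubelet restarts the container, and with a Job's backoff policy the stuck work would surface as a failure Toil can already see, instead of hanging indefinitely.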
┆Issue is synchronized with this Jira Task
┆Issue Number: TOIL-454