Pod fails to find GPU some time after creation #289

Closed
11 tasks
JuHyung-Son opened this issue Jan 28, 2022 · 14 comments

Comments

@JuHyung-Son

JuHyung-Son commented Jan 28, 2022

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

On version v0.10.0.
I did not modify cpu_manager_policy, and compatWithCPUManager is set to true.
At first the pod was able to get the GPU resource, but some time after creation it can no longer find the GPU and fails with:

root@instance-81:/# nvidia-smi
Failed to initialize NVML: Unknown Error

2. Steps to reproduce the issue

Install nvidia-device-plugin with Helm using the following values (a rough install command is sketched after the values):

compatWithCPUManager: true
resources:
    limits:
      cpu: 10m
      memory: 50Mi
    requests:
      cpu: 5m
      memory: 30Mi
image:
  repository: nvcr.io/nvidia/k8s-device-plugin
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: "v0.10.0"

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker configuration file (e.g: /etc/docker/daemon.json)
  • The k8s-device-plugin container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
    docker://20.10.7
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
||/ Name                              Version                     Architecture  Description
+++-=================================-===========================-=============-========================================
un  libgldispatch0-nvidia             <none>                      <none>        (no description)
ii  libnvidia-cfg1-465:amd64          465.19.01-0ubuntu1          amd64         NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                <none>                      <none>        (no description)
un  libnvidia-common                  <none>                      <none>        (no description)
ii  libnvidia-common-465              465.19.01-0ubuntu1          all           Shared files used by the NVIDIA libraries
un  libnvidia-compute                 <none>                      <none>        (no description)
rc  libnvidia-compute-460:amd64       460.91.03-0ubuntu0.18.04.1  amd64         NVIDIA libcompute package
ii  libnvidia-compute-465:amd64       465.19.01-0ubuntu1          amd64         NVIDIA libcompute package
ii  libnvidia-container-tools         1.7.0-1                     amd64         NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64        1.7.0-1                     amd64         NVIDIA container runtime library
un  libnvidia-decode                  <none>                      <none>        (no description)
ii  libnvidia-decode-465:amd64        465.19.01-0ubuntu1          amd64         NVIDIA Video Decoding runtime libraries
un  libnvidia-encode                  <none>                      <none>        (no description)
ii  libnvidia-encode-465:amd64        465.19.01-0ubuntu1          amd64         NVENC Video Encoding runtime library
un  libnvidia-extra                   <none>                      <none>        (no description)
ii  libnvidia-extra-465:amd64         465.19.01-0ubuntu1          amd64         Extra libraries for the NVIDIA driver
un  libnvidia-fbc1                    <none>                      <none>        (no description)
ii  libnvidia-fbc1-465:amd64          465.19.01-0ubuntu1          amd64         NVIDIA OpenGL-based Framebuffer Capture runtime library
un  libnvidia-gl                      <none>                      <none>        (no description)
ii  libnvidia-gl-465:amd64            465.19.01-0ubuntu1          amd64         NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un  libnvidia-ifr1                    <none>                      <none>        (no description)
ii  libnvidia-ifr1-465:amd64          465.19.01-0ubuntu1          amd64         NVIDIA OpenGL-based Inband Frame Readback runtime library
un  libnvidia-ml1                     <none>                      <none>        (no description)
un  nvidia-304                        <none>                      <none>        (no description)
un  nvidia-340                        <none>                      <none>        (no description)
un  nvidia-384                        <none>                      <none>        (no description)
un  nvidia-390                        <none>                      <none>        (no description)
un  nvidia-common                     <none>                      <none>        (no description)
un  nvidia-compute-utils              <none>                      <none>        (no description)
rc  nvidia-compute-utils-460          460.91.03-0ubuntu0.18.04.1  amd64         NVIDIA compute utilities
ii  nvidia-compute-utils-465          465.19.01-0ubuntu1          amd64         NVIDIA compute utilities
un  nvidia-container-runtime          <none>                      <none>        (no description)
un  nvidia-container-runtime-hook     <none>                      <none>        (no description)
ii  nvidia-container-toolkit          1.7.0-1                     amd64         NVIDIA container runtime hook
rc  nvidia-dkms-460                   460.91.03-0ubuntu0.18.04.1  amd64         NVIDIA DKMS package
ii  nvidia-dkms-465                   465.19.01-0ubuntu1          amd64         NVIDIA DKMS package
un  nvidia-dkms-kernel                <none>                      <none>        (no description)
un  nvidia-docker                     <none>                      <none>        (no description)
ii  nvidia-docker2                    2.8.0-1                     all           nvidia-docker CLI wrapper
ii  nvidia-driver-465                 465.19.01-0ubuntu1          amd64         NVIDIA driver metapackage
un  nvidia-driver-binary              <none>                      <none>        (no description)
un  nvidia-kernel-common              <none>                      <none>        (no description)
rc  nvidia-kernel-common-460          460.91.03-0ubuntu0.18.04.1  amd64         Shared files used with the kernel module
ii  nvidia-kernel-common-465          465.19.01-0ubuntu1          amd64         Shared files used with the kernel module
un  nvidia-kernel-source              <none>                      <none>        (no description)
un  nvidia-kernel-source-460          <none>                      <none>        (no description)
ii  nvidia-kernel-source-465          465.19.01-0ubuntu1          amd64         NVIDIA kernel source package
un  nvidia-legacy-340xx-vdpau-driver  <none>                      <none>        (no description)
ii  nvidia-modprobe                   510.39.01-0ubuntu1          amd64         Load the NVIDIA kernel driver and create device files
un  nvidia-opencl-icd                 <none>                      <none>        (no description)
un  nvidia-persistenced               <none>                      <none>        (no description)
ii  nvidia-prime                      0.8.16~0.18.04.1            all           Tools to enable NVIDIA's Prime
ii  nvidia-settings                   510.39.01-0ubuntu1          amd64         Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary            <none>                      <none>        (no description)
un  nvidia-smi                        <none>                      <none>        (no description)
un  nvidia-utils                      <none>                      <none>        (no description)
ii  nvidia-utils-465                  465.19.01-0ubuntu1          amd64         NVIDIA driver support binaries
un  nvidia-vdpau-driver               <none>                      <none>        (no description)
ii  xserver-xorg-video-nvidia-465     465.19.01-0ubuntu1          amd64         NVIDIA binary Xorg driver
  • NVIDIA container library version from nvidia-container-cli -V
cli-version: 1.7.0
lib-version: 1.7.0
build date: 2021-11-30T19:53+00:00
build revision: f37bb387ad05f6e501069d99e4135a97289faf1f
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
@Crazybean-lwb

Crazybean-lwb commented Feb 8, 2022

I am seeing the same issue:
k8s-device-plugin version: v0.9.0
docker version: 20.10.10
k8s version: v1.16.15

@JuHyung-Son
Author

Is this related to these Docker warning logs?

Feb 03 13:26:33 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T13:26:33.893079608+09:00" level=warning msg="Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap."
Feb 03 13:26:44 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T13:26:44.089622040+09:00" level=warning msg="Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap."
Feb 03 13:27:07 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T13:27:07.435008300+09:00" level=info msg="ignoring event" container=00c76fa9d52809ff132b0d8432cc2b099db92412034fe01120271390e8d416be module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Feb 03 13:27:07 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T13:27:07.536786126+09:00" level=info msg="ignoring event" container=9ce85cd2cbfb4ea3de901612a996bfe511ea8172b059c01a9771c86acaf46927 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Feb 03 13:34:53 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T13:34:53.140428425+09:00" level=info msg="ignoring event" container=63ecee71f42ba00216618deaf07f7085d5b4f7df7045587e5f99296c2c3b75c6 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Feb 03 13:34:53 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T13:34:53.713317281+09:00" level=info msg="ignoring event" container=a4703708b3b30eeec35577ac18c69a8a640bf5f91ff15bc58b0e258778c39df4 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Feb 03 13:35:01 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T13:35:01.534542130+09:00" level=warning msg="Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap."
Feb 03 13:35:03 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T13:35:03.347891413+09:00" level=info msg="ignoring event" container=94aae0792a1fbc1196764c77f0040a7e61ad1d17674d816d296a4cc3af27c456 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Feb 03 13:35:03 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T13:35:03.972117658+09:00" level=info msg="ignoring event" container=ec048b844cc4cf70ab431f06515c5c1cd27b46594c6dfbf6f1f278004547fddf module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Feb 03 13:35:11 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T13:35:11.421454019+09:00" level=warning msg="Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap."
Feb 03 13:43:20 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T13:43:20.779625198+09:00" level=info msg="ignoring event" container=2a22d4b35a6e41645d834e7b5412d9b85720deb1224b2da09740023a8f08eeb3 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Feb 03 13:43:21 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T13:43:21.569712637+09:00" level=info msg="ignoring event" container=752b9a3d7a2fa0734c32e9e3b2c64061bc881f2402c6660235393aa0fb61e27d module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Feb 03 13:43:30 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T13:43:30.652947669+09:00" level=info msg="ignoring event" container=353a080497ecb7296611c8fe06d3dc1ba929423470472876f828a26bc5a6a726 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Feb 03 13:43:31 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T13:43:31.594562540+09:00" level=info msg="ignoring event" container=a23885523f624101825d4b4793d8b39f802de7d53a40ab5f2ff71efe5380a156 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Feb 03 15:57:58 upstage-private-gpu12 dockerd[67345]: time="2022-02-03T15:57:58.966785795+09:00" level=info msg="ignoring event" container=61f3f5f7e099ee7c79dd7dbdfd525d3575524ed8bf21c3ef73a43bee40e0d763 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

@Crazybean-lwb

Yes, it does. But I do not know what it means.

@elezar
Member

elezar commented Feb 9, 2022

@liuweibin6566396837 is there a way for you to monitor the memory usage of the device plugin pod? It may be that the memory limit is too restrictive.
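
For example, something along these lines should work if metrics-server is available (the label selector here is an assumption and depends on how the chart was installed):

kubectl top pod -n kube-system -l app.kubernetes.io/name=nvidia-device-plugin --containers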

@Crazybean-lwb

> @liuweibin6566396837 is there a way for you to monitor the memory usage of the device plugin pod? It may be that the memory limit is too restrictive.

Thanks for your reply. The device plugin pod's QoS was set to BestEffort by default, so the memory limit may not be the problem, even though the pod does restart at times.

@klueska
Contributor

klueska commented Feb 9, 2022

We've seen this issue popping up on some people's setups but do not have a diagnosis yet. In general this can happen if some component on your system is calling Update() against your container, causing it to lose access to the cgroups associated with the GPUs.

It's possible it is linked to an issue that was recently fixed in runC:
opencontainers/runc#2366 (comment)

This bug causes every update to a container's devices cgroup to temporarily block access to all devices before applying the update (even if there is no change in permissions). If a GPU was accessed during this small time window, you would see the error you are describing.

That said, we've seen this issue pop up even on systems with a version of runC that fixes this bug, so it's not 100% conclusive.

The only thing we've seen that fully resolves the issue is to upgrade to an "experimental" version of our NVIDIA container runtime that bypasses the need for libnvidia-container to change cgroup permissions out from underneath runC.
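
As a rough diagnostic on a cgroup v1 node, the container's devices allowlist can be inspected from inside the affected container (the path below is the usual cgroup v1 mount; /dev/nvidia0 and /dev/nvidiactl are typically character devices with major number 195):

# inside the affected container, cgroup v1 only
cat /sys/fs/cgroup/devices/devices.list
# if the NVIDIA entries (e.g. "c 195:* rw" or the specific device majors) are missing,
# the container has lost access to the GPU device nodes and NVML fails to initialize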

@JuHyung-Son
Author

@klueska
Could this be related to unattended-upgrades?
On some of my nodes, unattended-upgrades upgraded the NVIDIA driver and some other packages, and some pods then lost their GPU resources.

@JuHyung-Son
Author

@elezar Could this be related to the nvidia-device-plugin pod's memory? Some of my pods lose their GPUs while the nvidia-device-plugin's memory usage is close to its memory request.
[screenshot: memory usage graph of the nvidia-device-plugin pod]

@elezar
Member

elezar commented Feb 23, 2022

If a node has the nvidia-driver updated and the pods continue to run, then this could explain why these pods cannot access the driver. The version-specific libraries are mounted into the container, and if these are removed from the host as part of a driver upgrade this may cause the containers to stop working.

Is the problem limited to nodes that have unattended upgrades enabled?
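
A rough way to check for such a mismatch (the library path is typical for Ubuntu x86_64, not guaranteed):

# on the host: the driver version currently loaded in the kernel
cat /proc/driver/nvidia/version
# inside the container: the versioned user-space libraries that were mounted in
ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*
# if these versions no longer match, NVML calls from the container will fail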

@JuHyung-Son
Author

@elezar
I disabled unattended upgrades for Docker and the NVIDIA-related packages, but the problem still occurs.
And some nodes that do not have this problem do have unattended upgrades enabled, so I concluded this is probably not the cause.

@JuHyung-Son
Author

JuHyung-Son commented Feb 28, 2022

@elezar @klueska
I just found a pretty suspicious clue.

In my cluster, nodes running kernel 5.4.0 on Ubuntu 18.04 have this problem, but nodes running kernel 5.11.0 on Ubuntu 20.04 do not.

@JuHyung-Son
Author

I just found something odd.
The nodes with the issue above use libnvidia-container >= 1.8.0 and mount only the cgroup (v1) hierarchy:

> mount | grep '^cgroup' | awk '{print $1}' | uniq
cgroup

The nodes without the issue use libnvidia-container >= 1.7.0-1 and have cgroup2 mounted as well:

> mount | grep '^cgroup' | awk '{print $1}' | uniq
cgroup2
cgroup
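
As a quicker check of which hierarchy a node is on (a sketch; /sys/fs/cgroup is the standard systemd mount point):

stat -fc %T /sys/fs/cgroup/
# prints "cgroup2fs" on a unified (cgroup v2) host and "tmpfs" on a v1/hybrid host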

@JuHyung-Son
Author

Issue solved.

I found that this issue appears on Ubuntu 18.04, which uses cgroup v1.
After upgrading to Ubuntu 20.04, which uses cgroup v2, the issue no longer appears.
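
For anyone who cannot upgrade the OS, it may also be possible to switch a systemd-based host to the unified (v2) hierarchy via a kernel parameter (untested here; support depends on the distro and container runtime):

# add to GRUB_CMDLINE_LINUX in /etc/default/grub, then run update-grub and reboot
systemd.unified_cgroup_hierarchy=1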

@aisensiy

> We've seen this issue popping up on some people's setups but do not have a diagnosis yet. In general this can happen if some component on your system is calling Update() against your container, causing it to lose access to the cgroups associated with the GPUs.
>
> It's possible it is linked to an issue that was recently fixed in runC: opencontainers/runc#2366 (comment)
>
> This bug causes every update to a container's devices cgroup to temporarily block access to all devices before applying the update (even if there is no change in permissions). If a GPU was accessed during this small time window, you would see the error you are describing.
>
> That said, we've seen this issue pop up even on systems with a version of runC that fixes this bug, so it's not 100% conclusive.
>
> The only thing we've seen that fully resolves the issue is to upgrade to an "experimental" version of our NVIDIA container runtime that bypasses the need for libnvidia-container to change cgroup permissions out from underneath runC.

Is this done? I think this issue is still happening with the latest nvidia-device-plugin (0.12.3) for pods without Guaranteed QoS. Using cgroup v1 or v2 on Ubuntu 22.04 does not make any difference.
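
For reference, whether a pod actually got the Guaranteed QoS class (which requires requests to equal limits for every container) can be checked with something like:

kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'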
