Pod fails to find GPU some time after creation #289
Comments
I ran into the same issue.
Is this related to this Docker warning log?
Yes, it is related, but I do not know what it means.
@liuweibin6566396837 is there a way for you to monitor the memory usage of the device plugin pod? It may be that the memory limit is too restrictive.
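One quick way to watch the plugin pod's memory is a sketch like the following; it assumes metrics-server is installed and that the pod carries the Helm chart's usual labels in kube-system (adjust the namespace and selector to your deployment):

```bash
# Watch memory usage of the device plugin pods (requires metrics-server).
# Namespace and label selector are assumptions based on a typical Helm install.
kubectl top pod -n kube-system -l app.kubernetes.io/name=nvidia-device-plugin
```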
Thanks for your reply. The device plugin pod's QoS was set to BestEffort by default, so the memory limit may not be the problem, even though the pod restarts at times.
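For completeness, the QoS class and restart count can be checked directly; a sketch, assuming the pod runs in kube-system and `<pod-name>` is the plugin pod:

```bash
# Confirm the QoS class Kubernetes assigned to the device plugin pod.
kubectl get pod <pod-name> -n kube-system -o jsonpath='{.status.qosClass}{"\n"}'
# Check the restart count, which would hint at OOM kills or crashes.
kubectl get pod <pod-name> -n kube-system -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'
```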
We've seen this issue popping up on some people's setups but do not have a diagnosis yet. In general this can happen if some component on your system is updating the container's devices cgroup out from under it. It's possible it is linked to an issue that was recently fixed: that bug caused every update to a container's devices cgroup to temporarily block access to all devices before applying the update (even if there was no change in permissions). If a GPU was accessed during this small time window, you would see the error you are describing. That said, we've seen this issue pop up even on systems that already include the fix. The only thing we've seen that fully resolves the issue is to upgrade to an "experimental" version of our NVIDIA container runtime that bypasses the need for libnvidia-container to change cgroup permissions out from underneath it.
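As a quick way to confirm the symptom on an affected pod (a sketch; it assumes kubectl access, that the container image ships nvidia-smi, and `<pod-name>` is a placeholder):

```bash
# Run nvidia-smi inside the affected container.
kubectl exec <pod-name> -- nvidia-smi
# If the devices cgroup was changed out from under the container, this typically fails
# with an NVML initialization error even though the GPU was visible when the pod started.
```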
@klueska |
@elezar could this be related to the nvidia-device-plugin pod's memory?
If a node has unattended upgrades enabled, that could be a factor here. Is the problem limited to nodes that have unattended upgrades enabled?
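For anyone checking, the unattended-upgrades state on an Ubuntu node can be inspected like this (a sketch; the paths are the Ubuntu defaults):

```bash
# Is the unattended-upgrades service active on this node?
systemctl status unattended-upgrades --no-pager
# Are periodic unattended upgrades enabled in apt's configuration?
cat /etc/apt/apt.conf.d/20auto-upgrades
```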
@elezar |
I just found something weird: the nodes hitting this issue and the other nodes with no issue are configured differently.
Issue solved. I found that this issue appears on Ubuntu 18.04 nodes using cgroup.
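For others comparing nodes, the cgroup version and Docker's cgroup driver can be checked per node (a sketch; the docker info format fields assume Docker 20.10 or newer):

```bash
# "cgroup2fs" means cgroup v2 is mounted; "tmpfs" means cgroup v1.
stat -fc %T /sys/fs/cgroup/
# Report Docker's cgroup driver and the cgroup version it detected.
docker info --format '{{.CgroupDriver}} (cgroup v{{.CgroupVersion}})'
```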
Is this done? I think this issue is still happening in the latest nvidia device plugin (0.12.3) for pods without a
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Issue or feature description
On version v0.10.0: at first, the pod was able to get the GPU resource, but some time after creation the pod cannot find the GPU and fails with an error:

I didn't modify cpu_manager_policy and set compatWithCPUManager to true.
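For context, a GPU workload of this kind is typically created with a pod that requests the nvidia.com/gpu resource; a minimal sketch (the pod name, image tag, and command are illustrative, not the reporter's actual spec):

```bash
# Create a throwaway pod that requests one GPU and prints nvidia-smi output.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    command: ["sh", "-c", "nvidia-smi; sleep 3600"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# Later, re-run nvidia-smi in the same pod to see whether GPU access has been lost.
kubectl exec gpu-smoke-test -- nvidia-smi
```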
2. Steps to reproduce the issue
Install nvidia-device-plugin with Helm, with custom values.
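A sketch of such an install, using the chart repository documented in the project README (the namespace and values below are illustrative assumptions, not the reporter's actual values):

```bash
# Add the documented Helm repository and install chart version 0.10.0.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --version 0.10.0 \
  --set compatWithCPUManager=true   # assumed value; adjust to match your setup
```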
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- The output of nvidia-smi -a on your host
- Your docker configuration file (e.g. /etc/docker/daemon.json)
- The kubelet logs on your node (e.g. sudo journalctl -r -u kubelet)
Additional information that might help better understand your environment and reproduce the bug (a combined collection sketch follows this list):
- Docker version from docker version: docker://20.10.7
- Kernel version from uname -a
- Any relevant kernel output lines from dmesg
- NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
- NVIDIA container library version from nvidia-container-cli -V
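To gather all of the above in one pass, something like the following works (a convenience sketch; adjust the paths and the package-manager commands to your distribution):

```bash
# Collect the diagnostics requested above into individual files.
nvidia-smi -a > nvidia-smi.txt
cat /etc/docker/daemon.json > docker-daemon.json
sudo journalctl -r -u kubelet > kubelet.log
docker version > docker-version.txt
uname -a > uname.txt
dmesg > dmesg.txt
dpkg -l '*nvidia*' > nvidia-packages.txt 2>/dev/null || rpm -qa '*nvidia*' > nvidia-packages.txt
nvidia-container-cli -V > nvidia-container-cli.txt
```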