Container lost access to GPU on driver version 515.105.01 #1748
Comments
@bowen-dd Can you confirm that valid symlinks exist under …?
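For anyone checking the same thing, a minimal sketch of how the symlinks could be inspected on the host. The exact directory was not captured in this thread, so /dev/char below is an assumption based on common setups:

```bash
# Assumption: the symlinks being asked about live under /dev/char on the host.
# Run this on the node itself, not inside the pod.
ls -l /dev/char | grep -i nvidia
```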
Thanks for the quick response. I don't see …
Did you install the driver on the node manually or using the GPU Operator? Can you run …
@bowen-dd I am surprised to see no …
In any case, you can create this directory manually and run …
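A hedged sketch of the manual step being referred to; this assumes a recent NVIDIA Container Toolkit (v1.13 or later) that ships the `nvidia-ctk system` subcommand, which is not confirmed anywhere in this thread:

```bash
# Assumption: nvidia-ctk from NVIDIA Container Toolkit >= 1.13 is installed on the host.
# Recreates the character-device symlinks for the NVIDIA driver under /dev/char.
sudo nvidia-ctk system create-dev-char-symlinks --create-all
```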
I was trying to generate the nvidia-bug-report.log but didn't find nvidia-bug-report.sh. If you think the log report might help, could you please shed some light on how to get the script? Thanks!
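For reference (not confirmed against this particular node image), nvidia-bug-report.sh is installed by the driver package itself, typically on the host rather than inside the container, so a sketch of locating and running it would be:

```bash
# Assumption: the script is present on the host where the driver was installed,
# usually at /usr/bin/nvidia-bug-report.sh.
which nvidia-bug-report.sh
sudo nvidia-bug-report.sh   # writes nvidia-bug-report.log.gz to the current directory
```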
Gentle bump on this, thanks team!
nvidia-bug-report.log
@bowen-dd Can you confirm that even with the …
It does seem to be fixed with runc v1.1.7.
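A quick way to confirm which runc version is actually in use (this assumes Docker as the container runtime, matching the `docker version` item in the issue template below):

```bash
runc --version                           # version of the runc binary on PATH
docker info 2>/dev/null | grep -i runc   # runtime and runc version Docker is actually using
```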
Hi @shivamerla, I wasn't looking at the right place when I claimed we don't have the symlink. The symlink does exist in our instance; please see the details below:
…
@elezar any idea why the error might happen even with the symlinks in place? Do we recommend upgrading …?
Hi team, just want to follow up to see if the nvidia-bug-report.log is in any way helpful, and whether there is anything else we can do to help debug this issue on our side? Thanks a lot!
@bowen-dd Are you still hitting the same error?
I didn't see any updates on how the container is being started here. Could you provide the command line used? Note that as per #1671 (comment), when using …
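Since the actual command line was never posted, here is only a generic example of the kind of invocation being asked about; the image tag and flags are placeholders, not the reporter's setup:

```bash
# Hypothetical example only -- not the reporter's actual command.
docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
```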
How can I download a real dataset from NVIDIA to work on that?
Hi team, closing this issue, as after upgrading the runc version we no longer see this issue on our side. Appreciate your help!
1. Issue or feature description
Containers lost GPU access after being idle for a while. We upgraded to the new NVIDIA driver version 515.105.01, which has the fix for the previous GPU issue.
2. Steps to reproduce the issue
We were able to train successfully on GPU for a decent amount of time (~4 hours). However, after the same nodes had been idle for a day or two (based on our observation; the idle time could be shorter), the containers lost GPU access again: nvidia-smi returns Failed to initialize NVML: Unknown Error.
nvcc returns the following on our pod:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
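A minimal sketch of the in-container checks behind the output above (the exact image and pod spec are not shown in this issue):

```bash
# Inside the affected pod/container:
nvidia-smi       # fails with: Failed to initialize NVML: Unknown Error
nvcc --version   # still prints the CUDA 11.6 toolchain banner shown above
```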
3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
uname -a
dmesg
nvidia-smi -a
docker version
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
nvidia-container-cli -V