
Container lost access to GPU on driver version 515.105.01 #1748

Closed

bowen-dd opened this issue Apr 24, 2023 · 17 comments

Comments

@bowen-dd

1. Issue or feature description

Containers lose GPU access after being idle for a while. We upgraded to the new NVIDIA driver version 515.105.01, which has the fix for the previous GPU issue.

2. Steps to reproduce the issue

We were able to train successfully on GPU for a decent amount of time (~4 hours). However, after the same nodes had been idle for a day or two (based on our observation; the idle time could be shorter), the containers lost GPU access again: nvidia-smi returns Failed to initialize NVML: Unknown Error.

nvcc returns the following in our pod:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
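
For reference, a minimal sketch of how the failure can be spotted from the cluster side (the pod name and namespace below are placeholders, not from our setup):

# Hypothetical check; adjust the pod name and namespace to the actual workload.
kubectl exec -n default gpu-worker -- nvidia-smi
# A healthy pod prints the usual device table; a broken one prints:
#   Failed to initialize NVML: Unknown Error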

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
  • Driver information from nvidia-smi -a
  • Docker version from docker version
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • NVIDIA container library version from nvidia-container-cli -V
  • NVIDIA container library logs (see troubleshooting)
  • Docker command, image and tag used
@shivamerla

@bowen-dd Can you confirm whether valid symlinks exist under /dev/char/ on the node and point to the NVIDIA devices? If they are not present, you can run sudo nvidia-ctk system create-dev-char-symlinks to create them, which will avoid this issue. That said, the driver version you mentioned should already have this fix; we need to double-check.
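
A quick sketch of the check and fix being suggested here (run on the node host; the grep pattern assumes the symlinks point at NVIDIA devices, as shown later in this thread):

# List any NVIDIA character-device symlinks already present on the host.
ls -l /dev/char/ | grep nvidia

# If none exist, create them with the NVIDIA Container Toolkit CLI.
sudo nvidia-ctk system create-dev-char-symlinks

# Re-check that the symlinks now resolve to /dev/nvidia* devices.
ls -l /dev/char/ | grep nvidia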

@bowen-dd

Thanks for the quick response. I don't see a /dev/char/ directory on my node.
ls /dev/ shows the following:

core  full    null        nvidia-uvm-tools  nvidia1  nvidia3    ptmx  random  stderr  stdout           tty      zero
fd    mqueue  nvidia-uvm  nvidia0           nvidia2  nvidiactl  pts   shm     stdin   termination-log  urandom

@shivamerla

Did you install the driver on the node manually or using the GPU Operator? Can you run nvidia-smi and see if those symlinks get created? I just confirmed that this driver version has a fix to create the necessary symlinks under /dev/char.

@shivamerla

@bowen-dd I am surprised to see no /dev/char folder at all on the node. We should see some character devices there by default (tty, sg*, etc.). Can you share details on the configuration?

@shivamerla

In any case, you can create this directory manually and run nvidia-smi for these symlinks to be created.

@bowen-dd

sudo nvidia-ctk system create-dev-char-symlinks created the symlinks.
nvidia-smi returns the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         Off  | 00000000:00:1B.0 Off |                    0 |
|  0%   21C    P8    22W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G         Off  | 00000000:00:1C.0 Off |                    0 |
|  0%   21C    P8    21W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G         Off  | 00000000:00:1D.0 Off |                    0 |
|  0%   21C    P8    21W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G         Off  | 00000000:00:1E.0 Off |                    0 |
|  0%   20C    P8    23W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I was trying to generate nvidia-bug-report.log but didn't find nvidia-bug-report.sh. If you think the log might help, could you please shed some light on how to get the script? Thanks!

@bowen-dd

Gentle bump on this, thanks team!

@bowen-dd commented Apr 27, 2023

nvidia-bug-report.log
nvidia-bug-report.log is attached. At the time the report was generated, nvidia-smi returned Failed to initialize NVML: Unknown Error in the pod.

@shivamerla commented Apr 27, 2023

@bowen-dd Can you confirm that even with the /dev/char symlinks created now, you are still hitting this error? From the recent discussions here and here, it looks like runc recently made changes that avoid the need for the /dev/char symlinks, but we need to wait for the next runc release to update. @elezar please confirm.

@shivamerla commented Apr 27, 2023

It does seem to be fixed with runc v1.1.7
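
For reference, a simple way to check which runc a node is running (output format varies slightly by distribution and container runtime packaging):

# Print the installed runc version; per the comment above, v1.1.7 or newer
# appears to contain the relevant fix.
runc --version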

@bowen-dd commented Apr 28, 2023

Hi @shivamerla, I wasn't looking in the right place when I claimed we don't have the symlinks. The symlinks do exist on our instance; please see details below:

/dev/char# ls -al |grep nvidia
lrwxrwxrwx  1 root root   15 Apr 25 17:26 0:0 -> /dev/nvidia-uvm
lrwxrwxrwx  1 root root   21 Apr 25 17:26 0:1 -> /dev/nvidia-uvm-tools
lrwxrwxrwx  1 root root   12 Apr 25 17:26 0:2 -> /dev/nvidia2
lrwxrwxrwx  1 root root   19 Apr 25 17:26 0:254 -> /dev/nvidia-modeset
lrwxrwxrwx  1 root root   14 Apr 25 17:26 0:255 -> /dev/nvidiactl
lrwxrwxrwx  1 root root   12 Apr 25 17:26 0:3 -> /dev/nvidia3
lrwxrwxrwx  1 root root   10 Apr 27 22:29 195:0 -> ../nvidia0
lrwxrwxrwx  1 root root   10 Apr 27 22:29 195:1 -> ../nvidia1
lrwxrwxrwx  1 root root   10 Apr 27 22:29 195:2 -> ../nvidia2
lrwxrwxrwx  1 root root   12 Apr 27 22:29 195:255 -> ../nvidiactl
lrwxrwxrwx  1 root root   10 Apr 27 22:29 195:3 -> ../nvidia3
lrwxrwxrwx  1 root root   26 Apr 27 22:29 238:1 -> ../nvidia-caps/nvidia-cap1
lrwxrwxrwx  1 root root   26 Apr 27 22:29 238:2 -> ../nvidia-caps/nvidia-cap2

@shivamerla

@elezar any idea why the error might happen even with the symlinks in place? Do we recommend upgrading runc to v1.1.7?

@bowen-dd commented May 1, 2023

Hi team, just following up to see whether the nvidia-bug-report.log has been helpful in any way, and whether there is anything else we can do on our side to help debug this issue. Thanks a lot!

@arunraman

@bowen-dd Are you still hitting the same error?

@elezar commented May 2, 2023

I didn't see any updates on how the container is being started here. Could you provide the command line used?

Note that as per #1671 (comment), when using systemd cgroup management it is required to pass the device nodes on the docker command line when launching a container. This is a separate issue from the runc bug that was fixed, for which the /dev/char symlinks were a workaround.
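
As a rough illustration of what passing the device nodes explicitly can look like (the device paths and CUDA image tag below are placeholders and must match the actual node and workload):

# Illustrative only: request the GPUs and also pass the NVIDIA device nodes
# explicitly, as suggested for setups using systemd cgroup management.
docker run --rm --gpus all \
  --device /dev/nvidia0 \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia-uvm-tools \
  nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi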

@ehsanmns commented May 7, 2023

How can I download a real dataset from NVIDIA to work with?

@bowen-dd

Hi team, closing this issue: after upgrading the runc version, we no longer see this issue on our side at this time. Appreciate your help!
