
Container lost access to GPU on driver version 515.105.01 #1748

Closed

bowen-dd opened this issue Apr 24, 2023 · 17 comments

Comments

@bowen-dd

1. Issue or feature description

Containers lose GPU access after being idle for a while. We upgraded to the new NVIDIA driver version 515.105.01, which has the fix for the previous GPU issue.

2. Steps to reproduce the issue

We were able to train successfully on GPU for a decent amount of time (~4 hours). However, after the same nodes had been idle for a day or two (based on our observation; the idle time could be shorter), the containers lost GPU access again: nvidia-smi returns Failed to initialize NVML: Unknown Error.

nvcc returns the following in our pod:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
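
For reference, a minimal sketch of how the failure can be spotted from the cluster side (the pod name and namespace below are placeholders, not from our setup):

# Hypothetical check; adjust the pod name and namespace to the actual workload.
kubectl exec -n default gpu-worker -- nvidia-smi
# A healthy pod prints the usual device table; a broken one prints:
#   Failed to initialize NVML: Unknown Error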

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
  • Driver information from nvidia-smi -a
  • Docker version from docker version
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • NVIDIA container library version from nvidia-container-cli -V
  • NVIDIA container library logs (see troubleshooting)
  • Docker command, image and tag used
@shivamerla

@bowen-dd Can you confirm whether valid symlinks exist under /dev/char/ on the node and point to the NVIDIA devices? If they are not present, you can run sudo nvidia-ctk system create-dev-char-symlinks to create them, which will avoid this issue. That said, the driver version you mentioned should already have this fix; we need to double-check.
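
A quick sketch of the check and fix being suggested here (run on the node host; the grep pattern assumes the symlinks point at NVIDIA devices, as shown later in this thread):

# List any NVIDIA character-device symlinks already present on the host.
ls -l /dev/char/ | grep nvidia

# If none exist, create them with the NVIDIA Container Toolkit CLI.
sudo nvidia-ctk system create-dev-char-symlinks

# Re-check that the symlinks now resolve to /dev/nvidia* devices.
ls -l /dev/char/ | grep nvidia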

@bowen-dd

Thanks for the quick response. I don't see a /dev/char/ directory on my node.
ls /dev/ shows the following:

core  full    null        nvidia-uvm-tools  nvidia1  nvidia3    ptmx  random  stderr  stdout           tty      zero
fd    mqueue  nvidia-uvm  nvidia0           nvidia2  nvidiactl  pts   shm     stdin   termination-log  urandom

@shivamerla

Did you install the driver on the node manually or using the GPU Operator? Can you run nvidia-smi and see if those symlinks get created? I just confirmed that this driver version has a fix to create the necessary symlinks under /dev/char.

@shivamerla

@bowen-dd I am surprised to see no /dev/char folder at all on the node. We should see some character devices there by default (tty, sg*, etc.). Can you share details on the configuration?

@shivamerla

In any case, you can create this directory manually and run nvidia-smi for these symlinks to be created.

@bowen-dd

sudo nvidia-ctk system create-dev-char-symlinks created the symlinks.
nvidia-smi returns the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         Off  | 00000000:00:1B.0 Off |                    0 |
|  0%   21C    P8    22W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G         Off  | 00000000:00:1C.0 Off |                    0 |
|  0%   21C    P8    21W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G         Off  | 00000000:00:1D.0 Off |                    0 |
|  0%   21C    P8    21W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G         Off  | 00000000:00:1E.0 Off |                    0 |
|  0%   20C    P8    23W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I was trying to generate nvidia-bug-report.log but didn't find nvidia-bug-report.sh. If you think the log might help, could you please shed some light on how to get the script? Thanks!

@bowen-dd

Gentle bump on this, thanks team!

@bowen-dd commented Apr 27, 2023

nvidia-bug-report.log
nvidia-bug-report.log is attached. At the time the report was generated, nvidia-smi returned Failed to initialize NVML: Unknown Error in the pod.

@shivamerla commented Apr 27, 2023

@bowen-dd Can you confirm that even with the /dev/char symlinks created now, you are still hitting this error? From the recent discussions here and here, it looks like runc recently made changes that avoid the need for the /dev/char symlinks, but we need to wait for the next runc release to update. @elezar please confirm.

@shivamerla commented Apr 27, 2023

It does seem to be fixed with runc v1.1.7
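
For reference, a simple way to check which runc a node is running (output format varies slightly by distribution and container runtime packaging):

# Print the installed runc version; per the comment above, v1.1.7 or newer
# appears to contain the relevant fix.
runc --version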

@bowen-dd commented Apr 28, 2023

Hi @shivamerla, I wasn't looking in the right place when I claimed we don't have the symlinks. The symlinks do exist on our instance; please see details below:

/dev/char# ls -al |grep nvidia
lrwxrwxrwx  1 root root   15 Apr 25 17:26 0:0 -> /dev/nvidia-uvm
lrwxrwxrwx  1 root root   21 Apr 25 17:26 0:1 -> /dev/nvidia-uvm-tools
lrwxrwxrwx  1 root root   12 Apr 25 17:26 0:2 -> /dev/nvidia2
lrwxrwxrwx  1 root root   19 Apr 25 17:26 0:254 -> /dev/nvidia-modeset
lrwxrwxrwx  1 root root   14 Apr 25 17:26 0:255 -> /dev/nvidiactl
lrwxrwxrwx  1 root root   12 Apr 25 17:26 0:3 -> /dev/nvidia3
lrwxrwxrwx  1 root root   10 Apr 27 22:29 195:0 -> ../nvidia0
lrwxrwxrwx  1 root root   10 Apr 27 22:29 195:1 -> ../nvidia1
lrwxrwxrwx  1 root root   10 Apr 27 22:29 195:2 -> ../nvidia2
lrwxrwxrwx  1 root root   12 Apr 27 22:29 195:255 -> ../nvidiactl
lrwxrwxrwx  1 root root   10 Apr 27 22:29 195:3 -> ../nvidia3
lrwxrwxrwx  1 root root   26 Apr 27 22:29 238:1 -> ../nvidia-caps/nvidia-cap1
lrwxrwxrwx  1 root root   26 Apr 27 22:29 238:2 -> ../nvidia-caps/nvidia-cap2

@shivamerla

@elezar any idea why the error might happen even with the symlinks in place? Do we recommend upgrading runc to v1.1.7?

@bowen-dd commented May 1, 2023

Hi team, just following up to see whether the nvidia-bug-report.log has been helpful in any way, and whether there is anything else we can do on our side to help debug this issue. Thanks a lot!

@arunraman

@bowen-dd Are you still hitting the same error?

@elezar commented May 2, 2023

I didn't see any updates on how the container is being started here. Could you provide the command line used?

Note that as per #1671 (comment), when using systemd cgroup management it is required to pass the device nodes on the docker command line when launching a container. This is a separate issue from the runc bug that was fixed, for which the /dev/char symlinks were a workaround.
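
As a rough illustration of what passing the device nodes explicitly can look like (the device paths and CUDA image tag below are placeholders and must match the actual node and workload):

# Illustrative only: request the GPUs and also pass the NVIDIA device nodes
# explicitly, as suggested for setups using systemd cgroup management.
docker run --rm --gpus all \
  --device /dev/nvidia0 \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia-uvm-tools \
  nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi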

@ehsanmns commented May 7, 2023

How can I download a real dataset from NVIDIA to work with?

@bowen-dd

Hi team, closing this issue: after upgrading the runc version, we no longer see this issue on our side at this time. Appreciate your help!
