can't attach gpu to windows instance #15294

Open
4 of 7 tasks
wideawakening opened this issue Mar 31, 2025 · 2 comments

@wideawakening
Member

wideawakening commented Mar 31, 2025

Please confirm

  • I have searched existing issues to check if an issue already exists for the bug I encountered.

Distribution

Ubuntu server

Distribution version

22.04.5 LTS

Output of "snap list --all lxd core20 core22 core24 snapd"

ubuntu@hpc-01:~/offline_snaps$ snap list
Name        Version                 Rev    Tracking  Publisher      Notes
core        16-2.61.4-20240607      17200  -         canonical✓     core
core22      20250110                1748   -         canonical✓     base
core24      20240920                609    -         canonical✓     base
lxd         5.21.3-75def3c          32455  -         canonical✓     -
microceph   19.2.0+snap2fbf0bad05   1271   -         canonical✓     -
microcloud  2.1.0-3e8b183           1144   -         canonical✓     -
microovn    24.03.2+snapa2c59c105b  667    -         canonical✓     -
snapd       2.67                    23545  -         canonical✓     snapd

Output of "lxc info" or system info if it fails

ubuntu@hpc-03:~$ lxc info --show-log windows11-deleteme
Name: windows11-deleteme
Status: STOPPED
Type: virtual-machine
Architecture: x86_64
Location: hpc-04
Created: 2025/03/14 11:59 CET
Last Used: 2025/03/20 12:14 CET
Error: open /var/snap/lxd/common/lxd/logs/windows11-deleteme/qemu.log: no such file or directory

Issue description

One of our customers is trying to attach a GPU to an existing Windows 11 VM, but the instance fails to start once the GPU is attached.
GPU attachment does work on an Ubuntu Desktop VM, which we stopped after validating it, for the purposes of this report.

ubuntu@hpc-03:~$ lxc config show windows11-deleteme --expanded
architecture: x86_64
config:
  limits.cpu: "4"
  limits.memory: 16GiB
  volatile.cloud-init.instance-id: 1fd758d6-e3ae-4fa0-b89a-e4f1f3297aa7
  volatile.eth-1.host_name: tap4f636280
  volatile.eth-1.hwaddr: 00:16:3e:f3:88:31
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: 3f8f4754-053a-4aa5-ab6c-046a1e41103c
  volatile.uuid.generation: 3f8f4754-053a-4aa5-ab6c-046a1e41103c
  volatile.vsock_id: "1757132482"
devices:
  eth-1:
    network: default
    type: nic
  gpu-1:
    gputype: physical
    pci: 0000:81:00.0
    type: gpu
  root:
    path: /
    pool: remote
    size: 128GiB
    type: disk
ephemeral: false
profiles:
- Windows
stateful: false
description: deleteme

GPU details

  • GPU model: NVIDIA Ampere A2, PCIe, 60W, 16GB, Passive
  • the NVIDIA driver was installed on all cluster nodes (Ubuntu Server) before launching the VM
  • on an Ubuntu Desktop 22.04 VM, attaching the GPU and starting the instance went OK
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  550.120  Fri Sep 13 10:10:01 UTC 2024
GCC version: 

$ nvidia-detector
nvidia-driver-550
$ nvidia-smi
Wed Feb 26 12:41:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A2                      Off |   00000000:81:00.0 Off |                    0 |
|  0%   26C    P8              5W /   60W |       1MiB /  15356MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
$ sudo lsmod | grep nvidia
nvidia_uvm           4685824  0
nvidia_drm             98304  0
nvidia_modeset       1343488  1 nvidia_drm
nvidia              54226944  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        311296  4 mgag200,nvidia_drm
drm                   622592  5 drm_kms_helper,nvidia,mgag200,nvidia_drm
$ lxc info --resources | grep -i nvidia -B10 -A10
   ...
  Card 1:
    NUMA node: 1
    Vendor: NVIDIA Corporation (10de)
    Product: GA107GL [A2 / A16] (25b6)
    PCI address: 0000:81:00.0
    Driver: nvidia (550.120)
    DRM:
      ID: 1
      Card: card1 (226:1)
      Control: controlD65 (226:1)
      Render: renderD128 (226:128)
    NVIDIA information:
      Architecture: 8.6
      Brand: Nvidia
      Model: NVIDIA A2
      CUDA Version: 12.4
      NVRM Version: 550.120
      UUID: GPU-4a335844-1d7d-07e8-a2c1-f750fec0994f
    SR-IOV information:
      Current number of VFs: 0
      Maximum number of VFs: 16

Steps to reproduce

The first steps were done through the LXD UI (served from node3/hpc-03).
The TPM device is removed after installation because the VM instance sometimes gets stuck on the start operation, and removing the TPM solved it (we will re-validate this with the customer and open a separate issue if needed).

  • installed a fresh Windows 11 24H2, with TPM device attached; OK
  • turned the VM off with TPM attached; OK
  • turned the VM on with TPM attached; OK
  • shut down Windows, detach TPM, start Windows; OK
  • shut down Windows, attach GPU, start Windows; KO, cannot start instance
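
The failing step was performed through the LXD UI. A rough CLI equivalent is sketched below, assuming the instance is already stopped and does not yet have a device named gpu-1. The device name, gputype and pci values are taken from the instance config in this report; the attach_gpu_and_start function and the LXC override variable are hypothetical helpers added for illustration.

```shell
# Sketch: attach a physical GPU by PCI address to a stopped instance, then start it.
# LXC can be pointed at a stub for a dry run; it defaults to the real client.
LXC="${LXC:-lxc}"

attach_gpu_and_start() {
    instance="$1"
    pci_addr="$2"

    # Add a "gpu" device named gpu-1 in physical (passthrough) mode.
    "$LXC" config device add "$instance" gpu-1 gpu \
        gputype=physical pci="$pci_addr" || return 1

    # Start the instance; a non-zero exit status here is the failure in question.
    "$LXC" start "$instance"
}
```

For this report the call would be attach_gpu_and_start windows11-deleteme 0000:81:00.0. Reproducing from the CLI rather than the UI makes it easier to capture the exact error text and exit status.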

some more context

  • tests for this bug report were done on a fresh Windows VM, windows11-deleteme, with a new instance config, following the official tutorial
  • microcloud status was healthy
  • windows instance config: identical to the lxc config show windows11-deleteme --expanded output under "Issue description"
  • error output:
Error: Failed to run: forklimits limit=memlock:unlimited:unlimited fd=3 fd=4 fd=5 fd=6 -- /snap/lxd/32455/bin/qemu-system-x86_64 -S -name Win11 -uuid e894bd88-f281-43b0-b553-abefef07dc94 -daemonize -cpu host,hv_passthrough -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/Win11/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/Win11/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/Win11/qemu.pid -D /var/snap/lxd/common/lxd/logs/Win11/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd: : exit status 1
Try `lxc info --show-log Win11` for more info
Instance update failed. Failed to write backup file: Failed to run: rbd --id admin --cluster ceph --pool lxd_remote unmap virtual-machine_Win11: exit status 16 (rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy)

Information to attach

  • Any relevant kernel output (dmesg)
  • Container log (lxc info NAME --show-log)
  • Container configuration (lxc config show NAME --expanded)
  • Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)
  • Output of the client with --debug
  • Output of the daemon with --debug (or use lxc monitor while reproducing the issue)
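
The items in the checklist above can be gathered in one pass. The following is a minimal sketch, assuming the snap log location quoted in this report; the collect_lxd_diagnostics function and the LXC and LXD_LOG_DIR override variables are hypothetical names introduced here.

```shell
# Sketch: collect the diagnostics requested by the issue template into one directory.
# LXC and LXD_LOG_DIR are overridable so the function can be dry-run against stubs.
LXC="${LXC:-lxc}"
LXD_LOG_DIR="${LXD_LOG_DIR:-/var/snap/lxd/common/lxd/logs}"

collect_lxd_diagnostics() {
    instance="$1"
    outdir="$2"
    mkdir -p "$outdir" || return 1

    # Kernel messages; reading them may require root on some systems.
    dmesg > "$outdir/dmesg.txt" 2>&1 || true

    # Instance log and expanded configuration, errors included in the capture.
    "$LXC" info "$instance" --show-log > "$outdir/info.txt" 2>&1 || true
    "$LXC" config show "$instance" --expanded > "$outdir/config.txt" 2>&1 || true

    # Daemon log and per-instance QEMU log, copied only if they are readable.
    for f in "$LXD_LOG_DIR/lxd.log" "$LXD_LOG_DIR/$instance/qemu.log"; do
        if [ -r "$f" ]; then
            cp "$f" "$outdir/"
        fi
    done

    ls "$outdir"
}
```

In a cluster, the per-instance logs live on the member that hosts the instance, so this would need to run on the node given as Location in lxc info (hpc-04 here) rather than on the node where the command was typed.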
@tomponline
Member

@gabrielmougard please can you see if this helps #14381 ?

@tomponline tomponline added this to the lxd-6.4 milestone Apr 10, 2025
@mihalicyn
Member

Hello @wideawakening,

please can you clarify what specifically "cannot start instance" means in the step "shut down Windows, attach GPU, start Windows"?
Does LXD throw an error on the lxc start <instance_name> command, or is it just that Windows doesn't boot anymore?

I was able to reproduce a case where Windows 11 does not boot after attaching an external GPU when the NVIDIA display driver is installed inside the guest. In my case the symptom is that the Windows boot process gets stuck at the UEFI firmware logo with the Windows loading spinner. Are you experiencing the same symptoms?
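
One way to tell the two cases apart is to look at the exit status of lxc start: a non-zero status means LXD itself refused or failed to start the instance, while a zero status with a hung guest points at the boot process inside Windows. A minimal sketch follows; the check_start function and the LXC override variable are hypothetical names, and the output strings are arbitrary.

```shell
# Sketch: classify a start attempt as an LXD-level failure or a clean start.
# LXC can be pointed at a stub; it defaults to the real client.
LXC="${LXC:-lxc}"

check_start() {
    instance="$1"
    # Capture stdout and stderr so the error text can be pasted into the issue.
    if out=$("$LXC" start "$instance" 2>&1); then
        echo "started: LXD reported no error for $instance"
    else
        echo "failed: $out"
    fi
}
```

If the result is started but Windows never finishes booting, attaching to the graphical console with lxc console <instance_name> --type=vga should show whether the guest is stuck at the firmware logo and spinner described above.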
