can't attach gpu to windows instance #15294

wideawakening · 2025-03-31T18:06:52Z

Please confirm

I have searched existing issues to check if an issue already exists for the bug I encountered.

Distribution

Ubuntu server

Distribution version

22.04.5 LTS

Output of "snap list --all lxd core20 core22 core24 snapd"

ubuntu@hpc-01:~/offline_snaps$ snap list
Name        Version                 Rev    Tracking  Publisher      Notes
core        16-2.61.4-20240607      17200  -         canonical✓     core
core22      20250110                1748   -         canonical✓     base
core24      20240920                609    -         canonical✓     base
lxd         5.21.3-75def3c          32455  -         canonical✓     -
microceph   19.2.0+snap2fbf0bad05   1271   -         canonical✓     -
microcloud  2.1.0-3e8b183           1144   -         canonical✓     -
microovn    24.03.2+snapa2c59c105b  667    -         canonical✓     -
snapd       2.67                    23545  -         canonical✓     snapd

Output of "lxc info" or system info if it fails

ubuntu@hpc-03:~$ lxc info --show-log windows11-deleteme
Name: windows11-deleteme
Status: STOPPED
Type: virtual-machine
Architecture: x86_64
Location: hpc-04
Created: 2025/03/14 11:59 CET
Last Used: 2025/03/20 12:14 CET
Error: open /var/snap/lxd/common/lxd/logs/windows11-deleteme/qemu.log: no such file or directory

Issue description

One of our customers is trying to attach a GPU to an existing Windows 11 VM, but can't make it work.
GPU attachment does work on an Ubuntu Desktop VM, which we stopped after validating it, for the purpose of this report.

ubuntu@hpc-03:~$ lxc config show windows11-deleteme --expanded
architecture: x86_64
config:
  limits.cpu: "4"
  limits.memory: 16GiB
  volatile.cloud-init.instance-id: 1fd758d6-e3ae-4fa0-b89a-e4f1f3297aa7
  volatile.eth-1.host_name: tap4f636280
  volatile.eth-1.hwaddr: 00:16:3e:f3:88:31
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: 3f8f4754-053a-4aa5-ab6c-046a1e41103c
  volatile.uuid.generation: 3f8f4754-053a-4aa5-ab6c-046a1e41103c
  volatile.vsock_id: "1757132482"
devices:
  eth-1:
    network: default
    type: nic
  gpu-1:
    gputype: physical
    pci: 0000:81:00.0
    type: gpu
  root:
    path: /
    pool: remote
    size: 128GiB
    type: disk
ephemeral: false
profiles:
- Windows
stateful: false
description: deleteme

GPU details

GPU model: NVIDIA Ampere A2, PCIe, 60W, 16GB Passive
The driver was installed on all cluster nodes (Ubuntu Server) before launching the VM.
On an Ubuntu Desktop 22.04 VM, attaching the GPU and starting the instance went OK.

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  550.120  Fri Sep 13 10:10:01 UTC 2024
GCC version: 

$ nvidia-detector
nvidia-driver-550

$ nvidia-smi
Wed Feb 26 12:41:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A2                      Off |   00000000:81:00.0 Off |                    0 |
|  0%   26C    P8              5W /   60W |       1MiB /  15356MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

$ sudo lsmod | grep nvidia
nvidia_uvm           4685824  0
nvidia_drm             98304  0
nvidia_modeset       1343488  1 nvidia_drm
nvidia              54226944  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        311296  4 mgag200,nvidia_drm
drm                   622592  5 drm_kms_helper,nvidia,mgag200,nvidia_drm

$ lxc info --resources | grep -i nvidia -B10 -A10
   ...
  Card 1:
    NUMA node: 1
    Vendor: NVIDIA Corporation (10de)
    Product: GA107GL [A2 / A16] (25b6)
    PCI address: 0000:81:00.0
    Driver: nvidia (550.120)
    DRM:
      ID: 1
      Card: card1 (226:1)
      Control: controlD65 (226:1)
      Render: renderD128 (226:128)
    NVIDIA information:
      Architecture: 8.6
      Brand: Nvidia
      Model: NVIDIA A2
      CUDA Version: 12.4
      NVRM Version: 550.120
      UUID: GPU-4a335844-1d7d-07e8-a2c1-f750fec0994f
    SR-IOV information:
      Current number of VFs: 0
      Maximum number of VFs: 16
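
The output above shows the card bound to the host `nvidia` driver. As a possible extra diagnostic (not something run for this report), the sketch below reads the standard Linux sysfs paths to show which kernel driver currently owns a PCI device and which devices share its IOMMU group. The function names and the `SYSFS_ROOT` override are hypothetical helpers for illustration, not part of LXD.

```shell
# Sketch: inspect a PCI device through sysfs.
# SYSFS_ROOT is a hypothetical override (useful for testing); on a host it is /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the name of the kernel driver bound to the device, or "none".
pci_driver() {
    dev="$SYSFS_ROOT/bus/pci/devices/$1"
    if [ -L "$dev/driver" ]; then
        # The "driver" entry is a symlink whose last path component is the driver name.
        basename "$(readlink "$dev/driver")"
    else
        echo "none"
    fi
}

# List the PCI addresses that share the device's IOMMU group (empty if no group).
iommu_group_members() {
    ls "$SYSFS_ROOT/bus/pci/devices/$1/iommu_group/devices" 2>/dev/null
}
```

For the card in this report the calls would be `pci_driver 0000:81:00.0` and `iommu_group_members 0000:81:00.0`, run on the node that hosts the instance.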

Steps to reproduce

The first steps were done through the LXD UI (served from node3/hpc-03).
The TPM device is removed after installation because the VM instance sometimes gets stuck on the start operation, and removing the TPM solved that (we will re-validate this with the customer and open a separate issue if needed).

installed fresh Windows 11 24H2, with tpm device ; ok
turned VM off with tpm attached; ok
turned VM on with tpm attached; ok
shut down Windows, tpm detached, start windows; ok
shut down Windows, attach GPU, start Windows ; ko cannot start instance
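
The steps in this report were done through the LXD UI, so the following is only a sketch of what the last step might look like from the CLI. The device name `gpu-1`, the `gputype=physical` option and the PCI address come from the instance config in this report; the `attach_gpu_and_start` helper and the `LXC` variable are hypothetical.

```shell
# Sketch: CLI equivalent of "shut down Windows, attach GPU, start Windows".
# LXC defaults to the real client; set LXC=echo to print the commands instead.
LXC="${LXC:-lxc}"

attach_gpu_and_start() {
    instance="$1"   # e.g. windows11-deleteme
    pci="$2"        # e.g. 0000:81:00.0

    # Stop the VM; ignore the error if it is already stopped.
    $LXC stop "$instance" 2>/dev/null || true
    # Add a physical GPU device pinned to one PCI address.
    $LXC config device add "$instance" gpu-1 gpu gputype=physical pci="$pci" || return 1
    # Start the VM; this is the step reported as failing.
    $LXC start "$instance"
}
```

Running it with `LXC=echo` first shows the exact commands without touching the cluster.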

some more context

The tests for this bug report were done on a fresh Windows VM, windows11-deleteme, with a new instance config, following the official tutorial.
MicroCloud status was healthy.
Windows instance config:

ubuntu@hpc-03:~$ lxc config show windows11-deleteme --expanded
architecture: x86_64
config:
  limits.cpu: "4"
  limits.memory: 16GiB
  volatile.cloud-init.instance-id: 1fd758d6-e3ae-4fa0-b89a-e4f1f3297aa7
  volatile.eth-1.host_name: tap4f636280
  volatile.eth-1.hwaddr: 00:16:3e:f3:88:31
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: 3f8f4754-053a-4aa5-ab6c-046a1e41103c
  volatile.uuid.generation: 3f8f4754-053a-4aa5-ab6c-046a1e41103c
  volatile.vsock_id: "1757132482"
devices:
  eth-1:
    network: default
    type: nic
  gpu-1:
    gputype: physical
    pci: 0000:81:00.0
    type: gpu
  root:
    path: /
    pool: remote
    size: 128GiB
    type: disk
ephemeral: false
profiles:
- Windows
stateful: false
description: deleteme

On the node where the instance was located, we ran sudo journalctl -u snap.lxd.daemon -n 300 and tail -n 300 /var/snap/lxd/common/lxd/logs/lxd.log, but found nothing relevant there, except some noise related to "feat(ux,warning messages): spam on Error getting disk usage / Cannot get disk usage of unmounted volume when ceph.rbd.du is false" (#15254).
Probably not related, but in case it gives a hint: in some earlier tests with the same Win11 image, but on another instance, we got:

Error: Failed to run: forklimits limit=memlock:unlimited:unlimited fd=3 fd=4 fd=5 fd=6 -- /snap/lxd/32455/bin/qemu-system-x86_64 -S -name Win11 -uuid e894bd88-f281-43b0-b553-abefef07dc94 -daemonize -cpu host,hv_passthrough -nographic -serial chardev:console -nodefaults -no-user-config -sandbox on,obsolete=deny,elevateprivileges=allow,spawn=allow,resourcecontrol=deny -readconfig /var/snap/lxd/common/lxd/logs/Win11/qemu.conf -spice unix=on,disable-ticketing=on,addr=/var/snap/lxd/common/lxd/logs/Win11/qemu.spice -pidfile /var/snap/lxd/common/lxd/logs/Win11/qemu.pid -D /var/snap/lxd/common/lxd/logs/Win11/qemu.log -smbios type=2,manufacturer=Canonical Ltd.,product=LXD -runas lxd: : exit status 1
Try `lxc info --show-log Win11` for more info

Instance update failed. Failed to write backup file: Failed to run: rbd --id admin --cluster ceph --pool lxd_remote unmap virtual-machine_Win11: exit status 16 (rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy)

Information to attach

Any relevant kernel output (dmesg)
Container log (lxc info NAME --show-log)
Container configuration (lxc config show NAME --expanded)
Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)
Output of the client with --debug
Output of the daemon with --debug (or use lxc monitor while reproducing the issue)
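
A rough sketch of how the first four items of this checklist could be gathered into one directory for attaching. `collect_lxd_info` and its output file names are hypothetical, not an LXD tool; the daemon log path is the snap path used in this report.

```shell
# Sketch: collect checklist items for an instance into a directory.
# LXD_LOG defaults to the snap daemon log path mentioned in this report.
LXD_LOG="${LXD_LOG:-/var/snap/lxd/common/lxd/logs/lxd.log}"

collect_lxd_info() {
    instance="$1"
    outdir="$2"
    mkdir -p "$outdir" || return 1

    # Kernel messages; may need root depending on kernel.dmesg_restrict.
    dmesg > "$outdir/dmesg.txt" 2>&1
    # Instance log and expanded configuration; errors are captured in the files too.
    lxc info "$instance" --show-log > "$outdir/info.txt" 2>&1
    lxc config show "$instance" --expanded > "$outdir/config.yaml" 2>&1
    # Main daemon log; skipped silently if it is not readable.
    cp "$LXD_LOG" "$outdir/lxd.log" 2>/dev/null
    # Show what was collected.
    ls "$outdir"
}
```

Example: `collect_lxd_info windows11-deleteme ./lxd-report`, run on the cluster member where the instance is located.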


tomponline · 2025-04-10T14:20:58Z

@gabrielmougard please can you see if this helps #14381 ?

mihalicyn · 2025-04-16T15:49:06Z

Hello @wideawakening,

please can you clarify what specifically "shut down Windows, attach GPU, start Windows ; ko cannot start instance" means?
Does LXD throw an error on the lxc start <instance_name> command, or does Windows just not boot anymore?

I was able to reproduce a behavior where Windows 11 does not boot after attaching an external GPU when the NVIDIA display driver is installed inside the VM. In my case the symptom is that the Windows boot process gets stuck at the UEFI firmware logo and the Windows loading spinner. Are you experiencing the same symptoms?
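
One way to tell the two cases apart from the command line is sketched below: if `lxc start` itself returns an error, LXD (or QEMU) refused to start the VM; if it succeeds and `lxc info` reports the instance as running, the problem is more likely inside the guest boot. The `classify_start` helper and the `LXC` variable are hypothetical; the `Status:` line format is the one shown in the `lxc info` output earlier in this issue.

```shell
# Sketch: distinguish "lxc start fails" from "VM starts but the guest hangs".
LXC="${LXC:-lxc}"

classify_start() {
    instance="$1"
    if ! $LXC start "$instance"; then
        # The start command itself failed; its error message is on stderr.
        echo "start-failed"
        return 1
    fi
    # Read the state LXD reports, e.g. RUNNING or STOPPED.
    status="$($LXC info "$instance" | awk '/^Status:/ {print $2}')"
    echo "started status=$status"
}
```

A result of `started status=RUNNING` combined with a stuck boot screen on the VM console would match the symptom described above.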

tomponline added this to the lxd-6.4 milestone Apr 10, 2025
