CDI mode doesn't work for multiple GPUs #15372

Open
3 of 7 tasks
olympichek opened this issue Apr 12, 2025 · 1 comment

@olympichek

Please confirm

  • I have searched existing issues to check if an issue already exists for the bug I encountered.

Distribution

Ubuntu

Distribution version

24.04.2 LTS

Output of "snap list --all lxd core20 core22 core24 snapd"

Name    Version   Rev    Tracking       Publisher   Notes
core22  20250210  1802   latest/stable  canonical✓  base,disabled
core22  20250315  1908   latest/stable  canonical✓  base
core24  20241217  739    latest/stable  canonical✓  base
snapd   2.67      23545  latest/stable  canonical✓  snapd,disabled
snapd   2.67.1    23771  latest/stable  canonical✓  snapd

Output of "lxc info" or system info if it fails

I am trying

Issue description

CDI mode doesn't work with multiple GPUs, although everything works fine with a single GPU. After adding a second GPU device, starting the container fails with the following error:

Error: Failed to start device "gpu1": remove /var/snap/lxd/common/lxd/devices/test/cdi.disk.gpu0.etc-vulkan-icd.d-nvidia_icd.json: device or resource busy
Try `lxc info --show-log test` for more info

The output of lxc info --show-log test:

Name: test
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2025/04/12 16:51 EEST
Last Used: 2025/04/12 16:51 EEST

Log:
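
One way to narrow down the "device or resource busy" error above: unlink(2) returns EBUSY when the path being removed is still a mount point, so a possible (unconfirmed) explanation is that the file created for gpu0 is still bind-mounted when gpu1 starts. The sketch below lists mount points under the instance's devices directory by reading a mountinfo file. The helper name mounts_under is hypothetical. With the snap package, LXD's mounts may only be visible in the daemon's mount namespace, so it may be necessary to pass /proc/PID/mountinfo of the LXD daemon (PID being a placeholder) as the second argument.

```shell
#!/bin/sh
# mounts_under DIR [MOUNTINFO]: print every mount point located at or below DIR.
# Field 5 of each mountinfo line is the mount point (see proc(5)).
# Hypothetical helper for diagnosis only; it is not part of LXD.
mounts_under() {
  awk -v dir="${1%/}" '
    index($5, dir "/") == 1 || $5 == dir { print $5 }
  ' "${2:-/proc/self/mountinfo}"
}

# Directory taken from the error message in this report.
if [ -r /proc/self/mountinfo ]; then
  mounts_under /var/snap/lxd/common/lxd/devices/test
fi
```

If the cdi.disk.gpu0.* path from the error shows up in the output, that would be consistent with the bind-mount guess. An empty result does not rule it out, since the mount may live in another namespace.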

Steps to reproduce

  1. sudo lxc init ubuntu:24.04 test
  2. sudo lxc config device add test gpu0 gpu gputype=physical id=nvidia.com/gpu=0
  3. sudo lxc config device add test gpu1 gpu gputype=physical id=nvidia.com/gpu=1
  4. sudo lxc start test
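
The steps above can be wrapped in a small script so the failure is easy to re-run with a different number of GPUs. This is only a sketch: the run and repro helpers and the DRY_RUN switch are hypothetical names introduced here, and the lxc commands are exactly the ones from this report. By default it only prints the commands. Set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# run CMD...: print the command in dry-run mode (the default), else execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# repro NAME COUNT: create instance NAME, attach gpu0..gpu(COUNT-1) by CDI id,
# then try to start it.
repro() {
  repro_name=$1
  repro_count=$2
  repro_i=0
  run sudo lxc init ubuntu:24.04 "$repro_name"
  while [ "$repro_i" -lt "$repro_count" ]; do
    run sudo lxc config device add "$repro_name" "gpu$repro_i" gpu \
      gputype=physical "id=nvidia.com/gpu=$repro_i"
    repro_i=$((repro_i + 1))
  done
  # If the start fails, show the instance log as the error message suggests.
  run sudo lxc start "$repro_name" || run sudo lxc info --show-log "$repro_name"
}

repro "${1:-test}" "${2:-2}"
```

Running it with a count of 1 and then 2 should show the contrast described in the issue: the single-GPU case starts, and the two-GPU case fails on gpu1.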

Information to attach

  • Any relevant kernel output (dmesg)
  • Instance log (lxc info NAME --show-log)
  • Instance configuration (lxc config show NAME --expanded)
  • Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)
  • Output of the client with --debug
  • Output of the daemon with --debug (or use lxc monitor while reproducing the issue)
@olympichek
Author

The output of lxc config show test --expanded:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 24.04 LTS amd64 (release) (20250403)
  image.label: release
  image.os: ubuntu
  image.release: noble
  image.serial: "20250403"
  image.type: squashfs
  image.version: "24.04"
  volatile.base_image: 9f684552788a49591b1336a37e943296d346e345252cade971377a8d4df4e9c7
  volatile.cloud-init.instance-id: 53d35705-3a57-46b3-adcc-2e85d2070fa8
  volatile.eth0.hwaddr: 00:16:3e:d5:98:15
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: 2d1b54c2-12c3-4c35-9591-f3dd5a0080c6
  volatile.uuid.generation: 2d1b54c2-12c3-4c35-9591-f3dd5a0080c6
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  gpu0:
    gputype: physical
    id: nvidia.com/gpu=0
    type: gpu
  gpu1:
    gputype: physical
    id: nvidia.com/gpu=1
    type: gpu
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
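
To check quickly that both GPU devices really use CDI-style ids, the expanded config can be filtered down to the gpu entries. The function below is a rough line-based sketch, not a YAML parser: gpu_devices is a hypothetical helper name, and it assumes the two-space indentation seen in the output above.

```shell
#!/bin/sh
# gpu_devices: read `lxc config show NAME --expanded` output on stdin and
# print "<device name> <id>" for every device whose type is gpu.
gpu_devices() {
  awk '
    # Print the buffered device if it is a GPU, then reset the buffer.
    function emit() { if (kind == "gpu") print name, id; name = id = kind = "" }
    /^[^ ]/       { emit(); in_dev = ($0 == "devices:"); next }  # top-level key
    !in_dev       { next }                                       # outside devices:
    /^  [^ ].*:$/ { emit(); name = $1; sub(/:$/, "", name); next }
    /^    id: /   { id = $2; next }
    /^    type: / { kind = $2; next }
    END           { emit() }
  '
}
```

Usage would be along the lines of: sudo lxc config show test --expanded | gpu_devices. For the config above this should list gpu0 and gpu1 with their nvidia.com/gpu=N ids.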

@tomponline tomponline added this to the lxd-6.4 milestone Apr 14, 2025
@skozina skozina added the Jira Triggers the synchronization of a GitHub issue in Jira label Apr 15, 2025
@skozina skozina assigned skozina and unassigned skozina Apr 15, 2025
@skozina skozina removed the Jira Triggers the synchronization of a GitHub issue in Jira label Apr 15, 2025