GSP Firmware not loaded properly, make nvidia-vgpu-manager-daemonset CrashLoopBackOff #1278

Open
rjhaikal opened this issue Feb 17, 2025 · 1 comment


rjhaikal commented Feb 17, 2025

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version: Linux 5.15.0-117-generic
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd://1.7.17-k3s1.28
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): k3s v1.28.11+k3s1
  • GPU Operator Version: gpu-operator-v24.6.0

2. Issue or feature description

When I install the vgpu-manager, the reported error is /usr/local/bin/nvidia-driver: line 1: popd: directory stack empty. When I check the dmesg log on the node, it shows Direct firmware load for nvidia/550.54.10/gsp_ga10x.bin failed with error -2. It looks like the GSP firmware is not being loaded properly.

Can somebody help me resolve this error, ideally with a step-by-step reference?

3. Information to attach (optional if deemed irrelevant)

  1. Logs from the nvidia-vgpu-manager-daemonset pod
+ DRIVER_VERSION=550.54.10
+ DRIVER_ARCH=x86_64
+ DRIVER_RESET_RETRIES=10
++ uname -r
+ KERNEL_VERSION=5.15.0-117-generic
+ RUN_DIR=/run/nvidia
+ export DEBIAN_FRONTEND=noninteractive
+ DEBIAN_FRONTEND=noninteractive
+ '[' 1 -eq 0 ']'
+ command=init
+ shift
+ case "${command}" in
++ getopt -l accept-license -o a --
+ options=' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --'
++ set -- --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=5.15.0-117-generic
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ init
+ trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
+ trap _shutdown EXIT
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_vgpu_vfio_refs=0
+ echo 'Stopping NVIDIA vGPU Manager...'
+ '[' -f /var/run/nvidia-vgpu-mgr/nvidia-vgpu-mgr.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
+ '[' -f /sys/module/nvidia_vgpu_vfio/refcnt ']'
Stopping NVIDIA vGPU Manager...
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia/refcnt ']'
+ nvidia_refs=0
+ rmmod_args+=("nvidia")
+ '[' 1 -gt 0 ']'
+ rmmod nvidia
+ '[' 0 '!=' 0 ']'
+ return 0
+ _unmount_rootfs
Unmounting NVIDIA driver rootfs...
+ echo 'Unmounting NVIDIA driver rootfs...'
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ umount -l -R /run/nvidia/driver
Updating the package cache...
+ _update_package_cache
+ '[' '' '!=' builtin ']'
+ echo 'Updating the package cache...'
+ apt-get -qq update
+ _resolve_kernel_version
++ apt-cache show linux-headers-5.15.0-117-generic
++ sed -nE 's/^Version:\s+(([0-9]+\.){2}[0-9]+)[-.]([0-9]+).*/\1-\3/p'
++ head -1
+ local version=5.15.0-117
++ echo 5.15.0-117-generic
++ sed 's/[^a-z]*//'
++ grep -Ev '^generic|virtual'
+ local flavor=
+ echo 'Resolving Linux kernel version...'
+ '[' -z 5.15.0-117 ']'
Resolving Linux kernel version...
+ KERNEL_VERSION=5.15.0-117-generic
+ echo 'Proceeding with Linux kernel version 5.15.0-117-generic'
+ return 0
Proceeding with Linux kernel version 5.15.0-117-generic
+ _install_prerequisites
++ mktemp -d
+ local tmp_dir=/tmp/tmp.kG79fBJSB3
+ trap 'popd; rm -rf /tmp/tmp.kG79fBJSB3' RETURN EXIT
+ pushd /tmp/tmp.kG79fBJSB3
/tmp/tmp.kG79fBJSB3 /driver
+ rm -rf /lib/modules/5.15.0-117-generic
+ mkdir -p /lib/modules/5.15.0-117-generic/proc
+ echo 'Installing Linux kernel headers...'
Installing Linux kernel headers...
+ apt-get -qq install --no-install-recommends linux-headers-5.15.0-117-generic
+ echo 'Installing Linux kernel module files...'
+ apt-get -qq download linux-image-5.15.0-117-generic
Installing Linux kernel module files...
+ dpkg -x linux-image-5.15.0-117-generic_5.15.0-117.127_amd64.deb .
+ mv lib/modules/5.15.0-117-generic/modules.builtin lib/modules/5.15.0-117-generic/modules.builtin.modinfo lib/modules/5.15.0-117-generic/modules.order /lib/modules/5.15.0-117-generic
+ mv lib/modules/5.15.0-117-generic/kernel /lib/modules/5.15.0-117-generic
+ depmod 5.15.0-117-generic
+ echo 'Generating Linux kernel version string...'
Generating Linux kernel version string...
+ file boot/vmlinuz-5.15.0-117-generic
+ awk 'BEGIN { RS="," } $1=="version" { print $2 }' -
+ '[' -z 5.15.0-117-generic ']'
+ mv version /lib/modules/5.15.0-117-generic/proc
/driver
++ popd
++ rm -rf /tmp/tmp.kG79fBJSB3
Creating '/dev/char' directory
+ _create_dev_char_directory
+ '[' '!' -d /dev/char ']'
+ echo 'Creating '\''/dev/char'\'' directory'
+ mkdir -p /dev/char
+ _install_driver
++ mktemp -d
+ local tmp_dir=/tmp/tmp.GGwiHUdShK
+ sh NVIDIA-Linux-x86_64-550.54.10-vgpu-kvm.run --ui=none --no-questions --tmpdir /tmp/tmp.GGwiHUdShK --no-systemd
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.54.10......................................................................................................................................................................................................................................................................................................................................................................................................................................................................

Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 128 CPUs online; setting concurrency level to 32.
Unable to locate any tools for listing initramfs contents.
Unable to scan initramfs: no tool found
This system requires use of the NVIDIA open kernel modules; these will be selected by default.
Installing NVIDIA driver version 550.54.10.
Performing CC sanity check with CC="/usr/bin/cc".
Performing CC check.
Kernel source path: '/lib/modules/5.15.0-117-generic/build'

Kernel output path: '/lib/modules/5.15.0-117-generic/build'

Performing Compiler check.
Performing Dom0 check.
Performing Xen check.
Performing PREEMPT_RT check.
Performing vgpu_kvm check.
Cleaning kernel module build directory.
Building kernel modules: 

  [##############################] 100%
Kernel module compilation complete.
Kernel messages:
[ 3941.692907] nvidia 0000:03:00.0: driver left SR-IOV enabled after remove
[ 3941.693251] nvidia 0000:64:00.0: driver left SR-IOV enabled after remove
[ 3941.693477] nvidia 0000:63:00.0: driver left SR-IOV enabled after remove
[ 3941.693818] nvidia 0000:e4:00.0: driver left SR-IOV enabled after remove
[ 3941.694212] nvidia 0000:e3:00.0: driver left SR-IOV enabled after remove
[ 3941.694639] NVOC: __nvoc_objDelete: Child class OBJIOVASPACE not freed from parent class OBJVMM.
[ 3941.694790] nvidia-nvlink: Unregistered Nvlink Core, major device number 499
[ 3989.137665] nvidia-nvlink: Nvlink Core is being initialized, major device number 499
[ 3989.137675] NVRM: The NVIDIA probe routine was not called for 256 device(s).
[ 3989.570567] NVRM: This can occur when another driver was loaded and 
               NVRM: obtained ownership of the NVIDIA device(s).
[ 3989.570570] NVRM: Try unloading the conflicting kernel module (and/or
               NVRM: reconfigure your kernel without the conflicting
               NVRM: driver(s)), then try loading the NVIDIA kernel module
               NVRM: again.
[ 3989.570590] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  550.54.10  Release Build  (dvs-builder@U16-I3-B13-2-1)  Wed Feb 14 16:21:59 UTC 2024
[ 3989.716774] nvidia 0000:84:00.0: driver left SR-IOV enabled after remove
[ 3989.717546] nvidia 0000:83:00.0: driver left SR-IOV enabled after remove
[ 3989.718001] nvidia 0000:04:00.0: driver left SR-IOV enabled after remove
[ 3989.718288] nvidia 0000:03:00.0: driver left SR-IOV enabled after remove
[ 3989.718588] nvidia 0000:64:00.0: driver left SR-IOV enabled after remove
[ 3989.719319] nvidia 0000:63:00.0: driver left SR-IOV enabled after remove
[ 3989.719774] nvidia 0000:e4:00.0: driver left SR-IOV enabled after remove
[ 3989.720154] nvidia 0000:e3:00.0: driver left SR-IOV enabled after remove
[ 3989.720839] nvidia-nvlink: Unregistered Nvlink Core, major device number 499
Searching for conflicting files:: Searching

  [##############################] 100%
Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (550.54.10):: Installing

  [#                             ]   0%
Unable to determine whether NVIDIA kernel modules are present in the initramfs. Existing NVIDIA kernel modules in the initramfs, if any, may interfere with the newly installed driver.

  [##############################] 100%
Driver file installation is complete.
Running distribution scripts: Executing /usr/lib/nvidia/post-install

  [##############################] 100%
Running post-install sanity check:: Checking

  [##############################] 100%
Post-install sanity check passed.

Installation of the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 550.54.10) is now complete.

+ _load_driver
+ /usr/bin/nvidia-vgpud
+ '[' '!' -f /sys/module/nvidia_vgpu_vfio/refcnt ']'
+ /usr/bin/nvidia-vgpu-mgr
+ '[' '!' -f /sys/module/nvidia/refcnt ']'
+ return 0
+ _mount_rootfs
+ echo 'Mounting NVIDIA driver rootfs...'
+ mount -o remount,rw /sys
Mounting NVIDIA driver rootfs...
+ mount --make-runbindable /sys
+ mount --make-private /sys
+ mkdir -p /run/nvidia/driver
+ mount --rbind / /run/nvidia/driver
+ _enable_vfs
+ local retry
+ (( retry = 0  ))
+ (( retry <= 10  ))
+ /usr/lib/nvidia/sriov-manage -e ALL
GPU at 0000:03:00.0 already has VFs enabled.
GPU at 0000:04:00.0 already has VFs enabled.
GPU at 0000:63:00.0 already has VFs enabled.
GPU at 0000:64:00.0 already has VFs enabled.
GPU at 0000:83:00.0 already has VFs enabled.
GPU at 0000:84:00.0 already has VFs enabled.
GPU at 0000:e3:00.0 already has VFs enabled.
GPU at 0000:e4:00.0 already has VFs enabled.
+ return 0
+ pgrep nvidia-vgpu-mgr
+ nvidia-vgpud
+ echo 'Restarting nvidia-vgpu-mgr after previously killed'
+ nvidia-vgpu-mgr
Restarting nvidia-vgpu-mgr after previously killed
+ set +x
Done, now waiting for signal
ERROR: nvidia-vgpu-mgr daemon is no longer running. Exiting.
/usr/local/bin/nvidia-driver: line 1: popd: directory stack empty
  2. Dmesg log
Direct firmware load for nvidia/550.54.10/gsp_ga10x.bin failed with error -2
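
For context on the line above: error -2 is -ENOENT, meaning the kernel's firmware loader did not find the file in any directory it searched. By default that search covers /lib/firmware (plus its updates and per-kernel subdirectories) and any custom directory set through the firmware_class.path parameter. The sketch below checks whether the file exists under a list of candidate roots. The helper name find_firmware is hypothetical and not part of the GPU Operator, and checking /run/nvidia/driver/lib/firmware is an assumption based on the rootfs mount shown in the pod log.

```shell
# Sketch: report where (if anywhere) a firmware file exists.
# find_firmware is a hypothetical helper, not part of the GPU Operator.
# Usage: find_firmware RELATIVE_PATH ROOT [ROOT...]
find_firmware() {
    rel=$1
    shift
    found=1                        # exit status: 1 until the file is seen
    for root in "$@"; do
        if [ -f "$root/$rel" ]; then
            echo "found: $root/$rel"
            found=0
        else
            echo "missing: $root/$rel"
        fi
    done
    return "$found"
}
```

On the affected node this could be called as `find_firmware nvidia/550.54.10/gsp_ga10x.bin /lib/firmware /run/nvidia/driver/lib/firmware`, and `cat /sys/module/firmware_class/parameters/path` shows whether a custom search directory is configured. If the file is reported missing everywhere, that would be consistent with the -2 error, although it does not by itself explain why the installer did not place it there.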

  3. Kernel Version: Linux 5.15.0-117-generic

  4. Check GSP Firmware Version (N/A value)
for gpu in /proc/driver/nvidia/gpus/*/information; do
    echo "File: $gpu"
    cat "$gpu"
    echo "-----------------------------"
done
File: /proc/driver/nvidia/gpus/0000:03:00.0/information
Model:           NVIDIA L40S
IRQ:             94
GPU UUID:        GPU-d03ff6db-34c7-dc00-484c-3adc1cc61b03
Video BIOS:      ??.??.??.??.??
Bus Type:        PCIe
DMA Size:        47 bits
DMA Mask:        0x7fffffffffff
Bus Location:    0000:03:00.0
Device Minor:    4
GPU Firmware:    N/A
GPU Excluded:    No
-----------------------------
File: /proc/driver/nvidia/gpus/0000:04:00.0/information
Model:           NVIDIA L40S
IRQ:             58
GPU UUID:        GPU-0418f843-80fe-7d93-cb41-72ecf0a117de
Video BIOS:      ??.??.??.??.??
Bus Type:        PCIe
DMA Size:        47 bits
DMA Mask:        0x7fffffffffff
Bus Location:    0000:04:00.0
Device Minor:    5
GPU Firmware:    N/A
GPU Excluded:    No
-----------------------------
File: /proc/driver/nvidia/gpus/0000:63:00.0/information
Model:           NVIDIA L40S
IRQ:             91
GPU UUID:        GPU-6c383654-2e10-7167-a1e0-fb8e8ba4b7bc
Video BIOS:      ??.??.??.??.??
Bus Type:        PCIe
DMA Size:        47 bits
DMA Mask:        0x7fffffffffff
Bus Location:    0000:63:00.0
Device Minor:    2
GPU Firmware:    N/A
GPU Excluded:    No
-----------------------------
File: /proc/driver/nvidia/gpus/0000:64:00.0/information
Model:           NVIDIA L40S
IRQ:             51
GPU UUID:        GPU-6d0940a2-6511-6aa0-2255-ce36f96b530b
Video BIOS:      ??.??.??.??.??
Bus Type:        PCIe
DMA Size:        47 bits
DMA Mask:        0x7fffffffffff
Bus Location:    0000:64:00.0
Device Minor:    3
GPU Firmware:    N/A
GPU Excluded:    No
-----------------------------
File: /proc/driver/nvidia/gpus/0000:83:00.0/information
Model:           NVIDIA L40S
IRQ:             890
GPU UUID:        GPU-9a36f396-473f-b5b7-ba8c-b6e6c2cfd93e
Video BIOS:      ??.??.??.??.??
Bus Type:        PCIe
DMA Size:        47 bits
DMA Mask:        0x7fffffffffff
Bus Location:    0000:83:00.0
Device Minor:    6
GPU Firmware:    N/A
GPU Excluded:    No
-----------------------------
File: /proc/driver/nvidia/gpus/0000:84:00.0/information
Model:           NVIDIA L40S
IRQ:             70
GPU UUID:        GPU-f6916d2a-2c75-c840-5106-af1e5b80f25c
Video BIOS:      ??.??.??.??.??
Bus Type:        PCIe
DMA Size:        47 bits
DMA Mask:        0x7fffffffffff
Bus Location:    0000:84:00.0
Device Minor:    7
GPU Firmware:    N/A
GPU Excluded:    No
-----------------------------
File: /proc/driver/nvidia/gpus/0000:e3:00.0/information
Model:           NVIDIA L40S
IRQ:             889
GPU UUID:        GPU-71d50d8a-31d7-2028-4bca-e728fe84441c
Video BIOS:      ??.??.??.??.??
Bus Type:        PCIe
DMA Size:        47 bits
DMA Mask:        0x7fffffffffff
Bus Location:    0000:e3:00.0
Device Minor:    0
GPU Firmware:    N/A
GPU Excluded:    No
-----------------------------
File: /proc/driver/nvidia/gpus/0000:e4:00.0/information
Model:           NVIDIA L40S
IRQ:             44
GPU UUID:        GPU-e580ef25-9fe7-74f7-33c9-03bfa563ebb2
Video BIOS:      ??.??.??.??.??
Bus Type:        PCIe
DMA Size:        47 bits
DMA Mask:        0x7fffffffffff
Bus Location:    0000:e4:00.0
Device Minor:    1
GPU Firmware:    N/A
GPU Excluded:    No
-----------------------------
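
The per-GPU dump above is long, and the only field the reporter is pointing at is GPU Firmware, which reads N/A on all eight GPUs. A minimal sketch that condenses the same procfs files to one line per GPU follows. The helper name gpu_firmware_summary is hypothetical, and the optional directory argument exists only so the function can be pointed at a copy of the files.

```shell
# Sketch: print "<PCI address> <GPU Firmware value>" for every GPU directory.
# gpu_firmware_summary is a hypothetical helper; the directory defaults to
# the procfs path used in the loop above.
gpu_firmware_summary() {
    base=${1:-/proc/driver/nvidia/gpus}
    for info in "$base"/*/information; do
        [ -r "$info" ] || continue          # skip if the glob matched nothing
        addr=$(basename "$(dirname "$info")")
        # Fields are "GPU", "Firmware:", "<value>"; awk drops trailing padding.
        fw=$(awk '/^GPU Firmware:/ { print $3 }' "$info")
        echo "$addr ${fw:-unknown}"
    done
}
```

Running `gpu_firmware_summary` with no arguments on the node should print one line per GPU, which makes it easier to compare the state before and after a driver reinstall.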

AvnanRahman commented Mar 7, 2025

I'm having the same issue here. The GPU Operator works fine in passthrough and container modes, but vGPU always fails:

k describe node nodename | grep "nvidia.com"
                    nvidia.com/gpu.deploy.cc-manager=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.sandbox-device-plugin=paused-for-vgpu-change
                    nvidia.com/gpu.deploy.sandbox-validator=paused-for-vgpu-change
                    nvidia.com/gpu.deploy.vgpu-device-manager=true
                    nvidia.com/gpu.deploy.vgpu-manager=true
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.workload.config=vm-vgpu
                    nvidia.com/vgpu.config=L40S-24Q
                    nvidia.com/vgpu.config.state=failed

Pods in the gpu-operator namespace on this node:

kubectl get pod -n gpu-operator --field-selector spec.nodeName=jah1ab07sregxs032 -w
NAME                                               READY   STATUS     RESTARTS      AGE
gpu-operator-node-feature-discovery-worker-wfvz2   1/1     Running    0             4m39s
nvidia-vgpu-device-manager-p4htg                   0/1     Init:0/1   0             4m4s
nvidia-vgpu-manager-daemonset-dgrrs                1/1     Running    2 (64s ago)   4m35s
nvidia-vgpu-manager-daemonset-dgrrs                0/1     Error      2 (5m50s ago)   9m21s
nvidia-vgpu-manager-daemonset-dgrrs                0/1     CrashLoopBackOff   2 (13s ago)     9m33s
nvidia-vgpu-manager-daemonset-dgrrs                1/1     Running            3 (27s ago)     9m47s
