Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Problem]nvidia-driver init fail #1264

Open
g3david405 opened this issue Feb 12, 2025 · 0 comments
Open

[Problem]nvidia-driver init fail #1264

g3david405 opened this issue Feb 12, 2025 · 0 comments

Comments

@g3david405
Copy link

Here is my k8s worker vm information:

NAME="Rocky Linux"
VERSION="9.1 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.1"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.1 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
ROCKY_SUPPORT_PRODUCT_VERSION="9.1"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.1"

kernel version: 5.14.0-503.23.1.el9_5.x86_64

gpu-operator-node-feature-discovery-worker works fine,
However, when the nvidia-driver-daemonset executes nvidia-driver init, the following error occurs:

+ modprobe -a i2c_core ipmi_msghandler ipmi_devintf
modprobe: ERROR: could not insert 'ipmi_msghandler': Invalid argument
modprobe: ERROR: could not insert 'ipmi_devintf': Invalid argument

Then, the container fails and restarts.
Could this be due to dependency mismatches?

Here is the tail of install log

Loading ipmi and i2c_core kernel modules...
+ modprobe -a i2c_core ipmi_msghandler ipmi_devintf
modprobe: ERROR: could not insert 'ipmi_msghandler': Invalid argument
modprobe: ERROR: could not insert 'ipmi_devintf': Invalid argument
+ _shutdown
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ local nvidia_peermem_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
+ '[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
+ '[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager
.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' -f /sys/module/nvidia_peermem/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
+ echo 'Unmounting NVIDIA driver rootfs...'
Unmounting NVIDIA driver rootfs...
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ rm -f /run/nvidia/nvidia-driver.pid /run/kernel/postinst.d/update-nvidia-driver
+ return 0
@g3david405 g3david405 changed the title nvidia-driver init fail [Problem]nvidia-driver init fail Feb 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant