No devices were found in openshift #641

Open
garyyang85 opened this issue Dec 21, 2023 · 2 comments
@garyyang85

1. Quick Debug Information

  • OS/Version: Red Hat Enterprise Linux CoreOS release 4.12
  • Kernel Version: 4.18.0-372.69.1.el8_6.x86_64
  • Container Runtime Type/Version: CRI-O
  • OpenShift Version: 4.12.29
  • GPU Operator Version: 23.9.1

2. Issue or feature description

The nvidia-driver-daemonset-xx pod reports "Startup probe failed: No devices were found" in its events, but I can see that the V100 GPU is present on the OS. Below is the lspci output:

03:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
0b:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
13:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
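
(For reference, a minimal sketch of inspecting the failing driver pod with the oc CLI; the nvidia-gpu-operator namespace, the pod label, and the container name are assumptions based on a default GPU Operator install:)

# List the driver daemonset pods (adjust the namespace if the operator was installed elsewhere)
oc get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset

# Show the startup probe failure in the pod events, then pull the driver container logs
oc describe pod -n nvidia-gpu-operator <nvidia-driver-daemonset-pod>
oc logs -n nvidia-gpu-operator <nvidia-driver-daemonset-pod> -c nvidia-driver-ctr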

3. Steps to reproduce the issue

Deploy the GPU operator with the ClusterPolicy definition below.

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  creationTimestamp: '2023-12-20T13:06:29Z'
  generation: 2
  name: gpu-cluster-policy
  resourceVersion: '275859864'
  uid: 71e06b17-5b47-4ab0-aae9-8034a2e30e42
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      configMapName: ''
      nlsEnabled: false
    enabled: true
    certConfig:
      name: ''
    repository: nvcr.io/nvidia
    kernelModuleConfig:
      name: ''
    usePrecompiled: false
    upgradePolicy:
      autoUpgrade: false
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    version: 535.104.05
    virtualTopology:
      config: ''
    image: driver
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
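
(For completeness, a minimal sketch of applying the ClusterPolicy above with the oc CLI, assuming it is saved as gpu-cluster-policy.yaml and the operator runs in the default nvidia-gpu-operator namespace:)

oc apply -f gpu-cluster-policy.yaml

# Watch the operator roll out the driver, toolkit, and device-plugin daemonsets
oc get pods -n nvidia-gpu-operator -w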
@cdesiniotis
Contributor

@garyyang85 "No devices were found" typically indicates that GPU initialization failed. Can you get the system logs by running dmesg | grep -i nvrm on the host?
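
(For reference, a sketch of collecting that output on an OpenShift/RHCOS node through a debug pod; the node name is a placeholder:)

oc debug node/<gpu-node-name>
# Inside the debug shell, switch to the host root filesystem and grep the kernel ring buffer
chroot /host
dmesg | grep -i nvrm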

@fzhan

fzhan commented Jun 4, 2024

I have "[189160.303788] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.15" from dmesg | grep -i nvrm
