No devices were found in openshift #641

Open
garyyang85 opened this issue Dec 21, 2023 · 2 comments
@garyyang85

1. Quick Debug Information

  • OS/Version: Red Hat Enterprise Linux CoreOS release 4.12
  • Kernel Version: 4.18.0-372.69.1.el8_6.x86_64
  • Container Runtime Type/Version: CRI-O
  • OpenShift Version: 4.12.29
  • GPU Operator Version: 23.9.1

2. Issue or feature description

The nvidia-driver-daemonset-xx pod reports "Startup probe failed: No devices were found" in its events, but I can see that the V100 GPU is present on the OS. Below is the lspci output:

03:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
0b:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
13:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
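
(For reference, a minimal sketch of inspecting the failing driver pod with the oc CLI; the nvidia-gpu-operator namespace, the pod label, and the container name are assumptions based on a default GPU Operator install:)

# List the driver daemonset pods (adjust the namespace if the operator was installed elsewhere)
oc get pods -n nvidia-gpu-operator -l app=nvidia-driver-daemonset

# Show the startup probe failure in the pod events, then pull the driver container logs
oc describe pod -n nvidia-gpu-operator <nvidia-driver-daemonset-pod>
oc logs -n nvidia-gpu-operator <nvidia-driver-daemonset-pod> -c nvidia-driver-ctr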

3. Steps to reproduce the issue

Deploy the GPU operator with the ClusterPolicy definition below.

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  creationTimestamp: '2023-12-20T13:06:29Z'
  generation: 2
  name: gpu-cluster-policy
  resourceVersion: '275859864'
  uid: 71e06b17-5b47-4ab0-aae9-8034a2e30e42
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      configMapName: ''
      nlsEnabled: false
    enabled: true
    certConfig:
      name: ''
    repository: nvcr.io/nvidia
    kernelModuleConfig:
      name: ''
    usePrecompiled: false
    upgradePolicy:
      autoUpgrade: false
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    version: 535.104.05
    virtualTopology:
      config: ''
    image: driver
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
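
(For completeness, a minimal sketch of applying the ClusterPolicy above with the oc CLI, assuming it is saved as gpu-cluster-policy.yaml and the operator runs in the default nvidia-gpu-operator namespace:)

oc apply -f gpu-cluster-policy.yaml

# Watch the operator roll out the driver, toolkit, and device-plugin daemonsets
oc get pods -n nvidia-gpu-operator -w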
@cdesiniotis
Contributor

@garyyang85 "No devices were found" typically indicates that GPU initialization failed. Can you get the system logs by running dmesg | grep -i nvrm on the host?
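
(For reference, a sketch of collecting that output on an OpenShift/RHCOS node through a debug pod; the node name is a placeholder:)

oc debug node/<gpu-node-name>
# Inside the debug shell, switch to the host root filesystem and grep the kernel ring buffer
chroot /host
dmesg | grep -i nvrm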

@fzhan

fzhan commented Jun 4, 2024

I have "[189160.303788] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.54.15" from dmesg | grep -i nvrm
