Support of GPU hot-plug #1263

Open
hase1128 opened this issue Feb 12, 2025 · 1 comment
1. Background
Composable Disaggregated Infrastructure (CDI) is an emerging server architecture.
It consists of PCIe switches and PCIe devices connected to those switches, which together work as a resource pool. CDI enables users to configure physical servers dynamically by software definition. As a result, GPUs can be attached to or detached from a server without an OS reboot.
CDI provides a control API, which can be accessed by operators through a CLI or GUI, or by management software.

We plan to make it possible for Kubernetes (K8s) to control CDI. This scenario consists of four steps.

  1. K8s has a ResourceSlice (exposed by the Dynamic Resource Allocation (DRA) feature) representing the free devices in the resource pool (see the sketch below).
  2. When a pod cannot be scheduled because no GPUs are available, but free devices exist in the resource pool, this triggers a request for CDI to attach devices from the resource pool to a worker node.
  3. After the GPUs are attached, the DRA plugin detects them and the unscheduled pod becomes schedulable.
  4. When K8s detects that no pods are using the attached GPUs, it detaches them and returns them to the resource pool.

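For reference, here is a minimal sketch of step 1: listing DRA ResourceSlices to see which devices the pool currently exposes. This assumes the resource.k8s.io/v1beta1 API via client-go; the group version and field names vary between Kubernetes releases, so treat it as illustrative only.

```go
// Sketch: list DRA ResourceSlices and print the devices they advertise.
// Assumes the resource.k8s.io/v1beta1 API group; adjust to your K8s release.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	slices, err := clientset.ResourceV1beta1().ResourceSlices().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, s := range slices.Items {
		fmt.Printf("slice %s: driver=%s devices=%d\n", s.Name, s.Spec.Driver, len(s.Spec.Devices))
		for _, d := range s.Spec.Devices {
			fmt.Printf("  device %s\n", d.Name)
		}
	}
}
```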

2. Problem statement
When we try the above scenario, we face the following problems.

  • In the attach phase (step 3 above), the NVIDIA GPU Operator does not recognize newly attached GPUs (more precisely, the Pods it manages, such as dcgm, do not). As a result, K8s fails to attach the GPUs.
  • In the detach phase (step 4 above), several Pods managed by the GPU Operator (such as nvidia-persistenced) keep the GPUs' device files open. As a result, K8s fails to detach the GPUs.

3. Cause

  • 3.1. Attach phase
    Several Pods managed by the GPU Operator (such as dcgm, device-plugin, etc.) are designed to discover GPUs only once at program startup.
    Even if Node Feature Discovery (NFD) rescans and detects a newly attached GPU, these Pods cannot detect it.
  • 3.2. Detach phase
    Several Pods managed by the GPU Operator (such as nvidia-persistenced, the driver daemonset, etc.) keep the GPU's device file open, even after the GPU is disabled with the "nvidia-smi drain -r" command. (A quick way to confirm which processes still hold a device file open is sketched below.)
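For diagnosis, the following self-contained sketch walks /proc/<pid>/fd to list the processes that hold NVIDIA device files open (roughly what `lsof /dev/nvidia*` shows); it uses only the Linux /proc layout and the Go standard library.

```go
// Sketch: find which processes still hold NVIDIA device files open by walking
// /proc/<pid>/fd. Run as root on the worker node to see all processes.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	pids, _ := filepath.Glob("/proc/[0-9]*")
	for _, pidDir := range pids {
		fds, err := os.ReadDir(filepath.Join(pidDir, "fd"))
		if err != nil {
			continue // process exited or permission denied
		}
		for _, fd := range fds {
			target, err := os.Readlink(filepath.Join(pidDir, "fd", fd.Name()))
			if err != nil || !strings.HasPrefix(target, "/dev/nvidia") {
				continue
			}
			pid := filepath.Base(pidDir)
			comm, _ := os.ReadFile(filepath.Join(pidDir, "comm"))
			fmt.Printf("pid %s (%s) holds %s\n", pid, strings.TrimSpace(string(comm)), target)
		}
	}
}
```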

4. Proposal
Attach phase

  • Method1
    Implement a mechanism for each Pod managed by the GPU Operator to periodically discover hot-plugged GPUs (polling). A sketch combining this method and Method 2 follows this list.
  • Method2
    Implement a signal handler in each Pod managed by the GPU Operator so that it rescans attached GPUs when it receives a signal. There are two options for who sends the signal.
    • Option1:
      The NVIDIA driver in kernel space can detect hot-plugged GPUs. After detecting an attached GPU, the driver sends a signal including the GPU ID to each Pod. When a Pod catches this signal, it discovers the hot-plugged GPU specified in the signal.
    • Option2:
      When a GPU is hot-plugged, could nvidia-smi, triggered by CDI, send a signal to each Pod so that each Pod discovers the hot-plugged GPUs indicated in the signal?
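To make the two attach-phase methods concrete, here is a minimal user-space sketch. SIGHUP as the rescan trigger and the simple /dev scan in discoverGPUs are assumptions for illustration; a real GPU Operator component would query NVML and use whatever signal or IPC the driver or nvidia-smi actually provides.

```go
// Sketch: a rescan loop combining Method 1 (periodic polling) and
// Method 2 (rescan on signal). SIGHUP is an assumed choice of signal.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"path/filepath"
	"syscall"
	"time"
)

// discoverGPUs returns the NVIDIA device nodes currently visible in /dev.
// A real component would query NVML instead of globbing /dev.
func discoverGPUs() []string {
	devs, _ := filepath.Glob("/dev/nvidia[0-9]*")
	return devs
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGHUP) // Method 2: rescan when signalled

	ticker := time.NewTicker(30 * time.Second) // Method 1: poll periodically
	defer ticker.Stop()

	known := map[string]bool{}
	rescan := func(reason string) {
		for _, dev := range discoverGPUs() {
			if !known[dev] {
				known[dev] = true
				fmt.Printf("detected new GPU device %s (%s)\n", dev, reason)
				// here: register the device with the component's own inventory
			}
		}
	}

	rescan("startup")
	for {
		select {
		case <-sigs:
			rescan("signal")
		case <-ticker.C:
			rescan("poll")
		}
	}
}
```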

Detach phase
When the NVIDIA kernel driver recognizes the detachment of GPUs via the "nvidia-smi drain -r" command, it sends a kill signal to all processes (Pods) that have the GPUs' device files open. However, some Pods, such as the nvidia-persistenced and dcgm processes, do not close the device files. We therefore propose the following modifications (a sketch of the proposed signal handler follows the list).

  • Method1

      1. Implement a mechanism for the NVIDIA kernel driver to send a signal to each Pod that has the device files open.
      2. Implement a signal handler in those processes to close the device files of the detaching GPUs when they receive the signal.
  • Method2

      1. Implement a mechanism for nvidia-smi to send a signal to each Pod that has the device files open, when the GPU is removed with "nvidia-smi drain -r".
      2. Implement a signal handler in those processes to close the device files of the detaching GPUs when they receive the signal.
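A minimal sketch of step 2 of either method, i.e. the handler a component such as nvidia-persistenced or dcgm would need. SIGUSR1, the deviceTable type, and closing all tracked devices (rather than only the GPUs named in the signal) are assumptions made for illustration.

```go
// Sketch of the proposed detach handler: on an agreed signal, close any open
// file descriptors for GPU device files so the GPU can be detached.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// deviceTable tracks the NVIDIA device files this process currently holds open.
type deviceTable struct {
	mu    sync.Mutex
	files map[string]*os.File
}

func (t *deviceTable) open(path string) error {
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	t.mu.Lock()
	t.files[path] = f
	t.mu.Unlock()
	return nil
}

// closeAll releases every tracked device file so the detach can proceed.
func (t *deviceTable) closeAll() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for path, f := range t.files {
		f.Close()
		delete(t.files, path)
		fmt.Printf("closed %s\n", path)
	}
}

func main() {
	table := &deviceTable{files: map[string]*os.File{}}
	_ = table.open("/dev/nvidia0") // example device node; may not exist on this host

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1) // assumed detach-notification signal
	go func() {
		for range sigs {
			table.closeAll()
		}
	}()

	select {} // stand-in for the component's real work loop
}
```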
hase1128 (Author) commented:

@klueska
We are proposing GPU hot-plug using DRA:
kubernetes/enhancements#5012

To achieve this, we need to improve the K8s scheduler and the existing DRA driver, and develop a composable DRA controller.
Additionally, some components deployed by the GPU Operator also need to support hot-plug.
We have made the above proposal. Please confirm whether it is technically possible to support hot-plug, or assign a suitable person to it, so that we can discuss the details.
