1. Background
Composable Disaggregated Infrastructure (CDI) is an emerging server architecture.
This architecture consists of PCIe switches and PCIe devices connected to those switches, which together work as a resource pool. CDI enables users to dynamically configure physical servers through software definition. As a result, GPUs can be attached to or detached from a server dynamically, without an OS reboot.
CDI provides a control API, which can be accessed through an operator's CLI or GUI, or by management software.
We plan to make it possible for Kubernetes (K8s) to control CDI. This scenario consists of four steps:
1. K8s has ResourceSlices (exposed by the Dynamic Resource Allocation (DRA) feature of K8s) representing the free devices in the resource pool.
2. When a Pod cannot be scheduled because no GPU is available, but free devices exist in the resource pool, this triggers a request to CDI to attach devices from the resource pool to a worker node.
3. After the GPUs are attached, the DRA plugin detects them and the unscheduled Pod becomes schedulable.
4. When K8s detects that no Pods are using the attached GPUs anymore, it detaches these GPUs and returns them to the resource pool.
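For reference, the following Go sketch shows how a composable DRA controller could enumerate ResourceSlices to see which pool devices are free (step 1). It uses the Kubernetes dynamic client; the API group/version (resource.k8s.io/v1beta1) and the idea of comparing slices against pending claims are assumptions that depend on the cluster's DRA version.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster configuration; a kubeconfig-based config also works.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// NOTE: the DRA API group/version depends on the Kubernetes release
	// (e.g. resource.k8s.io/v1alpha3 or v1beta1); adjust as needed.
	gvr := schema.GroupVersionResource{
		Group:    "resource.k8s.io",
		Version:  "v1beta1",
		Resource: "resourceslices",
	}

	slices, err := client.Resource(gvr).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, s := range slices.Items {
		// Each ResourceSlice advertises the devices of one driver/pool;
		// a composable DRA controller would compare these against pending
		// ResourceClaims to decide whether a fabric attach is needed.
		fmt.Printf("slice %s: %v\n", s.GetName(), s.Object["spec"])
	}
}
```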
2. Problem statement
When we tried the above scenario, we faced the following problems.
In the attach phase (step 3 above), the NVIDIA GPU Operator does not recognize the newly attached GPUs (more precisely, the Pods managed by the GPU Operator, such as dcgm, do not). As a result, K8s fails to attach the GPUs.
In the detach phase (step 4 above), several Pods managed by the GPU Operator (such as nvidia-persistenced) keep the GPUs' device files open. As a result, K8s fails to detach the GPUs.
3. Cause
3.1. Attach phase
Several Pods managed by the GPU Operator (such as dcgm, device-plugin, etc.) are designed to discover GPUs only once, at program startup.
Even if Node Feature Discovery (NFD) rescans and detects a newly attached GPU, these Pods cannot detect it.
3.2. Detach phase
Several Pods managed by the GPU Operator (such as nvidia-persistenced, the driver daemonset, etc.) keep the GPU's device file open, even after the GPU is disabled with the “nvidia-smi drain -r” command.
4. Proposal
4.1. Attach phase
Method 1
Implement a mechanism for each Pod managed by the GPU Operator to periodically discover hot-plugged GPUs (polling). A sketch of such a loop is shown below.
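A minimal sketch of such a polling loop in Go, assuming GPUs show up as per-device directories (named by PCI bus ID) under /proc/driver/nvidia/gpus; the poll interval and the callback are placeholders:

```go
package main

import (
	"log"
	"os"
	"time"
)

const gpuProcDir = "/proc/driver/nvidia/gpus" // one sub-directory per GPU (PCI bus ID)

// pollGPUs periodically rescans the driver's proc directory and reports
// GPUs that were not present in the previous scan (i.e. hot-plugged ones).
func pollGPUs(interval time.Duration, onAttach func(pciBusID string)) {
	known := map[string]bool{}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		entries, err := os.ReadDir(gpuProcDir)
		if err != nil {
			log.Printf("scan failed: %v", err)
			continue
		}
		for _, e := range entries {
			if !known[e.Name()] {
				known[e.Name()] = true
				onAttach(e.Name())
			}
		}
	}
}

func main() {
	pollGPUs(10*time.Second, func(pciBusID string) {
		// In a real component this would re-run device discovery, e.g.
		// re-initialize NVML and republish the device to kubelet / DCGM.
		log.Printf("hot-plugged GPU detected: %s", pciBusID)
	})
}
```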
Method 2
Implement a signal handler in each Pod managed by the GPU Operator so that it rescans the attached GPUs when it receives a signal. There are two options for who sends the signal (a sketch of the receiving side follows after the options).
Option 1:
The NVIDIA driver in kernel space is able to detect hot-plugged GPUs. After detecting that a GPU has been attached, the NVIDIA driver sends a signal containing the GPU ID to each Pod. When a Pod catches this signal, it discovers the hot-plugged GPU specified in the signal.
Option 2:
When a hot-plugged GPU is added, could nvidia-smi, triggered by CDI, send a signal to each Pod so that each Pod discovers the hot-plugged GPU specified in the signal?
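For either option, the receiving side could look roughly like the following Go sketch. SIGHUP is only an assumed trigger; note that a plain POSIX signal cannot carry a GPU ID through Go's os/signal, so conveying the ID would need sigqueue()/real-time signals or some side channel.

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

// rescanGPUs re-runs device discovery; here it only logs, but a Pod such as
// the device plugin would re-enumerate devices and update its internal state.
func rescanGPUs() {
	log.Println("rescanning attached GPUs...")
	// e.g. re-read /proc/driver/nvidia/gpus or re-initialize NVML here
}

func main() {
	// SIGHUP is used here only as an example trigger; the actual signal
	// (and how a GPU ID would be conveyed) is an open design question.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGHUP)

	go func() {
		for range sigs {
			rescanGPUs()
		}
	}()

	select {} // keep the process alive (stand-in for the Pod's main work)
}
```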
4.2. Detach phase
When the NVIDIA kernel driver recognizes the detachment of GPUs triggered by the “nvidia-smi drain -r” command, it sends a kill signal to all processes (Pods) that have the GPUs' device files open. However, some Pods, such as the nvidia-persistenced and dcgm processes, do not close the device files. So we propose the following modifications.
Method 1
Implement a mechanism for the NVIDIA kernel driver to send a signal to each Pod that has the device files open.
Implement a signal handler in those processes to close the device files of the detaching GPUs when they receive the signal (see the sketch below).
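A minimal sketch of the handler side (which applies to Method 1 and Method 2 alike); the choice of SIGUSR1 and the way the process tracks its open /dev/nvidia* handles are assumptions, and a real implementation in nvidia-persistenced or dcgm would be in C but follow the same pattern:

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// deviceFiles tracks the /dev/nvidia* files this process currently holds open.
var (
	mu          sync.Mutex
	deviceFiles = map[string]*os.File{} // path -> open handle
)

// closeDetachingGPUs closes every tracked device file so the kernel driver
// can complete the detach. A real implementation would close only the files
// of the GPUs being removed, if that information is available.
func closeDetachingGPUs() {
	mu.Lock()
	defer mu.Unlock()
	for path, f := range deviceFiles {
		if err := f.Close(); err != nil {
			log.Printf("close %s: %v", path, err)
		}
		delete(deviceFiles, path)
		log.Printf("released %s for detach", path)
	}
}

func main() {
	// SIGUSR1 is an assumed choice of signal for "GPU is being detached".
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
			closeDetachingGPUs()
		}
	}()

	select {} // stand-in for the Pod's normal work, which populates deviceFiles
}
```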
Method 2
Implement a mechanism for nvidia-smi to send a signal to each Pod that has the device files open when the GPU is removed with “nvidia-smi drain -r” (see the sender-side sketch below).
Implement a signal handler in those processes to close the device files of the detaching GPUs when they receive the signal, as in Method 1.
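To illustrate the sending side of Method 2, the following sketch finds the processes that hold a given GPU device file open by walking /proc/&lt;pid&gt;/fd and signals them. This is only a userspace approximation of what nvidia-smi (or, in Method 1, the kernel driver) would do; the device path and the signal are placeholders that match the handler sketch above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"syscall"
)

// pidsHoldingDevice walks /proc/<pid>/fd and returns the PIDs that have the
// given device file (e.g. /dev/nvidia1) open.
func pidsHoldingDevice(devPath string) ([]int, error) {
	procDirs, err := os.ReadDir("/proc")
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, d := range procDirs {
		pid, err := strconv.Atoi(d.Name())
		if err != nil {
			continue // not a process directory
		}
		fds, err := os.ReadDir(filepath.Join("/proc", d.Name(), "fd"))
		if err != nil {
			continue // process exited or no permission
		}
		for _, fd := range fds {
			target, err := os.Readlink(filepath.Join("/proc", d.Name(), "fd", fd.Name()))
			if err == nil && target == devPath {
				pids = append(pids, pid)
				break
			}
		}
	}
	return pids, nil
}

func main() {
	// /dev/nvidia1 is a placeholder for the GPU selected by "nvidia-smi drain -r".
	pids, err := pidsHoldingDevice("/dev/nvidia1")
	if err != nil {
		panic(err)
	}
	for _, pid := range pids {
		fmt.Printf("asking pid %d to release the device\n", pid)
		// SIGUSR1 matches the assumed handler in the sketch above.
		_ = syscall.Kill(pid, syscall.SIGUSR1)
	}
}
```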
To achieve this, we need to improve the K8s scheduler and the existing DRA driver, and develop a composable DRA controller.
Additionally, some components deployed by the GPU Operator also need to support hot-plug.
We have made the above proposal. Please confirm whether it is technically possible to support hot-plug, or assign a suitable person to it, and then let us discuss the details.