# GPU Sharing User Guide

## Environment setup

### Install Volcano

#### 1. Install from source

Refer to the [Install Guide](../../installer/README.md) to install Volcano.

After installation, update the scheduler configuration:

```shell script
kubectl edit cm -n volcano-system volcano-scheduler-configmap
```

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
        arguments:
          predicate.GPUSharingEnable: true # enable GPU sharing
      - name: proportion
      - name: nodeorder
      - name: binpack
```

#### 2. Install from release package

Same as above: after installation, update the scheduler configuration in the `volcano-scheduler-configmap` ConfigMap.

### Install Volcano device plugin

Please refer to [volcano device plugin](https://github.com/volcano-sh/devices/blob/master/README.md#quick-start).

### Verify environment is ready

Check the node status; the environment is ready if `volcano.sh/gpu-memory` and `volcano.sh/gpu-number` appear under `allocatable` and `capacity`:

```shell script
$ kubectl get node {node name} -oyaml
...
status:
  addresses:
  - address: 172.17.0.3
    type: InternalIP
  - address: volcano-control-plane
    type: Hostname
  allocatable:
    cpu: "4"
    ephemeral-storage: 123722704Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 8174332Ki
    pods: "110"
    volcano.sh/gpu-memory: "89424"
    volcano.sh/gpu-number: "8"    # GPU resource
  capacity:
    cpu: "4"
    ephemeral-storage: 123722704Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 8174332Ki
    pods: "110"
    volcano.sh/gpu-memory: "89424"
    volcano.sh/gpu-number: "8"    # GPU resource
```
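
As an illustrative sanity check on these numbers (an aside, not part of the Volcano docs): the node advertises 89424 of `volcano.sh/gpu-memory` spread across 8 cards, which works out to 11178 MB per card, the per-card value later reported to pods as `VOLCANO_GPU_TOTAL`:

```shell script
# Per-card GPU memory = total advertised gpu-memory / number of cards
echo $((89424 / 8))   # 11178
```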

### Running GPU Sharing Jobs

NVIDIA GPUs can now be shared via container-level resource requirements using the resource name `volcano.sh/gpu-memory`:

```shell script
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-devel
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/gpu-memory: 1024 # requesting 1024MB GPU memory
EOF
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod2
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-devel
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/gpu-memory: 1024 # requesting 1024MB GPU memory
EOF
```
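
The scheduling outcome can be sketched with a toy model (an illustration only, not the scheduler's actual code): each card starts with 11178 MB free, each pod asks for 1024 MB, and a first-fit check lands both pods on the same card:

```shell script
# Toy first-fit check: card 0 starts with 11178 MB free; each pod
# requests 1024 MB, so both fit on the same card.
free=11178
for pod in gpu-pod1 gpu-pod2; do
  if [ "$free" -ge 1024 ]; then
    free=$((free - 1024))
    echo "$pod -> card 0 (free left: ${free}MB)"
  fi
done
# prints:
#   gpu-pod1 -> card 0 (free left: 10154MB)
#   gpu-pod2 -> card 0 (free left: 9130MB)
```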

If the pods are running in an idle cluster, you can see the pods sharing one GPU card:

```shell script
$ kubectl exec -ti gpu-pod1 env
...
VOLCANO_GPU_TOTAL=11178
VOLCANO_GPU_ALLOCATED=1024
NVIDIA_VISIBLE_DEVICES=0
...
$ kubectl exec -ti gpu-pod2 env
...
VOLCANO_GPU_TOTAL=11178
VOLCANO_GPU_ALLOCATED=1024
NVIDIA_VISIBLE_DEVICES=0
...
```

### Understanding how GPU sharing works

The GPU sharing workflow is depicted below:

![gpu_sharing](../images/gpu-share-flow.png)

1. A user creates a pod with a `volcano.sh/gpu-memory` resource request.

2. The Volcano scheduler runs its predicates, allocates GPU resources for the pod, and adds annotations such as:

```yaml
annotations:
  volcano.sh/gpu-index: "0"
  volcano.sh/predicate-time: "1593764466550835304"
```

3. The kubelet watches pods bound to its node and calls the Allocate API to set environment variables before starting the container:

```yaml
env:
  NVIDIA_VISIBLE_DEVICES: "0"   # GPU card index
  VOLCANO_GPU_ALLOCATED: "1024" # GPU memory allocated to the container
  VOLCANO_GPU_TOTAL: "11178"    # GPU memory of the card
```
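
The mapping in step 3 can be sketched as follows (a toy reconstruction; the values and variable names come from this guide, not from the device plugin's source):

```shell script
# Toy reconstruction of step 3: derive the container env vars from the
# scheduler's gpu-index annotation, the pod's request, and the card size.
gpu_index="0"       # from annotation volcano.sh/gpu-index
requested=1024      # from the pod's volcano.sh/gpu-memory limit
card_total=11178    # per-card GPU memory on this node
echo "NVIDIA_VISIBLE_DEVICES=${gpu_index}"
echo "VOLCANO_GPU_ALLOCATED=${requested}"
echo "VOLCANO_GPU_TOTAL=${card_total}"
```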