Merge pull request #230 from dystewart/gpu-doc
Update RHOAI docs to reflect changes coming Jan 14
jtriley authored Jan 16, 2025
2 parents 960ad55 + e7e5824 commit 6ced34b
Showing 4 changed files with 39 additions and 18 deletions.
(One of the four changed files cannot be displayed in the diff view.)
@@ -64,7 +64,7 @@ On the Create workbench page, complete the following information.

- Notebook image (Image selection)

-- Deployment size (Container size and Number of GPUs)
+- Deployment size (Container size, Type and Number of GPUs)

- Environment variables

@@ -82,13 +82,15 @@ On the Create workbench page, complete the following information.
resources, including CPUs and memory. Each container size comes with pre-configured
CPU and memory resources.

-Optionally, you can specify the desired **Number of GPUs** depending on the
+Optionally, you can specify the desired **Accelerator** and **Number of Accelerators** (GPUs), depending on the
nature of your data analysis and machine learning code requirements. However,
this number should not exceed the GPU quota specified by the value of the
"**OpenShift Request on GPU Quota**" attribute that has been approved for
this "**NERC-OCP (OpenShift)**" resource allocation on NERC's ColdFront, as
[described here](../../get-started/allocation/allocation-details.md#pi-and-manager-allocation-view-of-openshift-resource-allocation).

+The different options for accelerator are "NVIDIA A100 GPU", "NVIDIA V100 GPU", and "NONE".

If you need to increase this quota value, you can request a change as
[explained here](../../get-started/allocation/allocation-change-request.md#request-change-resource-allocation-attributes-for-openshift-project).
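
Under the hood, the accelerator selection corresponds to a standard Kubernetes GPU resource request on the workbench pod. As a rough sketch only (the container name is a placeholder and the exact manifest RHOAI generates may differ; the `nvidia.com/gpu` resource name is the one used later in this commit):

    spec:
      containers:
        - name: workbench   # hypothetical container name
          resources:
            limits:
              nvidia.com/gpu: 1   # one accelerator, matching the UI selection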

@@ -97,7 +99,8 @@ Once you have entered the information for your workbench, click **Create**.
![Fill Workbench Information](images/tensor-flow-workbench.png)

For our example project, let's name it "Tensorflow Workbench". We'll select the
-**TensorFlow** image, choose a **Deployment size** of **Small**, **Number of GPUs**
+**TensorFlow** image, choose a **Deployment size** of **Small**,
+**Accelerator** of **NVIDIA A100 GPU**, **Number of Accelerators**
as **1** and allocate a **Cluster storage** space of **1GB**.

!!! info "More About Cluster Storage"
docs/openshift/applications/scaling-and-performance-guide.md (31 additions, 15 deletions)
@@ -136,7 +136,8 @@ Gi, Mi, Ki).
## How to specify pod to use GPU?

So from a **Developer** perspective, the only thing you have to worry about is
-asking for GPU resources when defining your pods, with something like:
+asking for GPU resources when defining your pods, with something like the
+following example requesting an NVIDIA A100 GPU:

spec:
  containers:
@@ -150,14 +151,26 @@
        limits:
          memory: "128Mi"
          cpu: "500m"
+  tolerations:
+    - key: nvidia.com/gpu.product
+      operator: Equal
+      value: NVIDIA-A100-SXM4-40GB
+      effect: NoSchedule
+  nodeSelector:
+    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB

-In the sample Pod Spec above, you can allocate GPUs to pods by specifying the GPU
+In the sample Pod Spec above, you can allocate GPUs to containers by specifying
+the GPU
resource `nvidia.com/gpu` and indicating the desired number of GPUs. This number
should not exceed the GPU quota specified by the value of the
"**OpenShift Request on GPU Quota**" attribute that has been approved for your
"**NERC-OCP (OpenShift)**" resource allocation on NERC's ColdFront as
[described here](../../get-started/allocation/allocation-details.md#pi-and-manager-allocation-view-of-openshift-resource-allocation).

+!!! note "Pod Spec: tolerations & nodeSelector"
+
+    When requesting GPU resources directly from pods and deployments, you must
+    include the `spec.tolerations` and `spec.nodeSelector` shown above for your
+    desired GPU type.

If you need to increase this quota value, you can request a change as
[explained here](../../get-started/allocation/allocation-change-request.md#request-change-resource-allocation-attributes-for-openshift-project).
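
Putting the quota, tolerations, and nodeSelector pieces together, a complete manifest requesting one A100 might look like the following minimal sketch. The pod name, container name, command, and image are placeholders, not part of the NERC docs; also note that for extended resources such as `nvidia.com/gpu`, Kubernetes requires the request to equal the limit:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod                # placeholder name
    spec:
      containers:
        - name: app                # placeholder container name
          image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8   # placeholder CUDA image
          command: ["nvidia-smi"]  # prints the allocated GPU, then exits
          resources:
            requests:
              memory: "64Mi"
              cpu: "250m"
              nvidia.com/gpu: 1
            limits:
              memory: "128Mi"
              cpu: "500m"
              nvidia.com/gpu: 1    # extended resources: request must equal limit
      tolerations:
        - key: nvidia.com/gpu.product
          operator: Equal
          value: NVIDIA-A100-SXM4-40GB
          effect: NoSchedule
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB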

@@ -203,22 +216,25 @@ the name of the GPU device:
We can specify information about the GPU product type, family, count, and so on,
as shown in the Pod Spec above. Also, these node labels can be used in the Pod Spec
to schedule workloads based on criteria such as the GPU device name, as shown under
-_nodeSelector_ as shown below:
+_nodeSelector_ in this case (NVIDIA V100 GPU):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod2
spec:
-  restartPolicy: Never
  containers:
-    - name: cuda-container
-      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
-      command: ["sleep"]
-      args: ["infinity"]
-      resources:
-        limits:
-          nvidia.com/gpu: 1
+    - name: app
+      image: ...
+      resources:
+        requests:
+          memory: "64Mi"
+          cpu: "250m"
+          nvidia.com/gpu: 1
+        limits:
+          memory: "128Mi"
+          cpu: "500m"
+  tolerations:
+    - key: nvidia.com/gpu.product
+      operator: Equal
+      value: Tesla-V100-PCIE-32GB
+      effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.product: Tesla-V100-PCIE-32GB
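
The same `nvidia.com/gpu.product` label can also be used with node affinity, which is useful when a workload can run on more than one GPU type. A hedged sketch, not taken from the NERC docs:

    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: nvidia.com/gpu.product
                    operator: In
                    values:          # either GPU type is acceptable
                      - Tesla-V100-PCIE-32GB
                      - NVIDIA-A100-SXM4-40GB

Since the GPU nodes described above carry `NoSchedule` taints, a pod using this affinity block would also need a matching toleration for each GPU type it allows.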

nerc-theme/main.html (2 additions, 0 deletions)
@@ -1,3 +1,4 @@
+<!--
{% extends "base.html" %} {% block announce %}
<div class="parent">
<div class="maintain">
@@ -23,6 +24,7 @@
></iframe>
</div>
</div>
+-->
{% endblock %} {% block htmltitle %}
<title>New England Research Cloud(NERC)</title>
{% endblock %} {% block footer %}
