Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
# Introduction to GPUs in NERC OpenShift | ||
|
||
NERC OpenShift Container Platform (OCP) clusters use the | ||
[NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) | ||
as well as the | ||
[Node Feature Discovery Operator](https://docs.openshift.com/container-platform/4.15/hardware_enablement/psap-node-feature-discovery-operator.html) | ||
to manage and deploy GPU worker nodes to clusters. | ||
|
||
## NERC GPU Worker Node Architectures | ||
|
||
The NERC OpenShift environment currently supports two different NVIDIA GPU | ||
products: | ||
1. NVIDIA-A100-SXM4-40GB (A100) | ||
2. Tesla-V100-PCIE-32GB (V100) | ||
|
||
A100 worker nodes contain 4 individual GPUs, each with 40 GB of memory. | ||
V100 worker nodes contain 1 GPU with 32 GB of memory. | ||
|
||
## Accessing GPU Resources | ||
|
||
Access to GPU nodes is handled via OCP project allocations through NERC | ||
ColdFront. By default, user projects in NERC OCP clusters do not have access to | ||
GPUs; a NERC admin must grant access through the user's ColdFront allocation. | ||
|
||
## Deploying Workloads to GPUs | ||
|
||
There are two ways to deploy workloads on GPU nodes: | ||
|
||
1. Deploy directly in your OCP namespace: | ||
|
||
In your project namespace you can deploy a GPU workload by explicitly | ||
requesting a GPU in your manifest, for example: | ||
``` | ||
apiVersion: v1 | ||
kind: Pod | ||
metadata: | ||
  name: sample-gpu-request | ||
spec: | ||
  restartPolicy: Never | ||
  containers: | ||
    - name: sample-gpu-request | ||
      image: <your-image-url> | ||
      ... | ||
      ... | ||
      resources: | ||
        limits: | ||
          nvidia.com/gpu: 1 | ||
``` | ||
|
||
2. Deploy through Red Hat OpenShift AI (RHOAI): | ||
|
||
See [Populate the data science project with a Workbench](https://github.com/nerc-project/nerc-docs/blob/main/docs/openshift-ai/data-science-project/using-projects-the-rhoai.md#populate-the-data-science-project-with-a-workbench) | ||
for selecting GPU options. |
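
Returning to the first option (deploying directly in your namespace): because the cluster offers two GPU products, it may be useful to pin a workload to one of them. The sketch below is one possible way to do that, not a confirmed NERC procedure. It assumes the GPU nodes carry the `nvidia.com/gpu.product` label that the NVIDIA GPU Operator's feature discovery normally publishes, that the label values match the product names listed above, and that your project is allowed to set a `nodeSelector`. The pod name is hypothetical, and the `nvidia-smi` command assumes that binary is available inside the container.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-a100-request        # hypothetical name, for illustration only
spec:
  restartPolicy: Never
  # Assumption: nodes are labeled with nvidia.com/gpu.product by the
  # GPU Operator, and the value matches the product name in this document.
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  containers:
    - name: sample-a100-request
      image: <your-image-url>
      # Assumption: nvidia-smi is available in the container; it prints the
      # GPUs visible to the pod and then exits.
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1        # request a single GPU, as in the example above
```

Under the same assumptions, changing the label value to `Tesla-V100-PCIE-32GB` would target a V100 node instead. After applying the manifest with `oc apply -f <file>`, `oc logs sample-a100-request` should show the `nvidia-smi` output if a GPU was attached. A pod that stays in `Pending` may indicate that the project has no GPU allocation or that the label does not match.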