Commit

Merge pull request #240 from dystewart/add-gpu-docs
Add GPU subsection to OCP docs
Milstein authored Jan 29, 2025
2 parents e31e0ca + d046c67 commit eede9e3
Showing 3 changed files with 49 additions and 0 deletions.
43 changes: 43 additions & 0 deletions docs/openshift/gpus/intro-to-gpus-in-nerc-ocp.md
@@ -0,0 +1,43 @@
# Introduction to GPUs in NERC OpenShift

NERC OCP clusters leverage the [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html)
as well as the [Node Feature Discovery Operator](https://docs.openshift.com/container-platform/4.15/hardware_enablement/psap-node-feature-discovery-operator.html)
to manage and deploy GPU worker nodes to clusters.

GPU nodes in NERC clusters are also managed via
[taints](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/)
according to their GPU device.
This ensures that only workloads explicitly requesting GPUs will consume GPU
resources.
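As a sketch of how a workload tolerates such a taint (the taint key and effect below are assumptions for illustration, not NERC's actual configuration), a pod spec could carry a toleration like:

```yaml
# Illustrative fragment only: the taint key/effect are assumed,
# not taken from NERC's cluster configuration.
spec:
  tolerations:
    - key: nvidia.com/gpu   # hypothetical GPU taint key
      operator: Exists
      effect: NoSchedule
```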

## NERC GPU Worker Node Architectures

The NERC OpenShift environment currently supports two different NVIDIA GPU
products:

1. NVIDIA-A100-SXM4-40GB (A100)
2. Tesla-V100-PCIE-32GB (V100)

A100 worker nodes contain 4 individual GPUs, each with 40 GB of memory.
V100 worker nodes contain 1 GPU with 32 GB of memory.
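Because the GPU Operator and Node Feature Discovery stack labels nodes with their GPU product, a workload can target a specific architecture with a node selector. A sketch follows; the label value shown is an assumption and should be verified against what the nodes actually report:

```yaml
# Sketch: targets A100 nodes via the GPU product label advertised by
# GPU Feature Discovery. Confirm the exact label value on your cluster
# (e.g., with `oc describe node <node-name>`).
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
```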

## Accessing GPU Resources

Access to GPU nodes is handled via OCP project allocations through NERC
ColdFront. By default, user projects in NERC OCP clusters do not have access to
GPUs and access must be granted through the user's ColdFront allocation by a
NERC admin.

## Deploying Workloads to GPUs

There are two ways to deploy workloads on GPU nodes:

1. Deploy directly in your OCP namespace:

In your project namespace, you can deploy a GPU workload by explicitly
requesting a GPU in your manifest. See [How to specify pod to use GPU](https://nerc-project.github.io/nerc-docs/openshift/applications/scaling-and-performance-guide/#how-to-specify-pod-to-use-gpu).

2. Deploy through RHOAI:

See [Populate the data science project with a Workbench](https://nerc-project.github.io/nerc-docs/openshift-ai/data-science-project/using-projects-the-rhoai/#populate-the-data-science-project-with-a-workbench)
for selecting GPU options.
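For the first option, a minimal pod that explicitly requests a single GPU might look like the following sketch (the pod name and container image are placeholders, not NERC-specific values):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example   # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # example CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # schedules the pod onto a GPU worker node
```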
4 changes: 4 additions & 0 deletions docs/openshift/index.md
@@ -37,6 +37,10 @@ the list below.

- [Storage Overview](storage/storage-overview.md)

## GPUs

- [Intro to GPUs in NERC OCP Clusters](gpus/intro-to-gpus-in-nerc-ocp.md)

## Deleting Applications

- [Deleting your applications](applications/deleting-applications.md)
2 changes: 2 additions & 0 deletions mkdocs.yaml
@@ -107,6 +107,8 @@ nav:
- Editing Applications:
- Editing your applications: openshift/applications/editing-applications.md
- Scaling and Performance Guide: openshift/applications/scaling-and-performance-guide.md
- GPUs:
- Introduction to GPUs in NERC OpenShift: openshift/gpus/intro-to-gpus-in-nerc-ocp.md
- Storage:
- Storage Overview: openshift/storage/storage-overview.md
- Deleting Applications:
