
Wait for GPU Operator Subscription, InstallPlan & Deployment #1108

Merged: 3 commits merged into red-hat-data-services:master on Jan 15, 2024

Conversation

manosnoam (Contributor)

To avoid a situation where gpu_deploy.sh waits for nvidia-gpu-operator pods before they have been created by the NVIDIA GPU Operator installation, the script must first wait for the Operator Subscription, InstallPlan, and Deployment to complete.
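
For context, here is a minimal sketch of the wait sequence this change introduces, assembled from the commands quoted in the review thread and the log below; the subscription wait condition and the exact ordering are assumptions, not a copy of the script:

# Wait for the operator Subscriptions to report healthy catalog sources
# (the condition name used here is an assumption)
oc wait subscription -n nvidia-gpu-operator --all --for condition=CatalogSourcesUnhealthy=False --timeout=3m

# Wait for the generated InstallPlans to finish installing (quoted in the review thread below)
oc wait installplan -n nvidia-gpu-operator --all --for condition=Installed --timeout=3m

# Wait for the operator Deployments to roll out (both rollouts appear in the log below)
oc rollout status -n nvidia-gpu-operator deployment gpu-operator --watch --timeout=3m
oc rollout status -n nvidia-gpu-operator deployment nfd-controller-manager --watch --timeout=3m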


Robot Results

✅ Passed: 397    ❌ Failed: 0    ⏭️ Skipped: 0    Total: 397    Pass %: 100

FedeAlonso previously approved these changes Jan 11, 2024

manosnoam commented Jan 11, 2024

This has been tested both on a cluster without the NVIDIA operator and on a cluster where the operator was already installed.

Example output from running the commands on a cluster without the operator:

22:05:50  Deploying Nvidia GPU Operator
[Pipeline] sh
22:05:51  + /home/jenkins/workspace/rhods/rhods-smoke/ods-ci/ods_ci/tasks/Resources/Provisioning/GPU/gpu_deploy.sh
22:05:51  Create and apply 'gpu_install.yaml' to install Nvidia GPU Operator
22:05:53  namespace/nvidia-gpu-operator unchanged
22:05:53  operatorgroup.operators.coreos.com/nvidia-gpu-operator-group created
22:05:54  subscription.operators.coreos.com/gpu-operator-certified created
22:05:54  subscription.operators.coreos.com/nfd created
22:05:54  Wait for Nvidia GPU Operator Subscription, InstallPlan and Deployment to complete
22:06:20  subscription.operators.coreos.com/nfd condition met
22:06:20  subscription.operators.coreos.com/gpu-operator-certified condition met
22:06:38  installplan.operators.coreos.com/install-r7ghv condition met
22:06:38  installplan.operators.coreos.com/install-xtsqx condition met
22:06:38  Waiting for deployment "gpu-operator" rollout to finish: 1 old replicas are pending termination...
22:06:41  Waiting for deployment "gpu-operator" rollout to finish: 1 old replicas are pending termination...
22:06:41  deployment "gpu-operator" successfully rolled out
22:06:41  deployment "nfd-controller-manager" successfully rolled out
22:06:42  operator.operators.coreos.com/nfd.nvidia-gpu-operator condition met
22:06:42  operator.operators.coreos.com/gpu-operator-certified.nvidia-gpu-operator condition met
22:06:43  gpu-operator-694b7778cb-bxm2c   1/1   Terminating   0     30s
22:06:43  gpu-operator-c66699dfb-g4m7q    1/1   Running       0     13s
22:06:43  Waiting until GPU pods of 'gpu-operator' in namespace 'nvidia-gpu-operator' are in running state...
22:06:43  pod/gpu-operator-694b7778cb-bxm2c condition met
22:06:43  pod/gpu-operator-c66699dfb-g4m7q condition met
22:06:49  nodefeaturediscovery.nfd.openshift.io/nfd-instance created
22:06:54  clusterpolicy.nvidia.com/gpu-cluster-policy configured
22:06:54  
22:06:54  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:00  
22:07:00  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:06  
22:07:06  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:11  
22:07:11  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:16  
22:07:16  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:21  
22:07:21  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:26  
22:07:26  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:32  
22:07:32  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:37  
22:07:37  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:43  
22:07:43  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:49  
22:07:49  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:54  
22:07:54  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:59  
22:07:59  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:08:04  nvidia-device-plugin-daemonset-fp7dk   0/1   Init:0/1   0     2s
22:08:04  Waiting until GPU pods of 'nvidia-device-plugin-daemonset' in namespace 'nvidia-gpu-operator' are in running state...
22:11:56  pod/nvidia-device-plugin-daemonset-fp7dk condition met
22:11:56  nvidia-container-toolkit-daemonset-4v4pj   1/1   Running   0     3m44s
22:11:56  Waiting until GPU pods of 'nvidia-container-toolkit-daemonset' in namespace 'nvidia-gpu-operator' are in running state...
22:11:56  pod/nvidia-container-toolkit-daemonset-4v4pj condition met
22:11:56  nvidia-dcgm-exporter-z4cqs   1/1   Running   0     3m46s
22:11:56  Waiting until GPU pods of 'nvidia-dcgm-exporter' in namespace 'nvidia-gpu-operator' are in running state...
22:11:56  pod/nvidia-dcgm-exporter-z4cqs condition met
22:11:56  gpu-feature-discovery-4p44f   1/1   Running   0     3m46s
22:11:56  Waiting until GPU pods of 'gpu-feature-discovery' in namespace 'nvidia-gpu-operator' are in running state...
22:11:56  pod/gpu-feature-discovery-4p44f condition met
22:11:56  nvidia-operator-validator-qcmd5   0/1   Init:3/4   2     3m47s
22:11:56  Waiting until GPU pods of 'nvidia-operator-validator' in namespace 'nvidia-gpu-operator' are in running state...
22:12:02  pod/nvidia-operator-validator-qcmd5 condition met
22:12:02  Deleting configmap migration-gpu-status
22:12:02  configmap "migration-gpu-status" deleted
22:12:02  Rollout restart rhods-dashboard deployment
22:12:02  deployment.apps/rhods-dashboard restarted
22:12:02  Waiting for up to 3 minutes until rhods-dashboard deployment is rolled out
22:12:03  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 4 out of 5 new replicas have been updated...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 2 old replicas are pending termination...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 2 old replicas are pending termination...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 2 old replicas are pending termination...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 1 old replicas are pending termination...
22:14:06  Waiting for deployment "rhods-dashboard" rollout to finish: 1 old replicas are pending termination...
22:14:06  Waiting for deployment "rhods-dashboard" rollout to finish: 1 old replicas are pending termination...
22:14:06  Waiting for deployment "rhods-dashboard" rollout to finish: 4 of 5 updated replicas are available...
22:14:06  deployment "rhods-dashboard" successfully rolled out
22:14:06  Verifying that an AcceleratorProfiles resource was created in redhat-ods-applications
22:14:06  Name:         migrated-gpu
22:14:06  Namespace:    redhat-ods-applications
22:14:06  Labels:       <none>
22:14:06  Annotations:  <none>
22:14:06  API Version:  dashboard.opendatahub.io/v1
22:14:06  Kind:         AcceleratorProfile
22:14:06  Metadata:
22:14:06    Creation Timestamp:  2024-01-11T12:04:13Z
22:14:06    Generation:          1
22:14:06    Managed Fields:
22:14:06      API Version:  dashboard.opendatahub.io/v1
22:14:06      Fields Type:  FieldsV1
22:14:06      fieldsV1:
22:14:06        f:spec:
22:14:06          .:
22:14:06          f:displayName:
22:14:06          f:enabled:
22:14:06          f:identifier:
22:14:06          f:tolerations:
22:14:06      Manager:         unknown
22:14:06      Operation:       Update
22:14:06      Time:            2024-01-11T12:04:13Z
22:14:06    Resource Version:  296166
22:14:06    UID:               a0a2ff7c-efe4-4664-8d2b-b25051854493
22:14:06  Spec:
22:14:06    Display Name:  NVIDIA GPU
22:14:06    Enabled:       true
22:14:06    Identifier:    nvidia.com/gpu
22:14:06    Tolerations:
22:14:06      Effect:    NoSchedule
22:14:06      Key:       nvidia.com/gpu
22:14:06      Operator:  Exists
22:14:06  Events:        <none>


Review thread on the wait commands added in gpu_deploy.sh:

oc wait installplan -n nvidia-gpu-operator --all --for condition=Installed --timeout=3m
oc rollout status -n nvidia-gpu-operator deployment gpu-operator --watch --timeout=3m

A reviewer (Contributor) commented:

I don't think the rollout is necessary. Also, could you add a check for the NVIDIA operator as well?

manosnoam (Contributor, Author) replied on Jan 14, 2024:


I've updated the script to also wait for the gpu-operator-certified subscription and for the nfd-controller-manager deployment.
The rollout status wait is still needed; see the updated execution output in my previous comment.
I also added a wait for the Operator's resources to exist, and the wait for pods as you suggested.
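
Roughly, the additional waits described here could look like the following sketch (the --all pod selector and the plain `oc get` check are simplifications and assumptions, not the script's exact contents):

# Confirm the Operator resources exist; the log above shows the script waiting on
# operator.operators.coreos.com resources, but the exact --for argument is not visible in this thread
oc get operator nfd.nvidia-gpu-operator gpu-operator-certified.nvidia-gpu-operator

# Wait until the GPU operator pods are Ready (the script waits per workload; --all is a simplification)
oc wait pod -n nvidia-gpu-operator --all --for condition=Ready --timeout=5m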

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

manosnoam added the "verified" label (This PR has been tested with Jenkins) on Jan 14, 2024
manosnoam merged commit b0c8743 into red-hat-data-services:master on Jan 15, 2024
11 checks passed