
Verify GPU AcceleratorProfile is created after restarting Dashboard #1103

Merged: 3 commits into red-hat-data-services:master on Jan 10, 2024

Conversation

manosnoam
Contributor

@manosnoam manosnoam commented Jan 9, 2024

Add two verification steps to the gpu_deploy.sh script after restarting the rhods-dashboard deployment.

  1. First wait for all relevant pods to be running.
  2. Then delete configmap.
  3. Then rollout restart rhods-dashboard deployment.
  4. Then wait for rollout status and replicas to be successful.
  5. Finally, verify that an AcceleratorProfile resource is created.

Notice that the wait for running pods happens before the deployment is restarted,
which is why an additional wait for the rollout status is needed after restarting the deployment.
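The restart-and-verify sequence above can be sketched as follows (a sketch only: the function name is made up, while the namespace, configmap, and deployment names come from the console output later in this PR):

```shell
#!/bin/bash
# Illustrative sketch of the restart-and-verify sequence described above.
# Resource names are taken from the PR's console output; the function name
# is hypothetical, not the one used in gpu_deploy.sh.

restart_and_verify_dashboard() {
  echo "Deleting configmap migration-gpu-status"
  oc delete configmap migration-gpu-status -n redhat-ods-applications

  echo "Rollout restart rhods-dashboard deployment"
  oc rollout restart deployment.apps/rhods-dashboard -n redhat-ods-applications

  echo "Waiting for up to 3 minutes until rhods-dashboard deployment is rolled out"
  oc rollout status deployment.apps/rhods-dashboard \
    -n redhat-ods-applications --watch --timeout 3m

  echo "Verifying that an AcceleratorProfiles resource was created"
  oc describe AcceleratorProfiles -n redhat-ods-applications
}
```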

Add to gpu_deploy.sh script two verification steps after restarting
rhods-dashboard deployment:
- Wait for up to 3 minutes until the deployment is rolled out
- Verify that an AcceleratorProfiles resource is created

Signed-off-by: manosnoam <[email protected]>
Contributor

github-actions bot commented Jan 9, 2024

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % |
|-----------|-----------|------------|-------|--------|
| 393       | 0         | 0          | 393   | 100    |

```shell
oc rollout status deployment.apps/rhods-dashboard -n redhat-ods-applications --watch --timeout 3m

echo "Verifying that an AcceleratorProfiles resource was created in redhat-ods-applications"
oc describe AcceleratorProfiles -n redhat-ods-applications
```
Contributor


Why do a describe?

Contributor


Would it be better to check that the instance was created?

Contributor Author


Describe checks whether the resource was created and also describes it, for example:

```
$ oc describe AcceleratorProfiles -n redhat-ods-applications
Name:         migrated-gpu
Namespace:    redhat-ods-applications
Labels:       <none>
Annotations:  <none>
API Version:  dashboard.opendatahub.io/v1
Kind:         AcceleratorProfile
Metadata:
  Creation Timestamp:  2024-01-09T15:46:11Z
  Generation:          1
  Managed Fields:
    API Version:  dashboard.opendatahub.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:displayName:
        f:enabled:
        f:identifier:
        f:tolerations:
    Manager:         unknown
    Operation:       Update
    Time:            2024-01-09T15:46:11Z
  Resource Version:  2416221
  UID:               ce19e4c7-97eb-46b6-b2f4-cc719e0d8e1b
Spec:
  Display Name:  NVIDIA GPU
  Enabled:       true
  Identifier:    nvidia.com/gpu
  Tolerations:
    Effect:    NoSchedule
    Key:       nvidia.com/gpu
    Operator:  Exists
Events:        <none>
```

If the resource type is missing, it fails with:

```
error: the server doesn't have a resource type "AcceleratorProfiles"
```
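A stricter check, along the lines the reviewer suggests, could assert that at least one instance exists before describing it. A hypothetical helper (not part of the merged PR; the namespace is the one used throughout this PR):

```shell
verify_accelerator_profile() {
  # Fail explicitly when no AcceleratorProfile instance exists in the
  # namespace, then describe whatever was found.
  local ns="redhat-ods-applications"
  local count
  count=$(oc get AcceleratorProfiles -n "$ns" -o name 2>/dev/null | wc -l)
  if [ "$count" -lt 1 ]; then
    echo "ERROR: no AcceleratorProfile found in $ns" >&2
    return 1
  fi
  oc describe AcceleratorProfiles -n "$ns"
}
```

Unlike a bare `describe`, this distinguishes "the CRD exists but has no instances" (where `describe` prints nothing and exits 0) from "an instance was actually created".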

@manosnoam manosnoam requested a review from apodhrad January 9, 2024 17:06
The previous implementation of the function did not handle completed
pods, so the script hung for ~45 minutes waiting for completed pods
(e.g. nvidia-cuda-validator pods) to become running:
```
19:24:53  GPU installation seems to be still running
...
20:13:48  nvidia-cuda-validator-2jnxd  0/1  Completed  0  5h50m
20:13:48  nvidia-cuda-validator-nsf98  0/1  Completed  0  5h51m
...
21:11:34  ERROR: Timeout reached while waiting for gpu operator
```

A simple call to `oc wait --for=condition=ready pod -l app=$pod_label`
within the function should resolve it.
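A sketch of such a wait helper, under stated assumptions (the function name, the `app` label key, and the 300-second timeout are illustrative; only the `oc wait` invocation and the namespace come from this PR):

```shell
wait_for_gpu_pods_running() {
  # Per the fix described above: rely on `oc wait --for=condition=ready`
  # rather than polling pod phases, so Completed pods such as
  # nvidia-cuda-validator do not keep the loop spinning indefinitely.
  local pod_label="$1"
  local namespace="${2:-nvidia-gpu-operator}"
  echo "Waiting until GPU pods of '$pod_label' in namespace '$namespace' are in running state..."
  oc wait --for=condition=ready pod -l app="$pod_label" \
    -n "$namespace" --timeout=300s
}
```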

Signed-off-by: manosnoam <[email protected]>
@manosnoam manosnoam added verified This PR has been tested with Jenkins new test New test(s) added (PR will be listed in release-notes) labels Jan 9, 2024
@manosnoam
Contributor Author

This has been tested on smoke test 4285.
Console output:

```
22:14:21  Deploying Nvidia GPU Operator
[Pipeline] sh
22:14:22  + /home/jenkins/workspace/rhods/rhods-smoke/ods-ci/ods_ci/tasks/Resources/Provisioning/GPU/gpu_deploy.sh
22:14:27  namespace/nvidia-gpu-operator unchanged
22:14:27  operatorgroup.operators.coreos.com/nvidia-gpu-operator-group unchanged
22:14:27  subscription.operators.coreos.com/gpu-operator-certified unchanged
22:14:27  subscription.operators.coreos.com/nfd unchanged
22:14:27  Waiting until GPU pods of 'gpu-operator' in namespace 'nvidia-gpu-operator' are in running state...
22:14:28  pod/gpu-operator-6559b97ffb-v5lnt condition met
22:14:30  nodefeaturediscovery.nfd.openshift.io/nfd-instance configured
22:14:34  clusterpolicy.nvidia.com/gpu-cluster-policy unchanged
22:14:34  Waiting until GPU pods of 'nvidia-device-plugin-daemonset' in namespace 'nvidia-gpu-operator' are in running state...
22:14:35  pod/nvidia-device-plugin-daemonset-kdbl9 condition met
22:14:35  pod/nvidia-device-plugin-daemonset-zgj2t condition met
22:14:35  Waiting until GPU pods of 'nvidia-container-toolkit-daemonset' in namespace 'nvidia-gpu-operator' are in running state...
22:14:36  pod/nvidia-container-toolkit-daemonset-btkjh condition met
22:14:36  pod/nvidia-container-toolkit-daemonset-qtq6c condition met
22:14:36  Waiting until GPU pods of 'nvidia-dcgm-exporter' in namespace 'nvidia-gpu-operator' are in running state...
22:14:37  pod/nvidia-dcgm-exporter-q2snc condition met
22:14:37  pod/nvidia-dcgm-exporter-r8rfc condition met
22:14:37  Waiting until GPU pods of 'gpu-feature-discovery' in namespace 'nvidia-gpu-operator' are in running state...
22:14:38  pod/gpu-feature-discovery-b8xdc condition met
22:14:38  pod/gpu-feature-discovery-hwjx5 condition met
22:14:38  Waiting until GPU pods of 'nvidia-operator-validator' in namespace 'nvidia-gpu-operator' are in running state...
22:14:39  pod/nvidia-operator-validator-r4mm8 condition met
22:14:39  pod/nvidia-operator-validator-vwl76 condition met
22:14:39  Deleting configmap migration-gpu-status
22:14:40  configmap "migration-gpu-status" deleted
22:14:40  Rollout restart rhods-dashboard deployment
22:14:41  deployment.apps/rhods-dashboard restarted
22:14:41  Waiting for up to 3 minutes until rhods-dashboard deployment is rolled out
22:14:41  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 4 out of 5 new replicas have been updated...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 4 out of 5 new replicas have been updated...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 2 old replicas are pending termination...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 2 old replicas are pending termination...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 2 old replicas are pending termination...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 1 old replicas are pending termination...
22:16:15  Waiting for deployment "rhods-dashboard" rollout to finish: 1 old replicas are pending termination...
22:16:15  Waiting for deployment "rhods-dashboard" rollout to finish: 1 old replicas are pending termination...
22:16:15  Waiting for deployment "rhods-dashboard" rollout to finish: 4 of 5 updated replicas are available...
22:16:42  deployment "rhods-dashboard" successfully rolled out
22:16:42  Verifying that an AcceleratorProfiles resource was created in redhat-ods-applications
22:16:43  Name:         migrated-gpu
22:16:43  Namespace:    redhat-ods-applications
22:16:43  Labels:       <none>
22:16:43  Annotations:  <none>
22:16:43  API Version:  dashboard.opendatahub.io/v1
22:16:43  Kind:         AcceleratorProfile
22:16:43  Metadata:
22:16:43    Creation Timestamp:  2024-01-09T15:46:11Z
22:16:43    Generation:          1
22:16:43    Managed Fields:
22:16:43      API Version:  dashboard.opendatahub.io/v1
22:16:43      Fields Type:  FieldsV1
22:16:43      fieldsV1:
22:16:43        f:spec:
22:16:43          .:
22:16:43          f:displayName:
22:16:43          f:enabled:
22:16:43          f:identifier:
22:16:43          f:tolerations:
22:16:43      Manager:         unknown
22:16:43      Operation:       Update
22:16:43      Time:            2024-01-09T15:46:11Z
22:16:43    Resource Version:  2416221
22:16:43    UID:               ce19e4c7-97eb-46b6-b2f4-cc719e0d8e1b
22:16:43  Spec:
22:16:43    Display Name:  NVIDIA GPU
22:16:43    Enabled:       true
22:16:43    Identifier:    nvidia.com/gpu
22:16:43    Tolerations:
22:16:43      Effect:    NoSchedule
22:16:43      Key:       nvidia.com/gpu
22:16:43      Operator:  Exists
22:16:43  Events:        <none>
```


sonarqubecloud bot commented Jan 9, 2024

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@apodhrad
Contributor

I approve this since the PR doesn't break anything and doesn't add extra time. It might be useful for debugging.

@jstourac
Member

Note: a bunch of those shellcheck warnings could be easily fixed.

@manosnoam manosnoam merged commit 6772300 into red-hat-data-services:master Jan 10, 2024
11 checks passed
@manosnoam
Contributor Author

> Note: a bunch of those shellcheck warnings could be easily fixed.

@jstourac oh, I missed your suggestion, since the shellcheck warnings did not break the Quality Gate...
Maybe we should enforce them in the gating.
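For illustration, the most common class of shellcheck warning in scripts like this, SC2086 (unquoted variable expansion), is fixed by quoting. A hypothetical snippet, not taken from gpu_deploy.sh:

```shell
# Hypothetical helper illustrating a typical shellcheck fix (SC2086):
# quote variable expansions so values containing spaces are not word-split
# into separate arguments.
get_pods() {
  local ns="$1"
  # Before (triggers SC2086): oc get pods -n $ns
  # After:
  oc get pods -n "$ns"
}
```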
