
Verify GPU AcceleratorProfile is created after restarting Dashboard #1103

Merged: 3 commits into red-hat-data-services:master on Jan 10, 2024

Conversation

manosnoam
Contributor

@manosnoam manosnoam commented Jan 9, 2024

Add two verification steps to the gpu_deploy.sh script after restarting the rhods-dashboard deployment.

  1. First wait for all relevant pods to be running.
  2. Then delete configmap.
  3. Then rollout restart rhods-dashboard deployment.
  4. Then wait for rollout status and replicas to be successful.
  5. Finally, verify that an AcceleratorProfile resource is created.

Notice that the wait for running pods happens before the deployment is restarted,
which is why an additional wait for the rollout status is needed after restarting the deployment.
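The restart-and-verify sequence above can be sketched as follows (a sketch only: the function name is made up, while the namespace, configmap, and deployment names come from the console output later in this PR):

```shell
#!/bin/bash
# Illustrative sketch of the restart-and-verify sequence described above.
# Resource names are taken from the PR's console output; the function name
# is hypothetical, not the one used in gpu_deploy.sh.

restart_and_verify_dashboard() {
  echo "Deleting configmap migration-gpu-status"
  oc delete configmap migration-gpu-status -n redhat-ods-applications

  echo "Rollout restart rhods-dashboard deployment"
  oc rollout restart deployment.apps/rhods-dashboard -n redhat-ods-applications

  echo "Waiting for up to 3 minutes until rhods-dashboard deployment is rolled out"
  oc rollout status deployment.apps/rhods-dashboard \
    -n redhat-ods-applications --watch --timeout 3m

  echo "Verifying that an AcceleratorProfiles resource was created"
  oc describe AcceleratorProfiles -n redhat-ods-applications
}
```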

Add to gpu_deploy.sh script two verification steps after restarting
rhods-dashboard deployment:
- Wait for up to 3 minutes until the deployment is rolled out
- Verify that an AcceleratorProfiles resource is created

Signed-off-by: manosnoam <[email protected]>
Contributor

github-actions bot commented Jan 9, 2024

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % |
|-----------|-----------|------------|-------|--------|
| 393       | 0         | 0          | 393   | 100    |

```shell
oc rollout status deployment.apps/rhods-dashboard -n redhat-ods-applications --watch --timeout 3m

echo "Verifying that an AcceleratorProfiles resource was created in redhat-ods-applications"
oc describe AcceleratorProfiles -n redhat-ods-applications
```
Contributor


Why do a describe?

Contributor


Would it be better to check that the instance was created?

Contributor Author


Describe checks whether the resource was created and also describes it, for example:

```
$ oc describe AcceleratorProfiles -n redhat-ods-applications
Name:         migrated-gpu
Namespace:    redhat-ods-applications
Labels:       <none>
Annotations:  <none>
API Version:  dashboard.opendatahub.io/v1
Kind:         AcceleratorProfile
Metadata:
  Creation Timestamp:  2024-01-09T15:46:11Z
  Generation:          1
  Managed Fields:
    API Version:  dashboard.opendatahub.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:displayName:
        f:enabled:
        f:identifier:
        f:tolerations:
    Manager:         unknown
    Operation:       Update
    Time:            2024-01-09T15:46:11Z
  Resource Version:  2416221
  UID:               ce19e4c7-97eb-46b6-b2f4-cc719e0d8e1b
Spec:
  Display Name:  NVIDIA GPU
  Enabled:       true
  Identifier:    nvidia.com/gpu
  Tolerations:
    Effect:    NoSchedule
    Key:       nvidia.com/gpu
    Operator:  Exists
Events:        <none>
```

If the resource type is missing, it fails with:

```
error: the server doesn't have a resource type "AcceleratorProfiles"
```
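A stricter check, along the lines the reviewer suggests, could assert that at least one instance exists before describing it. A hypothetical helper (not part of the merged PR; the namespace is the one used throughout this PR):

```shell
verify_accelerator_profile() {
  # Fail explicitly when no AcceleratorProfile instance exists in the
  # namespace, then describe whatever was found.
  local ns="redhat-ods-applications"
  local count
  count=$(oc get AcceleratorProfiles -n "$ns" -o name 2>/dev/null | wc -l)
  if [ "$count" -lt 1 ]; then
    echo "ERROR: no AcceleratorProfile found in $ns" >&2
    return 1
  fi
  oc describe AcceleratorProfiles -n "$ns"
}
```

Unlike a bare `describe`, this distinguishes "the CRD exists but has no instances" (where `describe` prints nothing and exits 0) from "an instance was actually created".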

@manosnoam manosnoam requested a review from apodhrad January 9, 2024 17:06
The previous implementation of the function did not handle completed
pods, so the script hung for ~45 minutes waiting for completed pods
(e.g. nvidia-cuda-validator pods) to become running:
```
19:24:53  GPU installation seems to be still running
...
20:13:48  nvidia-cuda-validator-2jnxd  0/1  Completed  0  5h50m
20:13:48  nvidia-cuda-validator-nsf98  0/1  Completed  0  5h51m
...
21:11:34  ERROR: Timeout reached while waiting for gpu operator
```

A simple call to `oc wait --for=condition=ready pod -l app=$pod_label`
within the function should resolve it.
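A sketch of such a wait helper, under stated assumptions (the function name, the `app` label key, and the 300-second timeout are illustrative; only the `oc wait` invocation and the namespace come from this PR):

```shell
wait_for_gpu_pods_running() {
  # Per the fix described above: rely on `oc wait --for=condition=ready`
  # rather than polling pod phases, so Completed pods such as
  # nvidia-cuda-validator do not keep the loop spinning indefinitely.
  local pod_label="$1"
  local namespace="${2:-nvidia-gpu-operator}"
  echo "Waiting until GPU pods of '$pod_label' in namespace '$namespace' are in running state..."
  oc wait --for=condition=ready pod -l app="$pod_label" \
    -n "$namespace" --timeout=300s
}
```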

Signed-off-by: manosnoam <[email protected]>
@manosnoam manosnoam added verified This PR has been tested with Jenkins new test New test(s) added (PR will be listed in release-notes) labels Jan 9, 2024
@manosnoam
Contributor Author

This has been tested on smoke test 4285.
Console output:

```
22:14:21  Deploying Nvidia GPU Operator
[Pipeline] sh
22:14:22  + /home/jenkins/workspace/rhods/rhods-smoke/ods-ci/ods_ci/tasks/Resources/Provisioning/GPU/gpu_deploy.sh
22:14:27  namespace/nvidia-gpu-operator unchanged
22:14:27  operatorgroup.operators.coreos.com/nvidia-gpu-operator-group unchanged
22:14:27  subscription.operators.coreos.com/gpu-operator-certified unchanged
22:14:27  subscription.operators.coreos.com/nfd unchanged
22:14:27  Waiting until GPU pods of 'gpu-operator' in namespace 'nvidia-gpu-operator' are in running state...
22:14:28  pod/gpu-operator-6559b97ffb-v5lnt condition met
22:14:30  nodefeaturediscovery.nfd.openshift.io/nfd-instance configured
22:14:34  clusterpolicy.nvidia.com/gpu-cluster-policy unchanged
22:14:34  Waiting until GPU pods of 'nvidia-device-plugin-daemonset' in namespace 'nvidia-gpu-operator' are in running state...
22:14:35  pod/nvidia-device-plugin-daemonset-kdbl9 condition met
22:14:35  pod/nvidia-device-plugin-daemonset-zgj2t condition met
22:14:35  Waiting until GPU pods of 'nvidia-container-toolkit-daemonset' in namespace 'nvidia-gpu-operator' are in running state...
22:14:36  pod/nvidia-container-toolkit-daemonset-btkjh condition met
22:14:36  pod/nvidia-container-toolkit-daemonset-qtq6c condition met
22:14:36  Waiting until GPU pods of 'nvidia-dcgm-exporter' in namespace 'nvidia-gpu-operator' are in running state...
22:14:37  pod/nvidia-dcgm-exporter-q2snc condition met
22:14:37  pod/nvidia-dcgm-exporter-r8rfc condition met
22:14:37  Waiting until GPU pods of 'gpu-feature-discovery' in namespace 'nvidia-gpu-operator' are in running state...
22:14:38  pod/gpu-feature-discovery-b8xdc condition met
22:14:38  pod/gpu-feature-discovery-hwjx5 condition met
22:14:38  Waiting until GPU pods of 'nvidia-operator-validator' in namespace 'nvidia-gpu-operator' are in running state...
22:14:39  pod/nvidia-operator-validator-r4mm8 condition met
22:14:39  pod/nvidia-operator-validator-vwl76 condition met
22:14:39  Deleting configmap migration-gpu-status
22:14:40  configmap "migration-gpu-status" deleted
22:14:40  Rollout restart rhods-dashboard deployment
22:14:41  deployment.apps/rhods-dashboard restarted
22:14:41  Waiting for up to 3 minutes until rhods-dashboard deployment is rolled out
22:14:41  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 4 out of 5 new replicas have been updated...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 4 out of 5 new replicas have been updated...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 2 old replicas are pending termination...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 2 old replicas are pending termination...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 2 old replicas are pending termination...
22:15:49  Waiting for deployment "rhods-dashboard" rollout to finish: 1 old replicas are pending termination...
22:16:15  Waiting for deployment "rhods-dashboard" rollout to finish: 1 old replicas are pending termination...
22:16:15  Waiting for deployment "rhods-dashboard" rollout to finish: 1 old replicas are pending termination...
22:16:15  Waiting for deployment "rhods-dashboard" rollout to finish: 4 of 5 updated replicas are available...
22:16:42  deployment "rhods-dashboard" successfully rolled out
22:16:42  Verifying that an AcceleratorProfiles resource was created in redhat-ods-applications
22:16:43  Name:         migrated-gpu
22:16:43  Namespace:    redhat-ods-applications
22:16:43  Labels:       <none>
22:16:43  Annotations:  <none>
22:16:43  API Version:  dashboard.opendatahub.io/v1
22:16:43  Kind:         AcceleratorProfile
22:16:43  Metadata:
22:16:43    Creation Timestamp:  2024-01-09T15:46:11Z
22:16:43    Generation:          1
22:16:43    Managed Fields:
22:16:43      API Version:  dashboard.opendatahub.io/v1
22:16:43      Fields Type:  FieldsV1
22:16:43      fieldsV1:
22:16:43        f:spec:
22:16:43          .:
22:16:43          f:displayName:
22:16:43          f:enabled:
22:16:43          f:identifier:
22:16:43          f:tolerations:
22:16:43      Manager:         unknown
22:16:43      Operation:       Update
22:16:43      Time:            2024-01-09T15:46:11Z
22:16:43    Resource Version:  2416221
22:16:43    UID:               ce19e4c7-97eb-46b6-b2f4-cc719e0d8e1b
22:16:43  Spec:
22:16:43    Display Name:  NVIDIA GPU
22:16:43    Enabled:       true
22:16:43    Identifier:    nvidia.com/gpu
22:16:43    Tolerations:
22:16:43      Effect:    NoSchedule
22:16:43      Key:       nvidia.com/gpu
22:16:43      Operator:  Exists
22:16:43  Events:        <none>
```


sonarqubecloud bot commented Jan 9, 2024

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@apodhrad
Contributor

I approve this since the PR doesn't break anything and doesn't add extra time. It might be useful for debugging.

@jstourac
Member

Note: a bunch of those shellcheck warnings could be easily fixed.

@manosnoam manosnoam merged commit 6772300 into red-hat-data-services:master Jan 10, 2024
11 checks passed
@manosnoam
Contributor Author

> Note: a bunch of those shellcheck warnings could be easily fixed.

@jstourac oh, I missed your suggestion, since the shellcheck warnings did not break the Quality Gate...
Maybe we should enforce them in the gating.
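For illustration, the most common class of shellcheck warning in scripts like this, SC2086 (unquoted variable expansion), is fixed by quoting. A hypothetical snippet, not taken from gpu_deploy.sh:

```shell
# Hypothetical helper illustrating a typical shellcheck fix (SC2086):
# quote variable expansions so values containing spaces are not word-split
# into separate arguments.
get_pods() {
  local ns="$1"
  # Before (triggers SC2086): oc get pods -n $ns
  # After:
  oc get pods -n "$ns"
}
```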
