Verify GPU AcceleratorProfile is created after restarting Dashboard #1103
Add two verification steps to the gpu_deploy.sh script after restarting the rhods-dashboard deployment:
- Wait for up to 3 minutes until the deployment is rolled out
- Verify that an AcceleratorProfiles resource is created

Signed-off-by: manosnoam <[email protected]>
oc rollout status deployment.apps/rhods-dashboard -n redhat-ods-applications --watch --timeout 3m

echo "Verifying that an AcceleratorProfiles resource was created in redhat-ods-applications"
oc describe AcceleratorProfiles -n redhat-ods-applications
Why do a describe?
Would it be better to check that the instance was created?
`oc describe` both checks that the resource was created and describes it, for example:
$▶ oc describe AcceleratorProfiles -n redhat-ods-applications
Name:         migrated-gpu
Namespace:    redhat-ods-applications
Labels:       <none>
Annotations:  <none>
API Version:  dashboard.opendatahub.io/v1
Kind:         AcceleratorProfile
Metadata:
  Creation Timestamp:  2024-01-09T15:46:11Z
  Generation:          1
  Managed Fields:
    API Version:  dashboard.opendatahub.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:displayName:
        f:enabled:
        f:identifier:
        f:tolerations:
    Manager:         unknown
    Operation:       Update
    Time:            2024-01-09T15:46:11Z
  Resource Version:  2416221
  UID:               ce19e4c7-97eb-46b6-b2f4-cc719e0d8e1b
Spec:
  Display Name:  NVIDIA GPU
  Enabled:       true
  Identifier:    nvidia.com/gpu
  Tolerations:
    Effect:    NoSchedule
    Key:       nvidia.com/gpu
    Operator:  Exists
Events:        <none>
If the resource is missing, it fails with:
error: the server doesn't have a resource type "AcceleratorProfiles"
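On the reviewer's question about checking that an instance was created, one possible stricter check is sketched below. It is only an illustration, not what the PR implements: the function name `accelerator_profile_exists` is hypothetical, and it assumes that `oc get ... -o name` prints nothing on stdout when no instances exist in the namespace.

```shell
#!/bin/bash
# Hypothetical helper (not part of gpu_deploy.sh): succeeds only if at least
# one AcceleratorProfile instance exists in the given namespace.
accelerator_profile_exists() {
  local namespace="${1:-redhat-ods-applications}"
  local names
  # `-o name` prints one "kind/name" line per instance; a failing `oc`
  # (e.g. unknown resource type) makes the function return non-zero.
  names="$(oc get acceleratorprofiles -n "$namespace" -o name 2>/dev/null)" || return 1
  # An empty list means no instance was created.
  [[ -n "$names" ]]
}
```

A caller could then run `accelerator_profile_exists redhat-ods-applications || exit 1` and still call `oc describe` afterwards for the debugging output.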
The previous implementation of the function did not handle completed pods, so the script hung for ~45 minutes waiting for completed pods (e.g. nvidia-cuda-validator pods) to become running:

```
19:24:53 GPU installation seems to be still running ...
20:13:48 nvidia-cuda-validator-2jnxd 0/1 Completed 0 5h50m
20:13:48 nvidia-cuda-validator-nsf98 0/1 Completed 0 5h51m
...
21:11:34 ERROR: Timeout reached while waiting for gpu operator
```

A simple call to `oc wait --for=condition=ready pod -l app=$pod_label` within the function should resolve it.

Signed-off-by: manosnoam <[email protected]>
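The `oc wait` call proposed in the commit message could be wrapped roughly as follows. This is a minimal sketch rather than the actual function from gpu_deploy.sh: the name `wait_for_pods_ready`, the namespace argument, and the 300s default timeout are assumptions, and it presumes the label selects pods that are expected to reach the Ready condition.

```shell
#!/bin/bash
# Hypothetical wrapper around the `oc wait` call from the commit message.
# Arguments: pod label value, namespace, optional timeout (default 300s).
wait_for_pods_ready() {
  local pod_label="$1"
  local namespace="$2"
  local timeout="${3:-300s}"
  # `oc wait` blocks until every selected pod reports Ready, or fails once
  # the timeout expires, so the script cannot hang indefinitely.
  if ! oc wait --for=condition=ready pod -l "app=${pod_label}" \
      -n "$namespace" --timeout="$timeout"; then
    echo "ERROR: pods with label app=${pod_label} did not become ready within ${timeout}" >&2
    return 1
  fi
}
```

Bounding the wait with `--timeout` means a stuck pod surfaces as an error after a known delay instead of stalling the whole job.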
This has been tested on smoke test 4285.
I approve this since this PR doesn't break anything and doesn't add extra time. It might be useful for debugging.
Note: a bunch of those shellcheck warnings could be easily fixed.
@jstourac oh, I missed your suggestion, since the spell check warnings did not break the Quality Gate...
Add two verification steps to the gpu_deploy.sh script after restarting the rhods-dashboard deployment.
Note that the wait for running pods happens before the deployment is restarted.
That is why an additional wait for the rollout must be called after restarting the deployment.
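The ordering described above can be sketched as a restart followed immediately by a rollout wait. The `oc rollout status` line is taken from the diff; the function name `restart_dashboard_and_wait` and the use of `oc rollout restart` are assumptions, since the way the script actually restarts the deployment is not shown here.

```shell
#!/bin/bash
# Hypothetical sketch: restart the dashboard, then block until the new
# rollout finishes (or fail after 3 minutes).
restart_dashboard_and_wait() {
  local deployment="deployment.apps/rhods-dashboard"
  local namespace="redhat-ods-applications"
  # Assumption: the restart is done with `oc rollout restart`.
  oc rollout restart "$deployment" -n "$namespace" || return 1
  # Taken from the diff: wait for the restarted deployment to roll out.
  oc rollout status "$deployment" -n "$namespace" --watch --timeout 3m
}
```

Because the rollout wait runs after the restart, any later check (such as describing the AcceleratorProfiles resource) sees the state produced by the new dashboard pods rather than the old ones.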