
Wait for GPU Operator Subscription, InstallPlan & Deployment #1108

Merged: 3 commits merged into red-hat-data-services:master on Jan 15, 2024

Conversation

manosnoam (Contributor)

To avoid a situation where gpu_deploy.sh waits for nvidia-gpu-operator pods before they have been created by the NVIDIA GPU Operator installation, the script must first wait for the Operator Subscription, InstallPlan, and Deployment to complete.
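
For context, here is a minimal sketch of the wait sequence this change introduces, assembled from the commands quoted in the review thread and the log below; the subscription wait condition and the exact ordering are assumptions, not a copy of the script:

# Wait for the operator Subscriptions to report healthy catalog sources
# (the condition name used here is an assumption)
oc wait subscription -n nvidia-gpu-operator --all --for condition=CatalogSourcesUnhealthy=False --timeout=3m

# Wait for the generated InstallPlans to finish installing (quoted in the review thread below)
oc wait installplan -n nvidia-gpu-operator --all --for condition=Installed --timeout=3m

# Wait for the operator Deployments to roll out (both rollouts appear in the log below)
oc rollout status -n nvidia-gpu-operator deployment gpu-operator --watch --timeout=3m
oc rollout status -n nvidia-gpu-operator deployment nfd-controller-manager --watch --timeout=3m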


Robot Results

✅ Passed: 397    ❌ Failed: 0    ⏭️ Skipped: 0    Total: 397    Pass %: 100

FedeAlonso previously approved these changes Jan 11, 2024

manosnoam commented Jan 11, 2024

This has been tested both on a cluster without the NVIDIA operator and on a cluster where the operator was already installed.

Example output from running the commands on a cluster without the operator:

22:05:50  Deploying Nvidia GPU Operator
[Pipeline] sh
22:05:51  + /home/jenkins/workspace/rhods/rhods-smoke/ods-ci/ods_ci/tasks/Resources/Provisioning/GPU/gpu_deploy.sh
22:05:51  Create and apply 'gpu_install.yaml' to install Nvidia GPU Operator
22:05:53  namespace/nvidia-gpu-operator unchanged
22:05:53  operatorgroup.operators.coreos.com/nvidia-gpu-operator-group created
22:05:54  subscription.operators.coreos.com/gpu-operator-certified created
22:05:54  subscription.operators.coreos.com/nfd created
22:05:54  Wait for Nvidia GPU Operator Subscription, InstallPlan and Deployment to complete
22:06:20  subscription.operators.coreos.com/nfd condition met
22:06:20  subscription.operators.coreos.com/gpu-operator-certified condition met
22:06:38  installplan.operators.coreos.com/install-r7ghv condition met
22:06:38  installplan.operators.coreos.com/install-xtsqx condition met
22:06:38  Waiting for deployment "gpu-operator" rollout to finish: 1 old replicas are pending termination...
22:06:41  Waiting for deployment "gpu-operator" rollout to finish: 1 old replicas are pending termination...
22:06:41  deployment "gpu-operator" successfully rolled out
22:06:41  deployment "nfd-controller-manager" successfully rolled out
22:06:42  operator.operators.coreos.com/nfd.nvidia-gpu-operator condition met
22:06:42  operator.operators.coreos.com/gpu-operator-certified.nvidia-gpu-operator condition met
22:06:43  gpu-operator-694b7778cb-bxm2c   1/1   Terminating   0     30s
22:06:43  gpu-operator-c66699dfb-g4m7q    1/1   Running       0     13s
22:06:43  Waiting until GPU pods of 'gpu-operator' in namespace 'nvidia-gpu-operator' are in running state...
22:06:43  pod/gpu-operator-694b7778cb-bxm2c condition met
22:06:43  pod/gpu-operator-c66699dfb-g4m7q condition met
22:06:49  nodefeaturediscovery.nfd.openshift.io/nfd-instance created
22:06:54  clusterpolicy.nvidia.com/gpu-cluster-policy configured
22:06:54  
22:06:54  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:00  
22:07:00  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:06  
22:07:06  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:11  
22:07:11  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:16  
22:07:16  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:21  
22:07:21  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:26  
22:07:26  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:32  
22:07:32  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:37  
22:07:37  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:43  
22:07:43  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:49  
22:07:49  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:54  
22:07:54  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:07:59  
22:07:59  Waiting for pod with label app=nvidia-device-plugin-daemonset to be present...
22:08:04  nvidia-device-plugin-daemonset-fp7dk   0/1   Init:0/1   0     2s
22:08:04  Waiting until GPU pods of 'nvidia-device-plugin-daemonset' in namespace 'nvidia-gpu-operator' are in running state...
22:11:56  pod/nvidia-device-plugin-daemonset-fp7dk condition met
22:11:56  nvidia-container-toolkit-daemonset-4v4pj   1/1   Running   0     3m44s
22:11:56  Waiting until GPU pods of 'nvidia-container-toolkit-daemonset' in namespace 'nvidia-gpu-operator' are in running state...
22:11:56  pod/nvidia-container-toolkit-daemonset-4v4pj condition met
22:11:56  nvidia-dcgm-exporter-z4cqs   1/1   Running   0     3m46s
22:11:56  Waiting until GPU pods of 'nvidia-dcgm-exporter' in namespace 'nvidia-gpu-operator' are in running state...
22:11:56  pod/nvidia-dcgm-exporter-z4cqs condition met
22:11:56  gpu-feature-discovery-4p44f   1/1   Running   0     3m46s
22:11:56  Waiting until GPU pods of 'gpu-feature-discovery' in namespace 'nvidia-gpu-operator' are in running state...
22:11:56  pod/gpu-feature-discovery-4p44f condition met
22:11:56  nvidia-operator-validator-qcmd5   0/1   Init:3/4   2     3m47s
22:11:56  Waiting until GPU pods of 'nvidia-operator-validator' in namespace 'nvidia-gpu-operator' are in running state...
22:12:02  pod/nvidia-operator-validator-qcmd5 condition met
22:12:02  Deleting configmap migration-gpu-status
22:12:02  configmap "migration-gpu-status" deleted
22:12:02  Rollout restart rhods-dashboard deployment
22:12:02  deployment.apps/rhods-dashboard restarted
22:12:02  Waiting for up to 3 minutes until rhods-dashboard deployment is rolled out
22:12:03  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 3 out of 5 new replicas have been updated...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 4 out of 5 new replicas have been updated...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 2 old replicas are pending termination...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 2 old replicas are pending termination...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 2 old replicas are pending termination...
22:13:10  Waiting for deployment "rhods-dashboard" rollout to finish: 1 old replicas are pending termination...
22:14:06  Waiting for deployment "rhods-dashboard" rollout to finish: 1 old replicas are pending termination...
22:14:06  Waiting for deployment "rhods-dashboard" rollout to finish: 1 old replicas are pending termination...
22:14:06  Waiting for deployment "rhods-dashboard" rollout to finish: 4 of 5 updated replicas are available...
22:14:06  deployment "rhods-dashboard" successfully rolled out
22:14:06  Verifying that an AcceleratorProfiles resource was created in redhat-ods-applications
22:14:06  Name:         migrated-gpu
22:14:06  Namespace:    redhat-ods-applications
22:14:06  Labels:       <none>
22:14:06  Annotations:  <none>
22:14:06  API Version:  dashboard.opendatahub.io/v1
22:14:06  Kind:         AcceleratorProfile
22:14:06  Metadata:
22:14:06    Creation Timestamp:  2024-01-11T12:04:13Z
22:14:06    Generation:          1
22:14:06    Managed Fields:
22:14:06      API Version:  dashboard.opendatahub.io/v1
22:14:06      Fields Type:  FieldsV1
22:14:06      fieldsV1:
22:14:06        f:spec:
22:14:06          .:
22:14:06          f:displayName:
22:14:06          f:enabled:
22:14:06          f:identifier:
22:14:06          f:tolerations:
22:14:06      Manager:         unknown
22:14:06      Operation:       Update
22:14:06      Time:            2024-01-11T12:04:13Z
22:14:06    Resource Version:  296166
22:14:06    UID:               a0a2ff7c-efe4-4664-8d2b-b25051854493
22:14:06  Spec:
22:14:06    Display Name:  NVIDIA GPU
22:14:06    Enabled:       true
22:14:06    Identifier:    nvidia.com/gpu
22:14:06    Tolerations:
22:14:06      Effect:    NoSchedule
22:14:06      Key:       nvidia.com/gpu
22:14:06      Operator:  Exists
22:14:06  Events:        <none>


Review thread on the wait commands added in gpu_deploy.sh:

oc wait installplan -n nvidia-gpu-operator --all --for condition=Installed --timeout=3m
oc rollout status -n nvidia-gpu-operator deployment gpu-operator --watch --timeout=3m

A reviewer (Contributor) commented:

I don't think the rollout is necessary. Also, could you add a check for the NVIDIA operator as well?

manosnoam (Contributor, Author) replied on Jan 14, 2024:


I've updated the script to also wait for the gpu-operator-certified subscription and for the nfd-controller-manager deployment.
The rollout status wait is still needed; see the updated execution output in my previous comment.
I also added a wait for the Operator's resources to exist, and the wait for pods as you suggested.
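
Roughly, the additional waits described here could look like the following sketch (the --all pod selector and the plain `oc get` check are simplifications and assumptions, not the script's exact contents):

# Confirm the Operator resources exist; the log above shows the script waiting on
# operator.operators.coreos.com resources, but the exact --for argument is not visible in this thread
oc get operator nfd.nvidia-gpu-operator gpu-operator-certified.nvidia-gpu-operator

# Wait until the GPU operator pods are Ready (the script waits per workload; --all is a simplification)
oc wait pod -n nvidia-gpu-operator --all --for condition=Ready --timeout=5m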

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

manosnoam added the "verified" label (This PR has been tested with Jenkins) on Jan 14, 2024
manosnoam merged commit b0c8743 into red-hat-data-services:master on Jan 15, 2024
11 checks passed