
Define a new EDPM role for installing nvidia driver on nodes #2637

Open

sbauza wants to merge 1 commit into main from edpm_nvidia_mdev_prepare
Conversation


@sbauza sbauza commented Jan 9, 2025

In order to do nvidia vGPU testing, we need to deploy a specific package on EDPM nodes and do some other actions.
This is a multi-stage role requiring a reboot between the two stages, hence the two defined phases.
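For illustration, the two phases could be driven by a wrapper playbook along these lines (a hedged sketch: the host group and the phase2 file name are assumptions; only the role name and the phase1 task file appear in this PR):

- name: NVIDIA mdev preparation - phase 1 (before reboot)
  hosts: computes
  tasks:
    - name: Run phase 1 of the role
      ansible.builtin.include_role:
        name: edpm_nvidia_mdev_prepare
        tasks_from: phase1.yml

# The EDPM nodes are rebooted between the two plays.

- name: NVIDIA mdev preparation - phase 2 (after reboot)
  hosts: computes
  tasks:
    - name: Run phase 2 of the role
      ansible.builtin.include_role:
        name: edpm_nvidia_mdev_prepare
        tasks_from: phase2.yml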

Contributor

openshift-ci bot commented Jan 9, 2025

Hi @sbauza. Thanks for your PR.

I'm waiting for an openstack-k8s-operators member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


github-actions bot commented Jan 9, 2025

Thanks for the PR! ❤️
I'm marking it as a draft; once you're happy with it merging and the PR is passing CI, click the "Ready for review" button below.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/c18f1379352942ab958539d670a22b47

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 30m 40s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 17m 57s
cifmw-crc-podified-edpm-baremetal RETRY_LIMIT in 27m 37s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 7m 48s
cifmw-pod-pre-commit FAILURE in 5m 51s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 08s
✔️ build-push-container-cifmw-client SUCCESS in 36m 27s
cifmw-molecule-edpm_nvidia_mdev_prepare FAILURE in 4m 15s

@sbauza
Author

sbauza commented Jan 10, 2025

recheck

@SeanMooney
Contributor

/ok-to-test

@sbauza
Author

sbauza commented Jan 10, 2025

recheck

@sbauza sbauza force-pushed the edpm_nvidia_mdev_prepare branch from 7c9fa6e to a8c455a on January 13, 2025 08:44
@sbauza sbauza marked this pull request as ready for review on January 13, 2025 13:55
@hjensas
Contributor

hjensas commented Jan 14, 2025

How would customers do this?
Would they not use a custom service interface, or the bootstrap command interface, so that this type of package install is handled using the product's interfaces?

Here is an example custom service for repo setup: https://github.com/openstack-k8s-operators/install_yamls/blob/main/devsetup/edpm/services/dataplane_v1beta1_openstackdataplaneservice_reposetup.yaml
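For comparison, a custom service along the lines of that reposetup example might look roughly like this (a hedged sketch: the service name and the inline play are illustrative, and the field names simply mirror the linked file; this is not a supported NVIDIA-driver service):

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneService
metadata:
  name: custom-package-setup
spec:
  playbookContents: |
    - name: Install a site-specific package on EDPM nodes
      hosts: all
      become: true
      tasks:
        - name: Install the package
          ansible.builtin.package:
            # Placeholder for whatever the custom service should install.
            name: "{{ custom_package | default('example-package') }}"
            state: present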

At first glance, I am very much against this change. We should use the product's features to do this kind of thing.

@slagle - wdyt?

@slagle
Contributor

slagle commented Jan 14, 2025

How would customers do this? Would they not use a custom service interface, or the bootstrap command interface, so that this type of package install is handled using the product's interfaces?

Here is an example custom service for repo setup: https://github.com/openstack-k8s-operators/install_yamls/blob/main/devsetup/edpm/services/dataplane_v1beta1_openstackdataplaneservice_reposetup.yaml

At first glance, I am very much against this change. We should use the product's features to do this kind of thing.

@slagle - wdyt?

I think we'd expect customers to use the product interfaces for this type of thing. If they can't or need to run ad-hoc ansible for some reason, we want to understand that use case to see if it's something we want to add to the product.

Looking at some of the tasks (such as "Regenerate initramfs" and "Create a systemd unit file that will enable SRIOV VFs"), aren't these things that a customer would need to do as well when installing this driver?
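(For context, such steps typically look something like the sketch below; the unit file and the sriov-manage path are illustrative assumptions, not tasks taken from this PR.)

- name: Create a systemd unit file that will enable SRIOV VFs
  become: true
  ansible.builtin.copy:
    dest: /etc/systemd/system/nvidia-sriov-vfs.service
    mode: "0644"
    content: |
      [Unit]
      Description=Enable SR-IOV VFs on the NVIDIA GPU
      After=network-online.target

      [Service]
      Type=oneshot
      # Path assumes the NVIDIA vGPU driver package is installed.
      ExecStart=/usr/lib/nvidia/sriov-manage -e ALL

      [Install]
      WantedBy=multi-user.target

- name: Regenerate initramfs
  become: true
  ansible.builtin.command:
    cmd: dracut --force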

Do we intend to manually document these steps, or do we intend to provide supported ansible content that customers can use? If it's the ansible content, then that should be in edpm-ansible, with an enabling service in openstack-operator.

However, if we intend to only support the driver itself, and not any of the installation steps, then I suppose ci-framework can use ad-hoc methods to put it in place. That does seem more error prone though.

@sbauza
Author

sbauza commented Jan 14, 2025

I think we'd expect customers to use the product interfaces for this type of thing. If they can't or need to run ad-hoc ansible for some reason, we want to understand that use case to see if it's something we want to add to the product.

Looking at some of the tasks (such as "Regenerate initramfs" and "Create a systemd unit file that will enable SRIOV VFs"), aren't these things that a customer would need to do as well when installing this driver?

Do we intend to manually document these steps, or do we intend to provide supported ansible content that customers can use? If it's the ansible content, then that should be in edpm-ansible, with an enabling service in openstack-operator.

However, if we intend to only support the driver itself, and not any of the installation steps, then I suppose ci-framework can use ad-hoc methods to put it in place. That does seem more error prone though.

I probably didn't explain it correctly in the PR comment, but this role wouldn't be used by our customers directly, as we already document how to support vGPUs in RHOSO18; rather, it would be used by our downstream CI jobs for testing this vGPU feature. In order to test that, we need to install a proprietary nvidia driver on the host, but we can't use an existing role for installing it, as it needs some other checks (like disabling the nouveau driver). This is why I created this role, which is already tested by another internal job.
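(For the curious, the nouveau check mentioned here usually amounts to something like the following sketch; it reuses the variable names visible in the review excerpts, but the task body is an illustration rather than the role's exact code.)

- name: Blacklist nouveau
  become: true
  ansible.builtin.copy:
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    mode: "0644"
    content: |
      blacklist nouveau
      options nouveau modeset=0
  when:
    - cifmw_edpm_nvidia_mdev_prepare_disable_nouveau | bool
  register: _blacklist_nouveau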

HTH.

@jamepark4
Contributor

As @sbauza mentioned we provide documentation with guidelines for installing Nvidia drivers. Since we can't cover all of the potential procedures for Nvidia GPU installation, we tell customers to refer to Nvidia's documentation based on their GPU/Driver(s) needs. Because of this, there are no plans to create dedicated roles in edpm to handle GPU driver installation and setup. This PR is not intended to be leveraged by customers and is purely for the GPUs in our downstream jobs.

@hjensas
Contributor

hjensas commented Jan 15, 2025

@sbauza @jamepark4 - would it be possible to link the docs with the manual instructions that you are referring to?

My opinion here is: if our customers cannot achieve this by using the documented[1] interface for customizing the dataplane, then we need to improve those interfaces.

I understand that this is an area where every customer/env is different, so shipping the exact code does not make sense, but customers will still have to do it. I fail to understand why we choose to interact directly with the EDPM nodes via Ansible in ci-framework, instead of using the opportunity to ensure the product's interfaces for custom dataplane services can be used.

[1] https://docs.redhat.com/en/documentation/red_hat_openstack_services_on_openshift/18.0/html/customizing_the_red_hat_openstack_services_on_openshift_deployment/assembly_customizing-the-data-plane#con_data-plane-services_custom_dataplane

@SeanMooney
Contributor

@sbauza @jamepark4 - would it be possible to link the docs with the manual instructions that you are referring to?

My opinion here is: if our customers cannot achieve this by using the documented[1] interface for customizing the dataplane, then we need to improve those interfaces.

I understand that this is an area where every customer/env is different, so shipping the exact code does not make sense, but customers will still have to do it. I fail to understand why we choose to interact directly with the EDPM nodes via Ansible in ci-framework, instead of using the opportunity to ensure the product's interfaces for custom dataplane services can be used.

[1] https://docs.redhat.com/en/documentation/red_hat_openstack_services_on_openshift/18.0/html/customizing_the_red_hat_openstack_services_on_openshift_deployment/assembly_customizing-the-data-plane#con_data-plane-services_custom_dataplane

We are doing this in the ci-framework repo because we cannot support automated installation in production. We have discussed this in the past, but there are business and legal reasons why we can't include the driver or automate the download.

As with ceph, the customisation of the EDPM node to install the nvidia drivers will be left to the customer: they download the drivers from the nvidia portal themselves and install them.

It was an intentional decision not to use the ability to define a custom dataplane service to do this; it is possible, but not something Red Hat can support in a customer env.

In past releases we worked around the installation of the driver in CI by building an overcloud image for GPU testing with the content pre-installed.
That is not a workflow we are really supporting with the new installer, which is why this is being done as a post-host-provisioning role.

@hjensas
Contributor

hjensas commented Jan 15, 2025

@sbauza @jamepark4 - would it be possible to link the docs with the manual instructions that you are referring to?
My opinion here is: if our customers cannot achieve this by using the documented[1] interface for customizing the dataplane, then we need to improve those interfaces.
I understand that this is an area where every customer/env is different, so shipping the exact code does not make sense, but customers will still have to do it. I fail to understand why we choose to interact directly with the EDPM nodes via Ansible in ci-framework, instead of using the opportunity to ensure the product's interfaces for custom dataplane services can be used.
[1] https://docs.redhat.com/en/documentation/red_hat_openstack_services_on_openshift/18.0/html/customizing_the_red_hat_openstack_services_on_openshift_deployment/assembly_customizing-the-data-plane#con_data-plane-services_custom_dataplane

We are doing this in the ci-framework repo because we cannot support automated installation in production. We have discussed this in the past, but there are business and legal reasons why we can't include the driver or automate the download.

As with ceph, the customisation of the EDPM node to install the nvidia drivers will be left to the customer: they download the drivers from the nvidia portal themselves and install them.

I understand that.

It was an intentional decision not to use the ability to define a custom dataplane service to do this; it is possible, but not something Red Hat can support in a customer env.

I am not going to continue the argument. :)
But the fact that a choice was made here not to utilize a custom dataplane service makes me wonder if the custom service interface is hard to use, or if ci-framework is lacking something that would make it easy to add a custom dataplane service.

Contributor

@hjensas hjensas left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 15, 2025
@SeanMooney
Contributor

A custom dataplane service was not used because we did not want to build a custom ansible runner image, and we did not want to put the playbook inline.

We also wanted to follow the pattern used for ceph and the existing hci_prepare role.
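(For context, wiring the phase-1 playbook in as a ci-framework hook, similar to hci_prepare, could look roughly like this; the hook stage and fields below are assumptions about the framework's hook mechanism, not something defined by this PR.)

pre_deploy:
  - name: Prepare NVIDIA mdev devices (phase 1)
    type: playbook
    source: nvidia-mdev-phase1.yml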

galaxy.yml Outdated Show resolved Hide resolved
playbooks/nvidia-mdev-phase1.yml Outdated Show resolved Hide resolved
roles/edpm_nvidia_mdev_prepare/README.md Outdated Show resolved Hide resolved
roles/edpm_nvidia_mdev_prepare/tasks/phase1.yml Outdated Show resolved Hide resolved
@pablintino
Collaborator

@sbauza Hi, I'm OK with your answers; let me know when you have the patch ready with the content you agreed to change, and I'll give my approval.

@sbauza sbauza force-pushed the edpm_nvidia_mdev_prepare branch from a8c455a to 2e01023 on February 11, 2025 10:34
@openshift-ci openshift-ci bot removed the lgtm label Feb 11, 2025
Contributor

openshift-ci bot commented Feb 11, 2025

New changes are detected. LGTM label has been removed.

Contributor

openshift-ci bot commented Feb 11, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bshewale for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Author

@sbauza sbauza left a comment


Changes are made based on your comments.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/65c096a75be94e539530c85cadf0b4ad

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 37m 26s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 20m 57s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 21m 50s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 7m 58s
cifmw-pod-pre-commit FAILURE in 7m 15s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 31s
✔️ build-push-container-cifmw-client SUCCESS in 18m 36s
✔️ cifmw-molecule-edpm_nvidia_mdev_prepare SUCCESS in 8m 01s

@sbauza sbauza force-pushed the edpm_nvidia_mdev_prepare branch from 2e01023 to 195b431 on February 11, 2025 14:17

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/ce87840f34ca4773a0660edafd3cfc67

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 45m 50s
podified-multinode-edpm-deployment-crc FAILURE in 1h 12m 34s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 30m 33s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 25s
cifmw-pod-pre-commit FAILURE in 7m 50s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 46s
✔️ build-push-container-cifmw-client SUCCESS in 18m 16s
✔️ cifmw-molecule-edpm_nvidia_mdev_prepare SUCCESS in 8m 06s

In order to do nvidia vGPU testing, we need to deploy a specific package on
EDPM nodes and do some other actions.
This is a multi-stage role requiring a reboot between the two stages, hence the
two defined phases.
@sbauza sbauza force-pushed the edpm_nvidia_mdev_prepare branch from 195b431 to 9b0d64d on February 11, 2025 16:32

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/2ed42f7ca65241e89c61c6e152ec1499

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 43m 21s
podified-multinode-edpm-deployment-crc FAILURE in 1h 11m 43s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 29m 31s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 29s
✔️ cifmw-pod-pre-commit SUCCESS in 7m 27s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 49s
✔️ build-push-container-cifmw-client SUCCESS in 23m 00s
✔️ cifmw-molecule-edpm_nvidia_mdev_prepare SUCCESS in 6m 59s

Collaborator

@lewisdenny lewisdenny left a comment

Really nice work on this role, nice and clean, thank you for the molecule test and nice README <3

    PATH: "{{ cifmw_path }}"
  ansible.builtin.command:
    cmd: >-
      oc get OpenStackBaremetalSet -n "{{ namespace|default('openstack') }}" -o yaml
Collaborator

Suggested change
oc get OpenStackBaremetalSet -n "{{ namespace|default('openstack') }}" -o yaml
oc get OpenStackBaremetalSet -n "{{ namespace|default('openstack') }}" -o yaml

    - cifmw_edpm_nvidia_mdev_prepare_disable_nouveau | bool
  register: _blacklist_nouveau

- name: Make sure that we defined the driver URL
Collaborator

I would move this task to be the first that runs (above "Blacklist nouveau"); we should do our asserts at the beginning of the taskfile.
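For example, the assertion being discussed could sit at the top of the task file roughly as follows (the driver URL variable name is an assumption):

- name: Make sure that we defined the driver URL
  ansible.builtin.assert:
    that:
      - cifmw_edpm_nvidia_mdev_prepare_driver_url is defined
      - cifmw_edpm_nvidia_mdev_prepare_driver_url | length > 0
    fail_msg: >-
      cifmw_edpm_nvidia_mdev_prepare_driver_url must point to the NVIDIA driver
      package before running this role.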

    path: /var/lib/openstack/reboot_required/
    state: directory
    mode: "0755"
- name: Create required file to enforce a reboot
Collaborator

Please add a newline between these two tasks

Collaborator

Please delete

Collaborator

Please delete

Collaborator

Please delete

Collaborator

I would either delete this file or add a single debug task that explains this role should not be run standalone.

Collaborator

Please delete
