
Define a new EDPM role for installing nvidia driver on nodes #2637

Open

sbauza wants to merge 1 commit into main from edpm_nvidia_mdev_prepare
Conversation


@sbauza sbauza commented Jan 9, 2025

In order to do nvidia vGPU testing, we need to deploy a specific package on EDPM nodes and do some other actions.
This is a multi-stage role requiring a reboot between the two stages, hence the two defined phases.
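For illustration, the two phases could be driven by a wrapper playbook along these lines (a hedged sketch: the host group and the phase2 file name are assumptions; only the role name and the phase1 task file appear in this PR):

- name: NVIDIA mdev preparation - phase 1 (before reboot)
  hosts: computes
  tasks:
    - name: Run phase 1 of the role
      ansible.builtin.include_role:
        name: edpm_nvidia_mdev_prepare
        tasks_from: phase1.yml

# The EDPM nodes are rebooted between the two plays.

- name: NVIDIA mdev preparation - phase 2 (after reboot)
  hosts: computes
  tasks:
    - name: Run phase 2 of the role
      ansible.builtin.include_role:
        name: edpm_nvidia_mdev_prepare
        tasks_from: phase2.yml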

Contributor

openshift-ci bot commented Jan 9, 2025

Hi @sbauza. Thanks for your PR.

I'm waiting for an openstack-k8s-operators member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


github-actions bot commented Jan 9, 2025

Thanks for the PR! ❤️
I'm marking it as a draft; once you're happy with it merging and the PR is passing CI, click the "Ready for review" button below.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/c18f1379352942ab958539d670a22b47

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 30m 40s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 17m 57s
cifmw-crc-podified-edpm-baremetal RETRY_LIMIT in 27m 37s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 7m 48s
cifmw-pod-pre-commit FAILURE in 5m 51s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 08s
✔️ build-push-container-cifmw-client SUCCESS in 36m 27s
cifmw-molecule-edpm_nvidia_mdev_prepare FAILURE in 4m 15s

@sbauza
Author

sbauza commented Jan 10, 2025

recheck

@SeanMooney
Contributor

/ok-to-test

@sbauza
Author

sbauza commented Jan 10, 2025

recheck

@sbauza sbauza force-pushed the edpm_nvidia_mdev_prepare branch from 7c9fa6e to a8c455a on January 13, 2025 08:44
@sbauza sbauza marked this pull request as ready for review on January 13, 2025 13:55
@hjensas
Contributor

hjensas commented Jan 14, 2025

How would customers do this?
Would they not use a custom service interface, or the bootstrap command interface, so that this type of package install is handled using the product's interfaces?

Here is an example custom service for repo setup: https://github.com/openstack-k8s-operators/install_yamls/blob/main/devsetup/edpm/services/dataplane_v1beta1_openstackdataplaneservice_reposetup.yaml
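For comparison, a custom service along the lines of that reposetup example might look roughly like this (a hedged sketch: the service name and the inline play are illustrative, and the field names simply mirror the linked file; this is not a supported NVIDIA-driver service):

apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneService
metadata:
  name: custom-package-setup
spec:
  playbookContents: |
    - name: Install a site-specific package on EDPM nodes
      hosts: all
      become: true
      tasks:
        - name: Install the package
          ansible.builtin.package:
            # Placeholder for whatever the custom service should install.
            name: "{{ custom_package | default('example-package') }}"
            state: present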

At first glance, I am very much against this change. We should use the product's features to do this kind of thing.

@slagle - wdyt?

@slagle
Contributor

slagle commented Jan 14, 2025

How would customers do this? Would they not use a custom service interface, or the bootstrap command interface, so that this type of package install is handled using the product's interfaces?

Here is an example custom service for repo setup: https://github.com/openstack-k8s-operators/install_yamls/blob/main/devsetup/edpm/services/dataplane_v1beta1_openstackdataplaneservice_reposetup.yaml

At first glance, I am very much against this change. We should use the product's features to do this kind of thing.

@slagle - wdyt?

I think we'd expect customers to use the product interfaces for this type of thing. If they can't or need to run ad-hoc ansible for some reason, we want to understand that use case to see if it's something we want to add to the product.

Looking at some of the tasks (such as "Regenerate initramfs" and "Create a systemd unit file that will enable SRIOV VFs"), aren't these things that a customer would need to do as well when installing this driver?
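(For context, such steps typically look something like the sketch below; the unit file and the sriov-manage path are illustrative assumptions, not tasks taken from this PR.)

- name: Create a systemd unit file that will enable SRIOV VFs
  become: true
  ansible.builtin.copy:
    dest: /etc/systemd/system/nvidia-sriov-vfs.service
    mode: "0644"
    content: |
      [Unit]
      Description=Enable SR-IOV VFs on the NVIDIA GPU
      After=network-online.target

      [Service]
      Type=oneshot
      # Path assumes the NVIDIA vGPU driver package is installed.
      ExecStart=/usr/lib/nvidia/sriov-manage -e ALL

      [Install]
      WantedBy=multi-user.target

- name: Regenerate initramfs
  become: true
  ansible.builtin.command:
    cmd: dracut --force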

Do we intend to manually document these steps, or do we intend to provide supported ansible content that customers can use? If it's the ansible content, then that should be in edpm-ansible, with an enabling service in openstack-operator.

However, if we intend to only support the driver itself, and not any of the installation steps, then I suppose ci-framework can use ad-hoc methods to put it in place. That does seem more error prone though.

@sbauza
Author

sbauza commented Jan 14, 2025

I think we'd expect customers to use the product interfaces for this type of thing. If they can't or need to run ad-hoc ansible for some reason, we want to understand that use case to see if it's something we want to add to the product.

Looking at some of the tasks (such as "Regenerate initramfs" and "Create a systemd unit file that will enable SRIOV VFs"), aren't these things that a customer would need to do as well when installing this driver?

Do we intend to manually document these steps, or do we intend to provide supported ansible content that customers can use? If it's the ansible content, then that should be in edpm-ansible, with an enabling service in openstack-operator.

However, if we intend to only support the driver itself, and not any of the installation steps, then I suppose ci-framework can use ad-hoc methods to put it in place. That does seem more error prone though.

I probably didn't explain it correctly in the PR comment, but this role wouldn't be used by our customers directly, as we already document how to support vGPUs in RHOSO18; rather, it would be used by our downstream CI jobs for testing this vGPU feature. In order to test that, we need to install a proprietary nvidia driver on the host, but we can't use an existing role for installing it, as it needs some other checks (like disabling the nouveau driver). This is why I created this role, which is already tested by another internal job.
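(For the curious, the nouveau check mentioned here usually amounts to something like the following sketch; it reuses the variable names visible in the review excerpts, but the task body is an illustration rather than the role's exact code.)

- name: Blacklist nouveau
  become: true
  ansible.builtin.copy:
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    mode: "0644"
    content: |
      blacklist nouveau
      options nouveau modeset=0
  when:
    - cifmw_edpm_nvidia_mdev_prepare_disable_nouveau | bool
  register: _blacklist_nouveau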

HTH.

@jamepark4
Contributor

As @sbauza mentioned we provide documentation with guidelines for installing Nvidia drivers. Since we can't cover all of the potential procedures for Nvidia GPU installation, we tell customers to refer to Nvidia's documentation based on their GPU/Driver(s) needs. Because of this, there are no plans to create dedicated roles in edpm to handle GPU driver installation and setup. This PR is not intended to be leveraged by customers and is purely for the GPUs in our downstream jobs.

@hjensas
Contributor

hjensas commented Jan 15, 2025

@sbauza @jamepark4 - would it be possible to link the docs with the manual instructions that you are referring to?

My opinion here is: if our customers cannot achieve this by using the documented[1] interface for customizing the dataplane, then we need to improve those interfaces.

I understand that this is an area where every customer/env is different, so shipping the exact code does not make sense, but customers will still have to do it. I fail to understand why we choose to interact directly with the EDPM nodes via Ansible in ci-framework, instead of using the opportunity to ensure the product's interfaces for custom dataplane services can be used.

[1] https://docs.redhat.com/en/documentation/red_hat_openstack_services_on_openshift/18.0/html/customizing_the_red_hat_openstack_services_on_openshift_deployment/assembly_customizing-the-data-plane#con_data-plane-services_custom_dataplane

@SeanMooney
Contributor

@sbauza @jamepark4 - would it be possible to link the docs with the manual instructions that you are referring to?

My opinion here is: if our customers cannot achieve this by using the documented[1] interface for customizing the dataplane, then we need to improve those interfaces.

I understand that this is an area where every customer/env is different, so shipping the exact code does not make sense, but customers will still have to do it. I fail to understand why we choose to interact directly with the EDPM nodes via Ansible in ci-framework, instead of using the opportunity to ensure the product's interfaces for custom dataplane services can be used.

[1] https://docs.redhat.com/en/documentation/red_hat_openstack_services_on_openshift/18.0/html/customizing_the_red_hat_openstack_services_on_openshift_deployment/assembly_customizing-the-data-plane#con_data-plane-services_custom_dataplane

We are doing this in the ci-framework repo because we cannot support automated installation in production. We have discussed this in the past, but there are business and legal reasons why we can't include the driver or automate the download.

As with ceph, the customisation of the EDPM node to install the nvidia drivers will be left to the customer: they download the drivers from the nvidia portal themselves and install them.

It was an intentional decision not to use the ability to define a custom dataplane service to do this; it is possible, but not something Red Hat can support in a customer env.

In past releases we worked around the installation of the driver in CI by building an overcloud image for GPU testing with the content pre-installed.
That is not a workflow we are really supporting with the new installer, which is why this is being done as a post-host-provisioning role.

@hjensas
Contributor

hjensas commented Jan 15, 2025

@sbauza @jamepark4 - would it be possible to link the docs with the manual instructions that you are referring to?
My opinion here is: if our customers cannot achieve this by using the documented[1] interface for customizing the dataplane, then we need to improve those interfaces.
I understand that this is an area where every customer/env is different, so shipping the exact code does not make sense, but customers will still have to do it. I fail to understand why we choose to interact directly with the EDPM nodes via Ansible in ci-framework, instead of using the opportunity to ensure the product's interfaces for custom dataplane services can be used.
[1] https://docs.redhat.com/en/documentation/red_hat_openstack_services_on_openshift/18.0/html/customizing_the_red_hat_openstack_services_on_openshift_deployment/assembly_customizing-the-data-plane#con_data-plane-services_custom_dataplane

We are doing this in the ci-framework repo because we cannot support automated installation in production. We have discussed this in the past, but there are business and legal reasons why we can't include the driver or automate the download.

As with ceph, the customisation of the EDPM node to install the nvidia drivers will be left to the customer: they download the drivers from the nvidia portal themselves and install them.

I understand that.

It was an intentional decision not to use the ability to define a custom dataplane service to do this; it is possible, but not something Red Hat can support in a customer env.

I am not going to continue the argument. :)
But the fact that a choice was made here not to utilize a custom dataplane service makes me wonder if the custom service interface is hard to use, or if ci-framework is lacking something that would make it easy to add a custom dataplane service.

Contributor

@hjensas hjensas left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 15, 2025
@SeanMooney
Contributor

A custom dataplane service was not used because we did not want to build a custom ansible runner image, and we did not want to put the playbook inline.

We also wanted to follow the pattern used for ceph and the existing hci_prepare role.
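(For context, wiring the phase-1 playbook in as a ci-framework hook, similar to hci_prepare, could look roughly like this; the hook stage and fields below are assumptions about the framework's hook mechanism, not something defined by this PR.)

pre_deploy:
  - name: Prepare NVIDIA mdev devices (phase 1)
    type: playbook
    source: nvidia-mdev-phase1.yml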

galaxy.yml Outdated Show resolved Hide resolved
playbooks/nvidia-mdev-phase1.yml Outdated Show resolved Hide resolved
roles/edpm_nvidia_mdev_prepare/README.md Outdated Show resolved Hide resolved
roles/edpm_nvidia_mdev_prepare/tasks/phase1.yml Outdated Show resolved Hide resolved
@pablintino
Collaborator

@sbauza Hi, I'm OK with your answers; let me know when you have the patch ready with the content you agreed to change, and I'll give my approval.

@sbauza sbauza force-pushed the edpm_nvidia_mdev_prepare branch from a8c455a to 2e01023 on February 11, 2025 10:34
@openshift-ci openshift-ci bot removed the lgtm label Feb 11, 2025
Contributor

openshift-ci bot commented Feb 11, 2025

New changes are detected. LGTM label has been removed.

Contributor

openshift-ci bot commented Feb 11, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bshewale for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Author

@sbauza sbauza left a comment


Changes are made based on your comments.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/65c096a75be94e539530c85cadf0b4ad

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 37m 26s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 20m 57s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 21m 50s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 7m 58s
cifmw-pod-pre-commit FAILURE in 7m 15s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 31s
✔️ build-push-container-cifmw-client SUCCESS in 18m 36s
✔️ cifmw-molecule-edpm_nvidia_mdev_prepare SUCCESS in 8m 01s

@sbauza sbauza force-pushed the edpm_nvidia_mdev_prepare branch from 2e01023 to 195b431 on February 11, 2025 14:17

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/ce87840f34ca4773a0660edafd3cfc67

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 45m 50s
podified-multinode-edpm-deployment-crc FAILURE in 1h 12m 34s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 30m 33s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 25s
cifmw-pod-pre-commit FAILURE in 7m 50s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 46s
✔️ build-push-container-cifmw-client SUCCESS in 18m 16s
✔️ cifmw-molecule-edpm_nvidia_mdev_prepare SUCCESS in 8m 06s

In order to do nvidia vGPU testing, we need to deploy a specific package on
EDPM nodes and do some other actions.
This is a multi-stage role requiring a reboot between the two stages, hence the
two defined phases.
@sbauza sbauza force-pushed the edpm_nvidia_mdev_prepare branch from 195b431 to 9b0d64d on February 11, 2025 16:32

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/2ed42f7ca65241e89c61c6e152ec1499

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 43m 21s
podified-multinode-edpm-deployment-crc FAILURE in 1h 11m 43s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 29m 31s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 29s
✔️ cifmw-pod-pre-commit SUCCESS in 7m 27s
✔️ cifmw-pod-zuul-files SUCCESS in 4m 49s
✔️ build-push-container-cifmw-client SUCCESS in 23m 00s
✔️ cifmw-molecule-edpm_nvidia_mdev_prepare SUCCESS in 6m 59s

Collaborator

@lewisdenny lewisdenny left a comment

Really nice work on this role, nice and clean, thank you for the molecule test and nice README <3

    PATH: "{{ cifmw_path }}"
  ansible.builtin.command:
    cmd: >-
      oc get OpenStackBaremetalSet -n "{{ namespace|default('openstack') }}" -o yaml
Collaborator

Suggested change
oc get OpenStackBaremetalSet -n "{{ namespace|default('openstack') }}" -o yaml
oc get OpenStackBaremetalSet -n "{{ namespace|default('openstack') }}" -o yaml

    - cifmw_edpm_nvidia_mdev_prepare_disable_nouveau | bool
  register: _blacklist_nouveau

- name: Make sure that we defined the driver URL
Collaborator

I would move this task to be the first that runs (above "Blacklist nouveau"); we should do our asserts at the beginning of the taskfile.
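For example, the assertion being discussed could sit at the top of the task file roughly as follows (the driver URL variable name is an assumption):

- name: Make sure that we defined the driver URL
  ansible.builtin.assert:
    that:
      - cifmw_edpm_nvidia_mdev_prepare_driver_url is defined
      - cifmw_edpm_nvidia_mdev_prepare_driver_url | length > 0
    fail_msg: >-
      cifmw_edpm_nvidia_mdev_prepare_driver_url must point to the NVIDIA driver
      package before running this role.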

    path: /var/lib/openstack/reboot_required/
    state: directory
    mode: "0755"
- name: Create required file to enforce a reboot
Collaborator

Please add a newline between these two tasks

Collaborator

Please delete

Collaborator

Please delete

Collaborator

Please delete

Collaborator

I would either delete this file or add a single debug task that explains this role should not be run standalone.

Collaborator

Please delete
