-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define a new EDPM role for installing nvidia driver on nodes #2637
base: main
Are you sure you want to change the base?
Define a new EDPM role for installing nvidia driver on nodes #2637
Conversation
Hi @sbauza. Thanks for your PR. I'm waiting for a openstack-k8s-operators member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Thanks for the PR! ❤️ |
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/c18f1379352942ab958539d670a22b47 ✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 30m 40s |
33d9265
to
12dcdb9
Compare
4f89d15
to
7c9fa6e
Compare
recheck |
/ok-to-test |
recheck |
7c9fa6e
to
a8c455a
Compare
How would customers do this? Here is an example custom service for repo setup: https://github.com/openstack-k8s-operators/install_yamls/blob/main/devsetup/edpm/services/dataplane_v1beta1_openstackdataplaneservice_reposetup.yaml At first glance, I am very much against this change. We should use the products features to do this kind of thing. @slagle - wdyt? |
I think we'd expect customers to use the product interfaces for this type of thing. If they can't or need to run ad-hoc ansible for some reason, we want to understand that use case to see if it's something we want to add to the product. Looking at some of the tasks (such as "Regenerate initramfs" and "Create a systemd unit file that will enable SRIOV VFs"), aren't these things that a customer would need to do as well when installing this driver? Do we intend to manually document these steps, or do we intend to provide supported ansible content that customers can use? If it's the ansible content, then that should be in edpm-ansible, with an enabling service in openstack-operator. However, if we intend to only support the driver itself, and not any of the installation steps, then I suppose ci-framework can use ad-hoc methods to put it in place. That does seem more error prone though. |
I probably didn't explained correctly in the PR comment but this role wouldn't be used by our customers directly as we already document how to support vGPUs in RHOSO18 but would be rather used by our downstream CI jobs for testing this vGPU feature. In order to test that, we need to install a proprietary nvidia driver in the host but we can't use a specific role for installing that as it needs some other checks (like disabling the nouveau driver). This is why I created this role which is already tested by another internal job. HTH. |
As @sbauza mentioned we provide documentation with guidelines for installing Nvidia drivers. Since we can't cover all of the potential procedures for Nvidia GPU installation, we tell customers to refer to Nvidia's documentation based on their GPU/Driver(s) needs. Because of this, there are no plans to create dedicated roles in edpm to handle GPU driver installation and setup. This PR is not intended to be leveraged by customers and is purely for the GPUs in our downstream jobs. |
@sbauza @jamepark4 - would it be possible to link the doc's with the manual instructions that you are referring too? My opinion here is, if our customers cannot achieve this by using the documented[1] interface for a customizing the dataplane then we need to improve those interfaces. I understand that this is an area of - every customer/env is different - so shipping the exact code does not make sense, but still customers will have to do it. I fail to understand why we choose to directly interact via Ansible with the EDPM nodes in ci-framework, instead of utilizing the opportunity to ensure the products interfaces for custom dataplane services can be used. |
we are doing this in the ci framework repo because we cannot support automated installation in production. we have discuss this in the past but the there are business and leagaage reasons why we cant include the driver or automate the download. As with ceph the customistaion fo the EDPM node to install the nvidia drivers will be left to the customer to download form the nvida portal themselves and install. it was an intentional decision not to use the ability to define custom dataplane service to do this, it is possible to do but not something redhat can support in a customer env. in past releases we have worked around the installation of the driver in ci by building a overcloud image to use for GPU testing with the content pre installed. |
I understand that.
I am not going to continue the argument. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
a custom dataplen service was not used because we did not want to build a custom ansible runner image and we did not want to put the playbook inline. we also wanted to follow the patten used for ceph and the existing hci_prepare role. |
@sbauza Hi, I'm ok with your answers, let me know when you have the patch ready with the content you agreed to change and I'll give my approval. |
a8c455a
to
2e01023
Compare
New changes are detected. LGTM label has been removed. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes are made based on your comments
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/65c096a75be94e539530c85cadf0b4ad ✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 37m 26s |
2e01023
to
195b431
Compare
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/ce87840f34ca4773a0660edafd3cfc67 ✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 45m 50s |
In order to do nvidia vGPU testing, we need to deploy a specific package on EDPM nodes and do some other actions. This is a multi-stage role requring a reboot between the two stages hence the two defined phases.
195b431
to
9b0d64d
Compare
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/2ed42f7ca65241e89c61c6e152ec1499 ✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 43m 21s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice work on this role, nice and clean, thank you for the molecule test and nice README <3
PATH: "{{ cifmw_path }}" | ||
ansible.builtin.command: | ||
cmd: >- | ||
oc get OpenStackBaremetalSet -n "{{ namespace|default('openstack') }}" -o yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oc get OpenStackBaremetalSet -n "{{ namespace|default('openstack') }}" -o yaml | |
oc get OpenStackBaremetalSet -n "{{ namespace|default('openstack') }}" -o yaml |
- cifmw_edpm_nvidia_mdev_prepare_disable_nouveau | bool | ||
register: _blacklist_nouveau | ||
|
||
- name: Make sure that we defined the driver URL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this task to be the first that runs(above "Blacklist nouveau"), we should do our asserts at the beginning of the taskfile
path: /var/lib/openstack/reboot_required/ | ||
state: directory | ||
mode: "0755" | ||
- name: Create required file to enforce a reboot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a newline between these two tasks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please delete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please delete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please delete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would either delete this file or add a single debug task that explains this role should not be ran standalone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please delete
In order to do nvidia vGPU testing, we need to deploy a specific package on EDPM nodes and do some other actions.
This is a multi-stage role requring a reboot between the two stages hence the two defined phases.