GPU Support #1293

Closed
andhartl opened this issue Jun 4, 2021 · 2 comments

andhartl commented Jun 4, 2021

We get a lot of questions about GPU support from AI/ML users and farmers, so the demand is definitely there.
We had GPU support on the roadmap a while ago, but I do not know where we are right now. Can we open a discussion about it?

@andhartl andhartl added the type_feature New feature or request label Jun 4, 2021
@xmonader xmonader added this to the next milestone Oct 28, 2021
@muhamadazmy
Member

I think we can start working on this now. I know it has been in the backlog for a long time. I will move it to the current active project.

Questions I need to research:

  • Working with GPUs in cloud-hypervisor
  • Can a node have multiple GPUs?
  • Tracking the GPU(s) of a node and whether each one is free to be allocated to a VM. This probably needs to be added to the node contract.
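
For the "multiple GPUs" question, a minimal sketch of how the GPUs of a machine could be enumerated through sysfs. It assumes the standard Linux layout under /sys/bus/pci/devices and treats every device with PCI base class 0x03 (display controller) as a GPU. The function name `list_gpus` and the `sysfs_root` parameter are made up for illustration and are not part of zos.

```python
import os

# PCI base class 0x03 is "display controller"
# (for example 0x0300 VGA controller, 0x0302 3D controller).
DISPLAY_BASE_CLASS = 0x03


def list_gpus(sysfs_root="/sys"):
    """Return the sorted PCI addresses of all display-class devices."""
    devices_dir = os.path.join(sysfs_root, "bus", "pci", "devices")
    gpus = []
    for addr in sorted(os.listdir(devices_dir)):
        class_file = os.path.join(devices_dir, addr, "class")
        try:
            with open(class_file) as f:
                pci_class = int(f.read().strip(), 16)
        except (OSError, ValueError):
            continue  # skip entries that cannot be read or parsed
        # The class file holds 24 bits: base class, subclass, prog-if.
        if (pci_class >> 16) == DISPLAY_BASE_CLASS:
            gpus.append(addr)
    return gpus
```

A list with more than one entry would mean the tracking has to be per GPU rather than per node.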

@OmarElawady
Contributor

OmarElawady commented Mar 17, 2022

A GPU can be attached to a VM using cloud-hypervisor by unbinding it from its driver and then binding it to the vfio driver, as described here.
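
A rough sketch of the unbind/bind step, assuming the usual sysfs interface (`driver/unbind` on the device and `new_id` on the `vfio-pci` driver). The helper names `vfio_bind_plan` and `apply_plan` are hypothetical. The first function only computes which files would be written, so it can be inspected before anything touches the machine; applying the plan needs root and a kernel with vfio-pci available.

```python
import os


def vfio_bind_plan(pci_addr, sysfs_root="/sys"):
    """Return the (path, value) writes that move a PCI device to vfio-pci."""
    dev = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr)
    with open(os.path.join(dev, "vendor")) as f:
        vendor = int(f.read().strip(), 16)
    with open(os.path.join(dev, "device")) as f:
        device = int(f.read().strip(), 16)

    plan = []
    # Unbind only if the device is currently claimed by a driver.
    if os.path.exists(os.path.join(dev, "driver")):
        plan.append((os.path.join(dev, "driver", "unbind"), pci_addr))
    # Tell vfio-pci to claim devices with this vendor/device ID pair.
    new_id = os.path.join(sysfs_root, "bus", "pci", "drivers", "vfio-pci", "new_id")
    plan.append((new_id, f"{vendor:04x} {device:04x}"))
    return plan


def apply_plan(plan):
    """Perform the writes in order (requires root on a real system)."""
    for path, value in plan:
        with open(path, "w") as f:
            f.write(value)
```

The address passed in (for example `0000:01:00.0`) is whatever `lspci -D` reports for the GPU on the machine in question.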

I have an NVIDIA GPU and couldn't do dynamic unbinding/binding while the machine was running. So instead I gave vfio control over the GPU (and other neighboring devices) through kernel params, as described here. I imagine this won't be necessary on the node, since the GPU shouldn't be bound to any driver there, but I haven't gotten to this yet.
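
For reference, a small sketch of how such kernel params could be assembled. It assumes the `vfio-pci.ids=` module parameter and, on Intel hosts, `intel_iommu=on`. The helper name `vfio_kernel_params` is hypothetical, and the IDs in the usage note are placeholders, not the ones from my machine.

```python
import re

ID_PATTERN = re.compile(r"[0-9a-f]{4}:[0-9a-f]{4}")


def vfio_kernel_params(ids, iommu_flag="intel_iommu=on"):
    """Build a kernel command-line fragment that hands devices to vfio-pci.

    ids: iterable of 'vendor:device' hex strings, e.g. '10de:1c03'.
    iommu_flag: assumption for Intel hosts; pass another value if needed.
    """
    normalized = []
    for pair in ids:
        pair = pair.strip().lower()
        if not ID_PATTERN.fullmatch(pair):
            raise ValueError(f"not a vendor:device pair: {pair!r}")
        normalized.append(pair)
    if not normalized:
        raise ValueError("at least one vendor:device pair is required")
    return f"{iommu_flag} vfio-pci.ids={','.join(normalized)}"
```

Calling `vfio_kernel_params(["10de:1c03", "10de:10f1"])` gives a string that can be appended to the boot command line. Every device of the IOMMU group has to be listed, not only the GPU itself.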

The part about "neighboring devices" is that the GPU belongs to an "IOMMU group", and the VM must control all devices belonging to that group; in my case these were an audio device and a USB device. It's possible to bypass this, but it comes with risks (I haven't read about them yet).
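
The members of a device's IOMMU group can be read from sysfs. A minimal sketch, assuming the standard `iommu_group/devices` directory under the device's sysfs entry; the function name and the `sysfs_root` parameter are only for illustration.

```python
import os


def iommu_group_members(pci_addr, sysfs_root="/sys"):
    """Return the sorted PCI addresses sharing an IOMMU group with pci_addr.

    The result includes pci_addr itself.
    """
    group_devices = os.path.join(
        sysfs_root, "bus", "pci", "devices", pci_addr, "iommu_group", "devices"
    )
    return sorted(os.listdir(group_devices))
```

If the `iommu_group` entry is missing, the listing raises `FileNotFoundError`, which usually means the IOMMU is not enabled on the host.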

The GPU appears successfully in the VM, but a driver must then be installed to use it. AFAIK, the kernel we use doesn't allow dynamic module loading, so that must be enabled (or the driver should be pre-installed(?), but that looks like a complicated solution).
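
A quick way to check from inside the VM whether the running kernel supports loadable modules at all. This sketch relies on the assumption that `/proc/modules` is only present when the kernel was built with `CONFIG_MODULES`; the function names and the `proc_root` parameter are hypothetical.

```python
import os


def kernel_supports_modules(proc_root="/proc"):
    """True if the kernel exposes a module list (built with CONFIG_MODULES)."""
    return os.path.isfile(os.path.join(proc_root, "modules"))


def loaded_modules(proc_root="/proc"):
    """Return the names of currently loaded modules, or [] if unsupported."""
    path = os.path.join(proc_root, "modules")
    if not os.path.isfile(path):
        return []
    with open(path) as f:
        # The first whitespace-separated column of each line is the module name.
        return [line.split()[0] for line in f if line.strip()]
```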

All of this was tried on my machine, not a node. I think the node's kernel must be updated to include vfio support.

TLDR:
Done:

  • attaching the GPU to the VM through cloud-hypervisor on a normal kernel
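
For the record, a sketch of what the cloud-hypervisor invocation for the attach step can look like, assuming the device is already bound to vfio. It uses the `--device path=...` option pointing at the device's sysfs directory. The helper name, the default CPU and memory values, and the file paths in the usage note are placeholders, not what zos uses.

```python
def cloud_hypervisor_args(kernel, disk, pci_addr, cpus=2, memory="2G"):
    """Build an argv list that boots a VM with one VFIO device attached."""
    return [
        "cloud-hypervisor",
        "--kernel", kernel,
        "--disk", f"path={disk}",
        "--cpus", f"boot={cpus}",
        "--memory", f"size={memory}",
        # The passed-through device is referenced by its sysfs directory.
        "--device", f"path=/sys/bus/pci/devices/{pci_addr}/",
    ]
```

For example, `cloud_hypervisor_args("vmlinux", "rootfs.img", "0000:01:00.0")` returns a list that can be passed to `subprocess.run`.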

Next:

  • Updating the node's kernel with vfio support and trying this on it instead
  • Looking into how the GPU driver can be used inside the zmachine (by updating the kernel to allow dynamic module loading)

Notes:

  • The GPU is accompanied by neighboring devices that won't be known until runtime, which might pose a security problem (we don't want a zmachine owner to control a USB device of the node).
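
One possible guard for the security note, as a sketch only: before handing a GPU to a zmachine, flag every IOMMU group member that does not sit on the same PCI slot (domain:bus:device) as the GPU. This is a hypothetical policy, not something decided in this issue, and sharing a slot does not by itself prove a function is safe to pass through.

```python
def foreign_group_members(gpu_addr, group_members):
    """Return group members that are not functions of the GPU's own slot.

    Addresses use the 'domain:bus:device.function' form, e.g. '0000:01:00.0'.
    """
    gpu_slot = gpu_addr.rsplit(".", 1)[0]
    return [m for m in group_members if m.rsplit(".", 1)[0] != gpu_slot]
```

An empty result would mean the group only contains functions of the card itself; anything else would need a decision before the GPU is offered for allocation.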

@rkhamis rkhamis modified the milestones: next, 3.1.0 Jun 20, 2022
@rkhamis rkhamis added this to 3.7.x Jun 20, 2022
@rkhamis rkhamis moved this to 🔖 Ready in 3.7.x Jun 20, 2022
@xmonader xmonader modified the milestones: 3.1.0, now Jul 4, 2022
@xmonader xmonader added this to 3.9.0 Jul 6, 2022
@xmonader xmonader removed this from 3.7.x Jul 6, 2022
@xmonader xmonader modified the milestones: 3.1.0, 3.2.0 Jul 6, 2022
@despiegk despiegk removed this from 3.9.0 Nov 14, 2022
@xmonader xmonader added this to 3.10.x Nov 17, 2022
@xmonader xmonader modified the milestones: 3.4.x, 3.5.x Nov 17, 2022
@muhamadazmy muhamadazmy moved this to Blocked in 3.10.x Mar 9, 2023
@muhamadazmy muhamadazmy mentioned this issue Mar 9, 2023
@xmonader xmonader removed this from 3.10.x Mar 22, 2023
@xmonader xmonader added this to 3.11.x Mar 22, 2023
@github-project-automation github-project-automation bot moved this to Done in 3.11.x Mar 22, 2023
This was referenced Jun 1, 2023
@muhamadazmy muhamadazmy self-assigned this Jun 7, 2023