Gpu support #1973

Merged
merged 15 commits into from
Jun 19, 2023

Conversation

muhamadazmy
Member

Fixes #1972

  • Identify and list available GPUs
  • Allow a Zmachine deployment to define a list of GPUs that can be used from the host
  • Set up vfio and friends
  • Pass the devices to the zmachine

The code takes into account:

  • Node must be rented
  • Support for multiple GPUs on the Host (zos)
  • Support passing multiple GPUs to the VMs
  • Validate that a GPU is not used by another VM
  • Make sure all devices that share IOMMU are passed together
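
The last point, passing every device of an IOMMU group together, can be sketched roughly as below. This is a minimal illustration and not the code from this PR: the names `groupPeers` and `expand` are hypothetical, and it assumes the standard Linux sysfs layout in which `/sys/bus/pci/devices/<slot>/iommu_group/devices` lists the members of a device's group.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// pciDevices is where Linux exposes PCI devices. Each entry is named after
// its slot address, for example 0000:01:00.0.
const pciDevices = "/sys/bus/pci/devices"

// groupPeers (hypothetical helper) returns the slots of every device that
// shares an IOMMU group with slot, including slot itself.
func groupPeers(root, slot string) ([]string, error) {
	// <slot>/iommu_group is a symlink to the group directory, whose
	// "devices" subdirectory holds one entry per member of the group.
	entries, err := os.ReadDir(filepath.Join(root, slot, "iommu_group", "devices"))
	if err != nil {
		return nil, fmt.Errorf("failed to list iommu group of %s: %w", slot, err)
	}
	peers := make([]string, 0, len(entries))
	for _, e := range entries {
		peers = append(peers, e.Name())
	}
	sort.Strings(peers)
	return peers, nil
}

// expand (hypothetical helper) returns the requested slots plus every device
// that shares an IOMMU group with one of them, sorted and without duplicates.
func expand(requested []string, peersOf func(slot string) ([]string, error)) ([]string, error) {
	seen := make(map[string]struct{})
	for _, slot := range requested {
		peers, err := peersOf(slot)
		if err != nil {
			return nil, err
		}
		for _, p := range peers {
			seen[p] = struct{}{}
		}
	}
	all := make([]string, 0, len(seen))
	for slot := range seen {
		all = append(all, slot)
	}
	sort.Strings(all)
	return all, nil
}

func main() {
	fromSysfs := func(slot string) ([]string, error) { return groupPeers(pciDevices, slot) }
	// 0000:01:00.0 is only an example address and may not exist on this host.
	devices, err := expand([]string{"0000:01:00.0"}, fromSysfs)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(devices)
}
```

Taking the lookup as a function argument keeps the expansion rule independent of sysfs, so it can be exercised with a fake lookup on machines without a GPU.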

Helper utilities to list (and find) GPU devices
Preparation for the node information update. This has to wait
until the chain has the changes needed for it
This lets users look up the available GPU types on nodes

Also include the SLOT in the GPU ID. The GPU ID is what
the user will use to specify which GPU to attach to their VM
- this includes loading the correct modules
- and binding to the correct devices
We need to make sure all devices inside the same GPU IOMMU group
are bound to the vfio driver
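
One common way to bind a PCI device to vfio on Linux is through the sysfs `driver_override`, `driver/unbind` and `drivers_probe` files. The sketch below shows that sequence only as an illustration; it is not taken from this PR, the names `bindPlan` and `apply` are hypothetical, and it assumes the `vfio-pci` module is already loaded.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// write is a single value to be written to a sysfs file.
type write struct{ Path, Value string }

// bindPlan (hypothetical helper) lists the sysfs writes that hand the device
// at slot over to the vfio-pci driver. sysfs is normally "/sys".
func bindPlan(sysfs, slot string) []write {
	dev := filepath.Join(sysfs, "bus/pci/devices", slot)
	return []write{
		// 1. Only vfio-pci may claim this device from now on.
		{filepath.Join(dev, "driver_override"), "vfio-pci"},
		// 2. Detach the device from whatever driver currently owns it.
		{filepath.Join(dev, "driver/unbind"), slot},
		// 3. Ask the kernel to probe again, which now selects vfio-pci.
		{filepath.Join(sysfs, "bus/pci/drivers_probe"), slot},
	}
}

// apply (hypothetical helper) performs the writes in order. It needs root
// privileges and a loaded vfio-pci module.
func apply(plan []write) error {
	for _, w := range plan {
		err := os.WriteFile(w.Path, []byte(w.Value), 0o200)
		// A device without a bound driver has no driver/unbind file.
		if errors.Is(err, os.ErrNotExist) && filepath.Base(w.Path) == "unbind" {
			continue
		}
		if err != nil {
			return fmt.Errorf("failed to write %q to %s: %w", w.Value, w.Path, err)
		}
	}
	return nil
}

func main() {
	// Print the plan only. Calling apply would change driver bindings on
	// the host. The slot address is an example.
	for _, w := range bindPlan("/sys", "0000:01:00.0") {
		fmt.Printf("echo %s > %s\n", w.Value, w.Path)
	}
}
```

To satisfy the IOMMU rule above, the same plan would be applied to every device in the GPU's group, not only to the GPU function itself.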
Everything seems to be working, except that the cloud-hypervisor (CH) process fails with this
error:

Could not mmap sparse area (offset = 0x0, size = 0x10000000): Resource busy (os error 16)
Error booting VM: VmBoot(DeviceManager(VfioMapRegion(MmapArea)))

Investigating what could be causing the error, but no luck yet
@muhamadazmy muhamadazmy marked this pull request as ready for review June 16, 2023 11:31
}

var (
//go:embed pci/pci.ids
Collaborator

How do we keep this file always updated?

Member Author

Good point, I will add a go:generate statement that auto-downloads the file. The go generate call still has to be run before building, so I will add generate to the CI as well.
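
A rough sketch of what such a directive could look like, together with a small parser for the file it downloads. Everything here is an assumption made for illustration: the `curl` command and the download URL are not taken from this PR, `parseIDs` is a hypothetical name, and the data is passed as a string so the example runs without the embedded file.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Hypothetical directive: running "go generate ./..." before the build (and
// in CI) would refresh the file. The URL is an assumption, not from the PR.
//go:generate curl -fsSL -o pci/pci.ids https://pci-ids.ucw.cz/v2.2/pci.ids

// parseIDs (hypothetical helper) reads data in the pci.ids format and maps
// "vendor" and "vendor:device" hex IDs to their names.
func parseIDs(data string) map[string]string {
	names := make(map[string]string)
	vendor := ""
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		switch {
		case line == "" || strings.HasPrefix(line, "#"):
			// Blank line or comment.
		case strings.HasPrefix(line, "C "):
			// The device class list follows the vendor list, so stop here.
			return names
		case strings.HasPrefix(line, "\t\t"):
			// Subsystem entry, not needed for this sketch.
		case strings.HasPrefix(line, "\t"):
			// Device entry: tab, ID, two spaces, name.
			id, name, ok := strings.Cut(strings.TrimPrefix(line, "\t"), "  ")
			if ok && vendor != "" {
				names[vendor+":"+id] = name
			}
		default:
			// Vendor entry: ID, two spaces, name.
			id, name, ok := strings.Cut(line, "  ")
			if ok {
				vendor = id
				names[id] = name
			}
		}
	}
	return names
}

func main() {
	// Sample lines in the pci.ids layout, used in place of the embedded file.
	sample := "# sample\n10de  NVIDIA Corporation\n\t1b80  GP104 [GeForce GTX 1080]\n"
	names := parseIDs(sample)
	fmt.Println(names["10de"], "/", names["10de:1b80"])
}
```

Because `//go:embed` fails the build when the file is missing, the generate step has to run before compilation, which matches the point about adding it to CI.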

require.NoError(t, err)

for _, device := range devices {
fmt.Println(device)
Collaborator

Shouldn't we assert here that the GPU devices are included in the device list, instead of using debug prints?

Member Author

Same as above

@@ -173,3 +173,7 @@ func (r *ResourceOracle) GetHypervisor() (string, error) {

return "", nil
}

func (r *ResourceOracle) GPUs() ([]PCI, error) {
Collaborator

As this is an exported function, I believe it should have a doc comment.

pciDir = "/sys/bus/pci/devices"
)

type Device struct {
Collaborator

@xmonader xmonader Jun 16, 2023

Can we document the exported types, please?

pkg/primitives/vm/gpu.go (outdated, resolved)
pkg/provision/engine.go (outdated, resolved)
if err != nil && !errors.Is(err, substrate.ErrNotFound) {
return nil, fmt.Errorf("failed to check node rent state")
}

Collaborator

We can reach this line when the error is ErrNotFound. Also, I don't understand the logic in error == nil && rent != 0 and how it is affected by ErrNotFound :(

Member Author

As you said, this check only catches errors from the "contract get" call above that are not ErrNotFound. So basically we return if we have an error that is not ErrNotFound.

The thing is, if there is NO rent contract for the node, the call normally returns 0, but IMHO this is not correct; it should return a NotFound error instead. So I am trying to be a little more defensive in the code by handling the error correctly.

But anyway, if no error or an ErrNotFound error was returned, then (in both cases) 0 means no rent and any other value means rented. Hence, later I make sure that the node is marked as rented if and only if there is no error and the rent contract has a non-zero value.

Collaborator

@xmonader xmonader left a comment

All looks good to me, thanks @muhamadazmy

@muhamadazmy muhamadazmy merged commit df9b1fd into main Jun 19, 2023
@muhamadazmy muhamadazmy deleted the gpu-support branch June 19, 2023 09:07

Successfully merging this pull request may close these issues.

Implement GPU support
2 participants