-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gpu support #1973
Gpu support #1973
Conversation
Those helper utils to help list (and find) gpu devices
Preparation for node information update. this has to wait until the chain has the changes needed for this
This is for users to look up available GPU types on nodes Also include the SLOT in the GPU ID. The gpu id is what will be used by the user to specify which gpu to attach to his VM
-this include loading correct modules -and bind to correct devices
we need to make sure all devices inside the same gpu iommu group are bind to the vfio driver
Everything seems to be working except the CH process fails with this error: Could not mmap sparse area (offset = 0x0, size = 0x10000000): Resource busy (os error 16) Error booting VM: VmBoot(DeviceManager(VfioMapRegion(MmapArea))) Investingating what can be the error but no luck yet
} | ||
|
||
var ( | ||
//go:embed pci/pci.ids |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we keep this file always updated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, i will add a go:generate
statement that auto-download the file. the go generate
call still has to be done before building. I will add generate to the CI as well
require.NoError(t, err) | ||
|
||
for _, device := range devices { | ||
fmt.Println(device) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we here assert for gpu devices inclusion in device list instead of debug prints?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
@@ -173,3 +173,7 @@ func (r *ResourceOracle) GetHypervisor() (string, error) { | |||
|
|||
return "", nil | |||
} | |||
|
|||
func (r *ResourceOracle) GPUs() ([]PCI, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an exported function I believe it should have a comment
pciDir = "/sys/bus/pci/devices" | ||
) | ||
|
||
type Device struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we document the exported types please
if err != nil && !errors.Is(err, substrate.ErrNotFound) { | ||
return nil, fmt.Errorf("failed to check node rent state") | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can reach this line in case of error is ErrNotFound, also I don't understand the logic in error == nil && rent != 0 and how it can be affected with ErrNotFound :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as you said, this only handles any error during the "contract get" call above that is ErrNotFound. so basically return if we have an error that is not ErrNotFound.
The thing is, if there are NO rent contract for the node the call normally returns 0
but imho this is not correct it instead should return NotFound error. So i am trying to be a little bit more defensive in the code by handling the error correctly.
But anyway, if no error or ErrNotFound error was returned then (in both cases) 0 means No rent, and any other value is rented, hence later i make sure that the node is only marked as rented if and only if there is no error and the rent contract has non zero value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good to me, thanks @muhamadazmy
Fixes #1972
The code takes into account: