Support for LocalCUDACluster with MIG #674
Conversation
Thanks @akaanirban for opening this PR. I should add that this builds off of #671, so we will want that merged before this one, and as we discussed offline we'll want to have some tests here even if they don't run in CI.
As for the style errors, I suggest you...
Also, I removed my GH handle from the description: if you leave your handle there after the PR is merged, it will forever plague you with mentions whenever there are certain GH actions, such as forks and whatnot, by other users.
As we discussed offline we'll probably need something as follows:
Also as discussed offline: I don't know if there's a proper solution to that. I think in a first pass what we'll want to do is something similar to what was suggested in #583 (comment), except that we would raise a more user-friendly error, instead of...
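For illustration, a rough sketch of what a more user-friendly error could look like (the helper name, condition, and message are hypothetical, not necessarily what #583 suggests):

```python
# Hypothetical sketch: surface an explicit error instead of a low-level NVML/CUDA
# failure when a parent GPU is requested while MIG mode is enabled on it.
import pynvml


def _check_not_mig_parent(device_index):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    try:
        current_mode, _ = pynvml.nvmlDeviceGetMigMode(handle)
    except pynvml.NVMLError:
        return  # MIG queries not supported on this device/driver
    if current_mode == pynvml.NVML_DEVICE_MIG_ENABLE:
        raise ValueError(
            f"GPU {device_index} has MIG mode enabled; pass the MIG instance "
            "UUIDs via CUDA_VISIBLE_DEVICES instead of the parent GPU."
        )
```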
I think we should leave this case for a follow-up discussion; could you file an issue for that?
Also a separate discussion: once this is in, we should file an issue summarizing the status of that and discuss how to better handle it.
We probably need to bump the minimum pynvml version to 11.0.0, as that was when MIG support was added.
Codecov Report
@@            Coverage Diff             @@
##           branch-21.10     #674      +/-   ##
================================================
+ Coverage        87.83%   89.42%    +1.58%
================================================
  Files               15       15
  Lines             1652     1692       +40
================================================
+ Hits              1451     1513       +62
+ Misses             201      179       -22
================================================
Continue to review full report at Codecov.
@akaanirban apologies for taking so long to review this. I was able to test this as well and everything seems to work as expected. Thanks for the excellent work in this PR, particularly for the clever solutions in the tests. I've left a few minor suggestions/requests, mostly cosmetic with one or two exceptions, but otherwise it looks really great!
dask_cuda/tests/test_utils.py
        miguuids.append(mighandle)
    except pynvml.NVMLError:
        pass
assert len(miguuids) <= 7
Why this magical number 7? What if there are 8 or more?
A more accurate assertion would probably be `assert len(miguuids) == count`.
I added 7 because you can only partition a single MIG-enabled GPU into at most 7 independent MIG instances, but there may be fewer in reality. nvmlDeviceGetMaxMigDeviceCount gives us the maximum number of MIG devices/instances that can exist under a given parent NVML device, not the actual number currently present.
This is only true today; you never know whether it will still be the case for a future GPU, so we have to account for that instead of relying on what holds true today.
That is true, though. In that case, the upper bound should indeed be count. Let me fix this.
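For context, a minimal sketch of the enumeration being discussed (assuming pynvml>=11.0.0 and a MIG-capable GPU at index 0; not the exact test code):

```python
import pynvml

pynvml.nvmlInit()
parent = pynvml.nvmlDeviceGetHandleByIndex(0)
# Maximum number of MIG devices this parent can host, not how many exist right now.
count = pynvml.nvmlDeviceGetMaxMigDeviceCount(parent)

miguuids = []
for i in range(count):
    try:
        mighandle = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, i)
        miguuids.append(pynvml.nvmlDeviceGetUUID(mighandle))
    except pynvml.NVMLError:
        # Fewer MIG instances may be configured than the maximum supported.
        pass

assert len(miguuids) <= count
```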
if index and not str(index).isnumeric():
    # This means index is a UUID. This works for both MIG and non-MIG device UUIDs.
    handle = pynvml.nvmlDeviceGetHandleByUUID(str.encode(str(index)))
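As an aside, a small illustrative helper (not dask-cuda's actual code) showing how the index-or-UUID branch above can be used; assumes pynvml is available:

```python
import pynvml

pynvml.nvmlInit()


def get_handle(index_or_uuid):
    # Hypothetical helper: accept either a device index (e.g. "0") or a UUID
    # (e.g. "GPU-..." or "MIG-...") and return the corresponding NVML handle.
    s = str(index_or_uuid)
    if s.isnumeric():
        return pynvml.nvmlDeviceGetHandleByIndex(int(s))
    return pynvml.nvmlDeviceGetHandleByUUID(str.encode(s))
```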
@rjzamora we discussed some time ago that NVML doesn't respect CUDA_VISIBLE_DEVICES, and while that's true, one of the most user-friendly ways to get a MIG device handle is by its UUID. It seems that users would prefer something simpler, such as MIG-0-1 (meaning GPU 0, MIG instance 1, for example), instead of a UUID that looks like MIG-84fd49f2-48ad-50e8-9f2e-3bf0dfd47ccb. Do you think this is worth bringing up with the NVML (and potentially CUDA) teams?
Added all suggestions except the `len(miguuids) <= 7` Co-authored-by: Peter Andreas Entschev <[email protected]>
Thanks @akaanirban for addressing reviews and for all the work!
@gpucibot merge
Looks like the recipe requirement wasn't bumped. @akaanirban, can you please send a PR for that?
@jakirkham I should probably bump the recipe requirement to pynvml>=11.0.0 as per Jacob's comment. We have pynvml>=11.0.0
Requirements were updated in PR #883.
Adds support for starting a LocalCUDACluster and CUDA workers on MIG instances by passing in the UUIDs of the MIG instances. Builds off of existing PR #671.
More specifically, this PR allows starting a LocalCUDACluster as follows: cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=["MIG-uuid1","MIG-uuid2",...]), or by passing the UUIDs as comma-separated strings (a usage sketch follows below).
Needs Discussion:
0. Apart from manually testing on a MIG instance on the cloud, how would we test this?
1. What should happen if someone starts a LocalCUDACluster without specifying devices while using MIG instances? By default, LocalCUDACluster will try to use all the parent GPUs and run into an error.
2. dask.distributed diagnostics will also fail if we run on MIG-enabled GPUs, since they currently use pynvml APIs that only support non-MIG-enabled GPUs.