
Support for LocalCUDACluster with MIG #674

Merged
9 commits merged into rapidsai:branch-21.10 on Aug 2, 2021

Conversation

@akaanirban (Contributor) commented Jul 16, 2021

Adds support for starting LocalCUDACluster and CUDA workers on MIG instances by passing in the UUIDs of the MIG instances. Builds off of existing PR #671.
More specifically, this PR does the following:

  1. Allows starting LocalCUDACluster as cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=["MIG-uuid1", "MIG-uuid2", ...]), or by passing the UUIDs as a single comma-separated string (see the sketch below).
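For illustration, a minimal sketch of the usage described above; the MIG-uuid values are placeholders taken from the description, not real device UUIDs:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Placeholder MIG UUIDs; on a real system these come from `nvidia-smi -L` or pynvml.
mig_uuids = ["MIG-uuid1", "MIG-uuid2"]

# Pass the MIG instances either as a list of UUID strings...
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=mig_uuids)

# ...or as a single comma-separated string.
# cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="MIG-uuid1,MIG-uuid2")

client = Client(cluster)
```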

Needs Discussion:

  1. Apart from manually testing on a MIG instance on the cloud, how would we test this?
  2. What if the user does not pass in any argument to LocalCUDACluster while using MIG instances? By default LocalCUDACluster will try to use all the parent GPUs and run into an error.
  3. What if we have a deployment with both MIG-enabled and non-MIG-enabled GPUs?
  4. dask.distributed diagnostics will also fail if we run on MIG-enabled GPUs, since at the moment it uses pynvml APIs for non-MIG-enabled GPUs only.

@akaanirban requested a review from a team as a code owner July 16, 2021 16:52
@github-actions bot added the python (python code needed) label Jul 16, 2021
@pentschev added the 2 - In Progress (Currently a work in progress), feature request (New feature or request), and non-breaking (Non-breaking change) labels Jul 16, 2021
@pentschev (Member)

Thanks @akaanirban for opening this PR. I should add that this builds off of #671, so we will want that merged before this one; and, as we discussed offline, we'll want to have some tests here even if they don't run in CI.

@pentschev (Member)

As for the style errors, I suggest you pip install pre-commit and then run pre-commit install inside the dask-cuda directory; that will automatically fix the issues on new commits, or at least report what the flake8 issues are.

@pentschev (Member)

Also, I removed my GH handle from the description: if your handle is there after the PR is merged, it will forever plague you with mentions whenever other users trigger certain GH actions, such as forks and whatnot.

@pentschev (Member)

  1. Apart from manually testing on a MIG instance on the cloud, how would we test this?

As we discussed offline, we'll probably need something like the following:

  • Programmatically find whether there are MIG instances set up, and skip the tests if not;
  • If there are MIG instances, use all or some of them (whose UUIDs we'll also have to query programmatically) to set up a cluster and ensure they work; see the sketch below.
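A rough sketch of that test guard, assuming plain pynvml calls; the helper name get_mig_uuids and the test name are hypothetical, not part of dask-cuda:

```python
import pynvml
import pytest


def get_mig_uuids():
    """Hypothetical helper: UUIDs of the MIG instances currently present."""
    pynvml.nvmlInit()
    uuids = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        parent = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            max_migs = pynvml.nvmlDeviceGetMaxMigDeviceCount(parent)
        except pynvml.NVMLError:
            continue  # GPU does not support MIG
        for j in range(max_migs):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, j)
                uuids.append(pynvml.nvmlDeviceGetUUID(mig))
            except pynvml.NVMLError:
                pass  # no MIG instance configured at this index
    return uuids


MIG_UUIDS = get_mig_uuids()


@pytest.mark.skipif(not MIG_UUIDS, reason="No MIG instances configured")
def test_local_cuda_cluster_with_mig():
    # Use all (or a subset) of the discovered MIG UUIDs to start the cluster.
    ...
```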
  2. What if the user does not pass in any argument to LocalCUDACluster while using MIG instances? By default LocalCUDACluster will try to use all the parent GPUs and run into an error.

Also, as discussed offline:

I don't know if there's a proper solution to that. I think in a first pass what we'll want to do is something similar to what was suggested in #583 (comment), except that we would raise a more user-friendly error: instead of Insufficient Permissions we would write something like NVML failed with Insufficient Permissions, this may be due to MIG instances on the system, make sure you explicitly specify which GPUs are to be used with 'CUDA_VISIBLE_DEVICES'. We should probably also document this error and link to it somewhere in https://docs.rapids.ai/api/dask-cuda/nightly/index.html, where we would detail the accepted ways to configure the cluster. Over time we will probably get more usage on MIG systems and then be able to reevaluate better ways to deal with that automatically.
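As a rough illustration of that idea (a hypothetical wrapper, not the code in this PR), the NVML error could be re-raised with the friendlier message:

```python
import pynvml


def get_device_handle(index):
    # Hypothetical wrapper around pynvml showing the friendlier error message.
    try:
        return pynvml.nvmlDeviceGetHandleByIndex(index)
    except pynvml.NVMLError as e:
        raise RuntimeError(
            f"NVML failed with '{e}'; this may be due to MIG instances on the "
            "system. Make sure you explicitly specify which GPUs are to be used "
            "with 'CUDA_VISIBLE_DEVICES'."
        ) from e
```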

  3. What if we have a deployment with both MIG-enabled and non-MIG-enabled GPUs?

I think we should leave this case for a follow-up discussion; could you file an issue for that?

  4. dask.distributed diagnostics will also fail if we run on MIG-enabled GPUs, since at the moment it uses pynvml APIs for non-MIG-enabled GPUs only.

Also a separate discussion. Once this is in, we should file an issue summarizing the status of that and discuss how to better handle it.

@jacobtomlinson (Member) left a comment

We probably need to bump the minimum pynvml version to 11.0.0 as that was when MIG support was added.

@codecov-commenter commented Jul 20, 2021

Codecov Report

Merging #674 (f68f63a) into branch-21.10 (c04856e) will increase coverage by 1.58%.
The diff coverage is 63.15%.


@@               Coverage Diff                @@
##           branch-21.10     #674      +/-   ##
================================================
+ Coverage         87.83%   89.42%   +1.58%     
================================================
  Files                15       15              
  Lines              1652     1692      +40     
================================================
+ Hits               1451     1513      +62     
+ Misses              201      179      -22     
Impacted Files Coverage Δ
dask_cuda/utils.py 84.25% <63.15%> (-3.37%) ⬇️
dask_cuda/local_cuda_cluster.py 77.88% <0.00%> (-1.53%) ⬇️
dask_cuda/cuda_worker.py 77.64% <0.00%> (-0.67%) ⬇️
dask_cuda/explicit_comms/dataframe/shuffle.py 98.69% <0.00%> (+0.65%) ⬆️
dask_cuda/proxy_object.py 90.73% <0.00%> (+1.08%) ⬆️
dask_cuda/cli/dask_cuda_worker.py 97.14% <0.00%> (+1.42%) ⬆️
dask_cuda/device_host_file.py 71.77% <0.00%> (+1.61%) ⬆️
dask_cuda/initialize.py 92.59% <0.00%> (+3.70%) ⬆️
dask_cuda/proxify_device_objects.py 95.72% <0.00%> (+6.83%) ⬆️
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@akaanirban changed the title from [WIP] Support for LocalCUDACluster with MIG to [REVIEW] Support for LocalCUDACluster with MIG Jul 27, 2021
@pentschev added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label Jul 27, 2021
@pentschev changed the base branch from branch-21.08 to branch-21.10 July 30, 2021 21:10
@pentschev (Member) left a comment

@akaanirban apologies for taking so long to review this. I was able to test this as well and everything seems to work as expected. Thanks for the excellent work in this PR, particularly the clever solutions for the tests. I've left a few minor suggestions/requests that are mostly cosmetic, with one or two exceptions, but otherwise it looks really great!

        miguuids.append(mighandle)
    except pynvml.NVMLError:
        pass
    assert len(miguuids) <= 7
Member

Why this magical number 7? What if there are 8 or more?

Member

A more accurate assertion would probably be assert len(miguuids) == count.

Contributor Author

I added 7 because you can only partition a single MIG-enabled GPU into at most 7 independent MIG instances, though there may be fewer in reality. nvmlDeviceGetMaxMigDeviceCount gives us the maximum number of MIG devices/instances that can exist under a given parent NVML device, not the actual number currently present.
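To make the distinction concrete, a small sketch assuming plain pynvml calls (variable names are illustrative): the value returned by nvmlDeviceGetMaxMigDeviceCount is only an upper bound, and the instances actually present have to be probed one by one:

```python
import pynvml

pynvml.nvmlInit()
parent = pynvml.nvmlDeviceGetHandleByIndex(0)

# Upper bound on MIG instances under this GPU (7 on current hardware).
max_migs = pynvml.nvmlDeviceGetMaxMigDeviceCount(parent)

# Count how many MIG instances are actually configured right now.
present = 0
for j in range(max_migs):
    try:
        pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, j)
        present += 1
    except pynvml.NVMLError:
        pass  # no MIG instance at this index

print(f"max MIG instances: {max_migs}, currently configured: {present}")
```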

Member

This is only true today; you never know whether it will still be the case for a future GPU, so we have to account for that instead of relying on what holds true today.

Contributor Author

That is true, though. In that case, the upper bound should indeed be count. Let me fix this.


if index and not str(index).isnumeric():
    # This means index is UUID. This works for both MIG and non-MIG device UUIDs.
    handle = pynvml.nvmlDeviceGetHandleByUUID(str.encode(str(index)))
Member

@rjzamora we discussed some time ago that NVML doesn't respect CUDA_VISIBLE_DEVICES, and while that's true, one of the most user-friendly ways to get a MIG device handle is by its UUID. It seems that users would prefer something simpler such as MIG-0-1 (meaning GPU 0, MIG instance 1, for example), instead of the UUID that looks like MIG-84fd49f2-48ad-50e8-9f2e-3bf0dfd47ccb. Do you think this is worth bringing up with the NVML (and potentially CUDA) teams?

akaanirban and others added 2 commits August 2, 2021 11:55
Added all suggestions except the `len(miguuids) <= 7`

Co-authored-by: Peter Andreas Entschev <[email protected]>
@pentschev changed the title from [REVIEW] Support for LocalCUDACluster with MIG to Support for LocalCUDACluster with MIG Aug 2, 2021
@pentschev (Member)

Thanks @akaanirban for addressing reviews and for all the work!

@pentschev (Member)

@gpucibot merge

@rapids-bot bot merged commit b3bda5d into rapidsai:branch-21.10 Aug 2, 2021
@jakirkham (Member)

Looks like the recipe requirement wasn't bumped. @akaanirban, can you please send a PR for that?

@akaanirban (Contributor, Author)

@jakirkham I should probably bump the recipe requirement to pynvml>=11.0.0, as per Jacob's comment. We already have pynvml>=11.0.0 in requirements.txt, though.

@jakirkham (Member)

Requirements were updated in PR ( #883 )

Labels: 3 - Ready for Review (Ready for review by team), feature request (New feature or request), non-breaking (Non-breaking change), python (python code needed)
5 participants