
Support for LocalCUDACluster with MIG #674

Merged
9 commits merged into rapidsai:branch-21.10 on Aug 2, 2021

Conversation

@akaanirban (Contributor) commented Jul 16, 2021

Adds support for starting LocalCUDACluster and CUDA workers on MIG instances by passing in the UUIDs of the MIG instances. Builds off of existing PR #671.
More specifically, this PR does the following:

  1. Allows starting LocalCUDACluster as cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=["MIG-uuid1", "MIG-uuid2", ...]), or by passing the UUIDs as a single comma-separated string (see the sketch below).
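For illustration, a minimal sketch of the usage described above; the MIG-uuid values are placeholders taken from the description, not real device UUIDs:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Placeholder MIG UUIDs; on a real system these come from `nvidia-smi -L` or pynvml.
mig_uuids = ["MIG-uuid1", "MIG-uuid2"]

# Pass the MIG instances either as a list of UUID strings...
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=mig_uuids)

# ...or as a single comma-separated string.
# cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="MIG-uuid1,MIG-uuid2")

client = Client(cluster)
```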

Needs Discussion:

  1. Apart from manually testing on a MIG instance on the cloud, how would we test this?
  2. What if the user does not pass in any argument to LocalCUDACluster while using MIG instances? By default LocalCUDACluster will try to use all the parent GPUs and run into an error.
  3. What if we have a deployment with both MIG-enabled and non-MIG-enabled GPUs?
  4. dask.distributed diagnostics will also fail if we run on MIG-enabled GPUs, since at the moment it uses pynvml APIs for non-MIG-enabled GPUs only.

@akaanirban requested a review from a team as a code owner July 16, 2021 16:52
@github-actions bot added the python (python code needed) label Jul 16, 2021
@pentschev added the 2 - In Progress (Currently a work in progress), feature request (New feature or request), and non-breaking (Non-breaking change) labels Jul 16, 2021
@pentschev (Member)

Thanks @akaanirban for opening this PR. I should add that this builds off of #671, so we will want that merged before this one; and, as we discussed offline, we'll want to have some tests here even if they don't run in CI.

@pentschev (Member)

As for the style errors, I suggest you pip install pre-commit and then run pre-commit install inside the dask-cuda directory; that will automatically fix the issues on new commits, or at least report what the flake8 issues are.

@pentschev (Member)

Also, I removed my GH handle from the description: if your handle is there after the PR is merged, it will forever plague you with mentions whenever other users trigger certain GH actions, such as forks and whatnot.

@pentschev (Member)

  1. Apart from manually testing on a MIG instance on the cloud, how would we test this?

As we discussed offline, we'll probably need something like the following:

  • Programmatically find whether there are MIG instances set up, and skip the tests if not;
  • If there are MIG instances, use all or some of them (whose UUIDs we'll also have to query programmatically) to set up a cluster and ensure they work; see the sketch below.
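A rough sketch of that test guard, assuming plain pynvml calls; the helper name get_mig_uuids and the test name are hypothetical, not part of dask-cuda:

```python
import pynvml
import pytest


def get_mig_uuids():
    """Hypothetical helper: UUIDs of the MIG instances currently present."""
    pynvml.nvmlInit()
    uuids = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        parent = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            max_migs = pynvml.nvmlDeviceGetMaxMigDeviceCount(parent)
        except pynvml.NVMLError:
            continue  # GPU does not support MIG
        for j in range(max_migs):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, j)
                uuids.append(pynvml.nvmlDeviceGetUUID(mig))
            except pynvml.NVMLError:
                pass  # no MIG instance configured at this index
    return uuids


MIG_UUIDS = get_mig_uuids()


@pytest.mark.skipif(not MIG_UUIDS, reason="No MIG instances configured")
def test_local_cuda_cluster_with_mig():
    # Use all (or a subset) of the discovered MIG UUIDs to start the cluster.
    ...
```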
  2. What if the user does not pass in any argument to LocalCUDACluster while using MIG instances? By default LocalCUDACluster will try to use all the parent GPUs and run into an error.

Also, as discussed offline:

I don't know if there's a proper solution to that. I think in a first pass what we'll want to do is something similar to what was suggested in #583 (comment), except that we would raise a more user-friendly error: instead of Insufficient Permissions we would write something like NVML failed with Insufficient Permissions, this may be due to MIG instances on the system, make sure you explicitly specify which GPUs are to be used with 'CUDA_VISIBLE_DEVICES'. We should probably also document this error and link to it somewhere in https://docs.rapids.ai/api/dask-cuda/nightly/index.html, where we would detail the accepted ways to configure the cluster. Over time we will probably get more usage on MIG systems and then be able to reevaluate better ways to deal with that automatically.
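As a rough illustration of that idea (a hypothetical wrapper, not the code in this PR), the NVML error could be re-raised with the friendlier message:

```python
import pynvml


def get_device_handle(index):
    # Hypothetical wrapper around pynvml showing the friendlier error message.
    try:
        return pynvml.nvmlDeviceGetHandleByIndex(index)
    except pynvml.NVMLError as e:
        raise RuntimeError(
            f"NVML failed with '{e}'; this may be due to MIG instances on the "
            "system. Make sure you explicitly specify which GPUs are to be used "
            "with 'CUDA_VISIBLE_DEVICES'."
        ) from e
```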

  3. What if we have a deployment with both MIG-enabled and non-MIG-enabled GPUs?

I think we should leave this case for a follow-up discussion; could you file an issue for that?

  4. dask.distributed diagnostics will also fail if we run on MIG-enabled GPUs, since at the moment it uses pynvml APIs for non-MIG-enabled GPUs only.

Also a separate discussion. Once this is in, we should file an issue summarizing the status of that and discuss how to better handle it.

@jacobtomlinson (Member) left a comment

We probably need to bump the minimum pynvml version to 11.0.0 as that was when MIG support was added.

@codecov-commenter commented Jul 20, 2021

Codecov Report

Merging #674 (f68f63a) into branch-21.10 (c04856e) will increase coverage by 1.58%.
The diff coverage is 63.15%.


@@               Coverage Diff                @@
##           branch-21.10     #674      +/-   ##
================================================
+ Coverage         87.83%   89.42%   +1.58%     
================================================
  Files                15       15              
  Lines              1652     1692      +40     
================================================
+ Hits               1451     1513      +62     
+ Misses              201      179      -22     
Impacted Files Coverage Δ
dask_cuda/utils.py 84.25% <63.15%> (-3.37%) ⬇️
dask_cuda/local_cuda_cluster.py 77.88% <0.00%> (-1.53%) ⬇️
dask_cuda/cuda_worker.py 77.64% <0.00%> (-0.67%) ⬇️
dask_cuda/explicit_comms/dataframe/shuffle.py 98.69% <0.00%> (+0.65%) ⬆️
dask_cuda/proxy_object.py 90.73% <0.00%> (+1.08%) ⬆️
dask_cuda/cli/dask_cuda_worker.py 97.14% <0.00%> (+1.42%) ⬆️
dask_cuda/device_host_file.py 71.77% <0.00%> (+1.61%) ⬆️
dask_cuda/initialize.py 92.59% <0.00%> (+3.70%) ⬆️
dask_cuda/proxify_device_objects.py 95.72% <0.00%> (+6.83%) ⬆️
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@akaanirban changed the title from [WIP] Support for LocalCUDACluster with MIG to [REVIEW] Support for LocalCUDACluster with MIG Jul 27, 2021
@pentschev added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label Jul 27, 2021
@pentschev changed the base branch from branch-21.08 to branch-21.10 July 30, 2021 21:10
@pentschev (Member) left a comment

@akaanirban apologies for taking so long to review this. I was able to test this as well and everything seems to work as expected. Thanks for the excellent work in this PR, particularly the clever solutions for the tests. I've left a few minor suggestions/requests that are mostly cosmetic, with one or two exceptions, but otherwise it looks really great!

        miguuids.append(mighandle)
    except pynvml.NVMLError:
        pass
    assert len(miguuids) <= 7
Member

Why this magical number 7? What if there are 8 or more?

Member

A more accurate assertion would probably be assert len(miguuids) == count.

Contributor Author

I added 7 because you can only partition a single MIG-enabled GPU into at most 7 independent MIG instances, though there may be fewer in reality. nvmlDeviceGetMaxMigDeviceCount gives us the maximum number of MIG devices/instances that can exist under a given parent NVML device, not the actual number currently present.
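To make the distinction concrete, a small sketch assuming plain pynvml calls (variable names are illustrative): the value returned by nvmlDeviceGetMaxMigDeviceCount is only an upper bound, and the instances actually present have to be probed one by one:

```python
import pynvml

pynvml.nvmlInit()
parent = pynvml.nvmlDeviceGetHandleByIndex(0)

# Upper bound on MIG instances under this GPU (7 on current hardware).
max_migs = pynvml.nvmlDeviceGetMaxMigDeviceCount(parent)

# Count how many MIG instances are actually configured right now.
present = 0
for j in range(max_migs):
    try:
        pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, j)
        present += 1
    except pynvml.NVMLError:
        pass  # no MIG instance at this index

print(f"max MIG instances: {max_migs}, currently configured: {present}")
```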

Member

This is only true today; you never know whether it will still be the case for a future GPU, so we have to account for that instead of relying on what holds true today.

Contributor Author

That is true, though. In that case, the upper bound should indeed be count. Let me fix this.


if index and not str(index).isnumeric():
    # This means index is UUID. This works for both MIG and non-MIG device UUIDs.
    handle = pynvml.nvmlDeviceGetHandleByUUID(str.encode(str(index)))
Member

@rjzamora we discussed some time ago that NVML doesn't respect CUDA_VISIBLE_DEVICES, and while that's true, one of the most user-friendly ways to get a MIG device handle is by its UUID. It seems that users would prefer something simpler such as MIG-0-1 (meaning GPU 0, MIG instance 1, for example), instead of the UUID that looks like MIG-84fd49f2-48ad-50e8-9f2e-3bf0dfd47ccb. Do you think this is worth bringing up with the NVML (and potentially CUDA) teams?

akaanirban and others added 2 commits August 2, 2021 11:55
Added all suggestions except the `len(miguuids) <= 7`

Co-authored-by: Peter Andreas Entschev <[email protected]>
@pentschev changed the title from [REVIEW] Support for LocalCUDACluster with MIG to Support for LocalCUDACluster with MIG Aug 2, 2021
@pentschev (Member)

Thanks @akaanirban for addressing reviews and for all the work!

@pentschev (Member)

@gpucibot merge

@rapids-bot bot merged commit b3bda5d into rapidsai:branch-21.10 Aug 2, 2021
@jakirkham (Member)

Looks like the recipe requirement wasn't bumped. @akaanirban, can you please send a PR for that?

@akaanirban (Contributor, Author)

@jakirkham I should probably bump the recipe requirement to pynvml>=11.0.0, as per Jacob's comment. We already have pynvml>=11.0.0 in requirements.txt, though.

@jakirkham (Member)

Requirements were updated in PR ( #883 )

Labels: 3 - Ready for Review (Ready for review by team), feature request (New feature or request), non-breaking (Non-breaking change), python (python code needed)
5 participants