
Small tweaks to make compatible with dask-mpi #656

Merged: 2 commits merged into rapidsai:branch-21.08 on Jun 28, 2021

Conversation

jacobtomlinson (Member)

While working on GPU support in dask-mpi I ran into a few issues with CUDAWorker. This PR makes a few tweaks to ensure things are compatible.

  1. A valid option for local_directory is an empty string, representing the current directory, so I added a check before trying to create the directory.

  2. Integers are valid options for name, so I added an explicit cast to string when concatenating with the GPU index (points 1 and 2 are sketched in the snippet after this list).

  3. dask-mpi uses async with when creating Worker and Nanny objects. This broke with CUDAWorker because __aenter__ and __aexit__ were not implemented. Looking at distributed, those methods are implemented in the base Server class, so I've updated CUDAWorker here to also use Server as a base.

    I was surprised this wasn't the case already, but perhaps that was an intentional decision? If so it would be quick to implement the async context manager methods directly here instead.
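
For illustration, here is a minimal sketch of the guards described in points 1 and 2. The helper name, its arguments, and the name/GPU-index concatenation are hypothetical stand-ins, not the actual dask-cuda code:

```python
import os
from typing import Optional


def _worker_setup_sketch(local_directory: str, name, gpu_index: int) -> Optional[str]:
    """Hypothetical helper illustrating points 1 and 2 above."""
    # Point 1: an empty string is a valid local_directory and means "use the
    # current directory", so only create a directory when a non-empty path
    # was given.
    if local_directory and not os.path.exists(local_directory):
        os.makedirs(local_directory, exist_ok=True)

    # Point 2: name may legitimately be an integer, so cast it to str
    # explicitly before concatenating it with the GPU index.
    return None if name is None else str(name) + "-" + str(gpu_index)
```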

@jacobtomlinson jacobtomlinson requested a review from a team as a code owner June 16, 2021 13:32
@github-actions github-actions bot added the "python" label (python code needed) on Jun 16, 2021
@jacobtomlinson jacobtomlinson added the "3 - Ready for Review", "bug", and "non-breaking" labels on Jun 16, 2021
madsbk (Member) left a comment

LGTM @jacobtomlinson, thanks.

FYI: I am working on the CI error: #658

madsbk (Member) commented Jun 21, 2021

> 3. I was surprised this wasn't the case already, but perhaps that was an intentional decision? If so it would be quick to implement the async context manager methods directly here instead.

Yeah, that is strange. I don't think there is any reason for that. @pentschev or @quasiben do you know if it is deliberate that CUDAWorker doesn't inherit from distributed.Server?

pentschev (Member)

> Yeah, that is strange. I don't think there is any reason for that. @pentschev or @quasiben do you know if it is deliberate that CUDAWorker doesn't inherit from distributed.Server?

I think there's no actual reason for that other than CUDAWorker not currently being a class that tries to reproduce the behavior of Nanny or Worker; it simply instantiates Nanny. Originally, I think there was only the dask-cuda-worker CLI, and @jacobtomlinson wrote CUDAWorker based on that and changed dask-cuda-worker to call CUDAWorker instead. IIRC, the reason for that was to ensure we could use it with SSHCluster. With that said, I'm not sure whether there's a reason we can't/shouldn't be doing it.

> 3. Looking at distributed those methods are implemented in the base Server class, so I've updated CUDAWorker here to also use Server as a base.

I haven't looked at it in detail to figure out whether there's a difference, but it seems that the inheritance is from ServerNode and not Server in both Nanny and Worker. Are Server and ServerNode the same?

jacobtomlinson (Member, Author)

> @jacobtomlinson wrote CUDAWorker based on that and changed dask-cuda-worker to call CUDAWorker instead

Heh, it's probably my fault that it doesn't inherit from anything.

> Are Server and ServerNode the same?

ServerNode adds some extra methods for starting up web-server-based services like the dashboard, but, as you say, we don't need any of that here. My main requirement is that CUDAWorker behaves like a Worker from an asyncio point of view, which inheriting from Server should cover.
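
As a usage illustration (not code from this PR), something like the following should work once CUDAWorker picks up the async context manager support from Server; the scheduler address is a placeholder and assumes a scheduler is already running there:

```python
import asyncio

from dask_cuda import CUDAWorker


async def main():
    # With Server as a base class, CUDAWorker gains __aenter__/__aexit__,
    # so it can be driven the same way dask-mpi drives Worker and Nanny:
    # entering the block starts the worker, leaving it closes the worker.
    async with CUDAWorker("tcp://127.0.0.1:8786") as worker:
        ...  # the worker is running here


asyncio.run(main())
```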

pentschev (Member)

rerun tests

codecov-commenter commented Jun 28, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@f3220a4).
The diff coverage is n/a.

❗ Current head 21d9f44 differs from pull request most recent head 7377716. Consider uploading reports for the commit 7377716 to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.08     #656   +/-   ##
===============================================
  Coverage                ?   90.32%           
===============================================
  Files                   ?       15           
  Lines                   ?     1644           
  Branches                ?        0           
===============================================
  Hits                    ?     1485           
  Misses                  ?      159           
  Partials                ?        0           

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f3220a4...7377716.

pentschev (Member) left a comment

It seems that all tests now pass, but the old CUDA 11.0 build is not running anymore, leaving CI in a failed state. We will probably need @rapidsai/ops-codeowners to force-merge this.

ajschmidt8 (Member)

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 939bc26 into rapidsai:branch-21.08 Jun 28, 2021
jacobtomlinson (Member, Author)

Thanks @ajschmidt8

@jacobtomlinson jacobtomlinson deleted the dask-mpi-fixes branch June 28, 2021 14:49