
[REVIEW] Use num internal streams instead of creating cumlHandle's inside the C++ layer #1015

Merged

Conversation

@teju85 (Member) commented Aug 16, 2019

This PR is to showcase a possible solution for issue #931.
However, for this to happen, the constructor for `cumlHandle_impl` had to be updated to expose a num-streams parameter.

Tagging @cjnolet @JohnZed and @vishalmehta1991 for review.

@teju85 (Member, Author) commented Aug 16, 2019

Folks, the set of changes so far will for sure break the Python world. Will fix that soon.

Review thread on cpp/src/cuML.hpp (outdated, resolved)
@teju85 changed the title from "Use num internal streams instead of creating cumlHandle's inside the C++ layer" to "[REVIEW] Use num internal streams instead of creating cumlHandle's inside the C++ layer" Aug 16, 2019
Review thread on python/cuml/metrics/trustworthiness.pyx (outdated, resolved)
@@ -259,6 +259,8 @@ void foo(const ML::cumlHandle_impl& h, ...)
}
```

An example of how to use internal streams to schedule work on a single GPU can be found [here](https://github.com/rapidsai/cuml/pull/1015). This PR uses the internal streams inside `cumlHandle_impl` to schedule more work onto the GPU for Random Forest building.
Review comment (Member):

We should follow the format we've been using for the rest of the developer guide and provide the example in place. What do you think?

Reply (Member, Author):

Fair point. Done. Can you check now?

@cjnolet (Member) left a comment:

Looks great overall. A couple of small comments and one nitpick about the Developer Guide link.

@teju85 (Member, Author) commented Aug 19, 2019

I think I have addressed all the review comments. @vishalmehta1991 and @cjnolet please check now.

Also, @vishalmehta1991 had concerns about conflicts between this PR and his PR #961. We should discuss how to resolve this before merging either of the two.

@teju85 (Member, Author) commented Aug 19, 2019

IMO, it is better to take the changes in #961 first; I will then resolve the conflicts that arise with the current PR.

@dantegd added the `0 - Blocked` (Cannot progress due to external reasons), `CUDA / C++` (CUDA issue), and `3 - Ready for Review` (Ready for review by team) labels Aug 19, 2019
@dantegd (Member) commented Aug 19, 2019

@teju85 added the Blocked label just to reflect that this PR is waiting on #961

cjnolet previously approved these changes Sep 4, 2019
@cjnolet dismissed their stale review September 4, 2019 23:40 with the message: "Providing message"

@cjnolet (Member) left a comment:

LGTM. @vishalmehta1991's PR has been merged so we should be good to go with this, once the conflicts and the dask-cuda issues are resolved.

@teju85 (Member, Author) commented Sep 5, 2019

Having the same issue as the other PR, #823.

@teju85 (Member, Author) commented Sep 6, 2019

rerun tests

@teju85 (Member, Author) commented Sep 16, 2019

JFYI, @vishalmehta1991 has requested to hold off on merging this PR until PR #1087 gets through.

@teju85 (Member, Author) commented Sep 16, 2019

@dantegd any ideas why I get the following error in CI?
`E ImportError: cannot import name 'TOTAL_MEMORY' from 'distributed.worker' (/conda/envs/gdf/lib/python3.7/site-packages/distributed/worker.py)`

@cjnolet (Member) left a comment:

Did a more thorough review of the changes to the developer guide and have a few notes.

@@ -6,6 +6,7 @@ Please start by reading [CONTRIBUTING.md](../../CONTRIBUTING.md).

## Performance
1. In performance critical sections of the code, favor `cudaDeviceGetAttribute` over `cudaDeviceGetProperties`. See PR [#973](https://github.com/rapidsai/cuml/pull/973) for more details.
2. If an algo requires you to launch GPU work in multiple CUDA streams, do not create multiple `cumlHandle` objects, one for each such work stream. Instead, expose an `n_streams` parameter in that algo's cuML C++ interface and then rely on `cumlHandle_impl::getInternalStream()` to pick up the right CUDA stream. See PR [#1015](https://github.com/rapidsai/cuml/pull/1015) and also the section on [CUDA Resources](#cuda-resources) for more details. TIP: use `cumlHandle_impl::getNumInternalStreams()` to know how many such streams are at your disposal.
Review comment (Member):

I'm not sure how I missed this. I'd prefer not to point users to pull requests in the developer guide as it's not straightforward and can quickly get out of date as code is updated.

Reply (Member, Author):

What do you recommend instead?

Reply (Member):

IMO, the link to CUDA Resources and the TIP are good enough. Maybe we could also link to the example in the threading section. What do you think?

Reply (Member, Author):

Done. How about now?

Review thread on wiki/cpp/DEVELOPER_GUIDE.md (resolved)
@cjnolet (Member) left a comment:

LGTM. I'll add an in-place example of the new internal streams API in the updates to the threading section of the developer guide.

@teju85 (Member, Author) commented Sep 20, 2019

rerun tests

@teju85 (Member, Author) commented Sep 23, 2019

rerun tests

@teju85 (Member, Author) commented Sep 25, 2019

@dantegd @cjnolet Any suggestions on how to fix this CI error? `E TypeError: no default __reduce__ due to non-trivial __cinit__`

I need a non-default ctor in handle.pyx so that Python users can specify the number of streams to be created inside `cumlHandle`.

@cjnolet (Member) commented Sep 26, 2019

@teju85, I see what's going on here. The problem is not that you have a non-default `__cinit__()`; it's that the Dask RF code is trying to pickle the handle to send it to the workers, and pickling a Cython class with a non-default `__cinit__()` requires a non-default `__reduce__()` (because there's a well-defined separation between Cython variables, which aren't natively picklable, and Python objects, which are).

Here's the code that's giving the problem:

        if handle is None:
            handle = cuml.Handle(n_streams)

        self.rfs = {
            worker: c.submit(
                RandomForestClassifier._func_build_rf,
                n,
                self.n_estimators_per_worker[n],
                max_depth,
                handle,
                max_features,
                n_bins,
                split_algo,
                split_criterion,
                min_rows_per_node,
                bootstrap,
                bootstrap_features,
                type_model,
                verbose,
                rows_sample,
                max_leaves,
                n_streams,
                quantile_per_tree,
                dtype,
                random.random(),
                workers=[worker],
            )
            for n, worker in enumerate(workers)
        }

The fix is to pass `n_streams` to the workers and have `RandomForestClassifier._func_build_rf` create the handle locally on each worker. I'll fix this for you and push, since we're strapped for time in 0.10.
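
A minimal sketch of that fix, purely for illustration. The simplified `_func_build_rf` signature and the keyword arguments passed to the single-GPU `RandomForestClassifier` are assumptions for readability, not the actual Dask RF code; only `cuml.Handle(n_streams)` is taken from the snippet above.

```python
import cuml
from cuml.ensemble import RandomForestClassifier as cumlRFC


def _func_build_rf(n, n_estimators, n_streams, **rf_params):
    # Build the cumlHandle locally on the worker: only the plain integer
    # n_streams travels over the wire, so nothing unpicklable is shipped.
    handle = cuml.Handle(n_streams)
    return cumlRFC(n_estimators=n_estimators, handle=handle,
                   n_streams=n_streams, **rf_params)


# Client side: submit n_streams instead of a pre-built handle, e.g.
# c.submit(_func_build_rf, n, n_estimators_per_worker[n], n_streams,
#          workers=[worker])
```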

@cjnolet (Member) commented Sep 26, 2019

If we want to enable sharing a `cumlHandle` on the workers across different runs of algorithms (which will get tricky once NCCL is used in the comms), we will want to cache the handle on the workers (look into `CommsContext` to see how I'm doing this). The problem is, this won't be thread-safe, so it might be worth caching based on the ID of the thread, and perhaps using some sort of LRU strategy.
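
A rough sketch of how such worker-local caching could look (entirely hypothetical; the class, the keying by thread ID, and the LRU bound are design assumptions, not code from this PR):

```python
import threading
from collections import OrderedDict

import cuml


class HandleCache:
    """Worker-local cache of cuml.Handle objects, keyed by thread id and
    bounded by an LRU eviction policy."""

    def __init__(self, n_streams=4, max_size=8):
        self.n_streams = n_streams
        self.max_size = max_size
        self._handles = OrderedDict()
        self._lock = threading.Lock()

    def get(self):
        key = threading.get_ident()
        with self._lock:
            if key in self._handles:
                # Mark this thread's handle as most recently used.
                self._handles.move_to_end(key)
            else:
                self._handles[key] = cuml.Handle(self.n_streams)
                if len(self._handles) > self.max_size:
                    # Evict the least recently used handle.
                    self._handles.popitem(last=False)
            return self._handles[key]
```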

@cjnolet merged commit 1eabc38 into rapidsai:branch-0.10 Sep 26, 2019
@teju85 (Member, Author) commented Sep 27, 2019

Awesome. Thanks @cjnolet for finally getting this PR across the finish line!

@teju85 deleted the fea-ext-expose-num-internal-streams branch October 1, 2019
jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Mar 30, 2023