
[REVIEW] Use num internal streams instead of creating cumlHandle's inside the C++ layer #1015

Merged

Conversation

@teju85 (Member) commented Aug 16, 2019

This PR is to showcase a possible solution for issue #931.
However, for this to happen, the constructor for `cumlHandle_impl` had to be updated to expose a num-streams parameter.

Tagging @cjnolet @JohnZed and @vishalmehta1991 for review.

@teju85 (Member, Author) commented Aug 16, 2019

Folks, the set of changes so far will for sure break the Python world. Will fix that soon.

Review thread on cpp/src/cuML.hpp (outdated, resolved)
@teju85 changed the title from "Use num internal streams instead of creating cumlHandle's inside the C++ layer" to "[REVIEW] Use num internal streams instead of creating cumlHandle's inside the C++ layer" Aug 16, 2019
Review thread on python/cuml/metrics/trustworthiness.pyx (outdated, resolved)
@@ -259,6 +259,8 @@ void foo(const ML::cumlHandle_impl& h, ...)
}
```

An example of how to use internal streams to schedule work on a single GPU can be found [here](https://github.com/rapidsai/cuml/pull/1015). This PR uses the internal streams inside `cumlHandle_impl` to schedule more work onto the GPU for Random Forest building.
Review comment (Member):

We should follow the format we've been using for the rest of the developer guide and provide the example in place. What do you think?

Reply (Member, Author):

Fair point. Done. Can you check now?

@cjnolet (Member) left a comment:

Looks great overall. A couple of small comments and one nitpick about the Developer Guide link.

@teju85 (Member, Author) commented Aug 19, 2019

I think I have addressed all the review comments. @vishalmehta1991 and @cjnolet please check now.

Also, @vishalmehta1991 had concerns about conflicts between this PR and his PR #961. We should discuss how to resolve this before merging either of the two.

@teju85 (Member, Author) commented Aug 19, 2019

IMO, it is better to take the changes in #961 first; I will then resolve the conflicts that arise with the current PR.

@dantegd added the `0 - Blocked` (Cannot progress due to external reasons), `CUDA / C++` (CUDA issue), and `3 - Ready for Review` (Ready for review by team) labels Aug 19, 2019
@dantegd (Member) commented Aug 19, 2019

@teju85 added the Blocked label just to reflect that this PR is waiting on #961

cjnolet previously approved these changes Sep 4, 2019
@cjnolet dismissed their stale review September 4, 2019 23:40 with the message: "Providing message"

@cjnolet (Member) left a comment:

LGTM. @vishalmehta1991's PR has been merged so we should be good to go with this, once the conflicts and the dask-cuda issues are resolved.

@teju85 (Member, Author) commented Sep 5, 2019

Having the same issue as the other PR, #823.

@teju85 (Member, Author) commented Sep 6, 2019

rerun tests

@teju85 (Member, Author) commented Sep 16, 2019

JFYI, @vishalmehta1991 has requested to hold off on merging this PR until PR #1087 gets through.

@teju85 (Member, Author) commented Sep 16, 2019

@dantegd any ideas why I get the following error in CI?
`E ImportError: cannot import name 'TOTAL_MEMORY' from 'distributed.worker' (/conda/envs/gdf/lib/python3.7/site-packages/distributed/worker.py)`

@cjnolet (Member) left a comment:

Did a more thorough review of the changes to the developer guide and have a few notes.

@@ -6,6 +6,7 @@ Please start by reading [CONTRIBUTING.md](../../CONTRIBUTING.md).

## Performance
1. In performance critical sections of the code, favor `cudaDeviceGetAttribute` over `cudaDeviceGetProperties`. See PR [#973](https://github.com/rapidsai/cuml/pull/973) for more details.
2. If an algo requires you to launch GPU work in multiple CUDA streams, do not create multiple `cumlHandle` objects, one for each such work stream. Instead, expose an `n_streams` parameter in that algo's cuML C++ interface and then rely on `cumlHandle_impl::getInternalStream()` to pick up the right CUDA stream. See PR [#1015](https://github.com/rapidsai/cuml/pull/1015) and also the section on [CUDA Resources](#cuda-resources) for more details. TIP: use `cumlHandle_impl::getNumInternalStreams()` to know how many such streams are at your disposal.
Review comment (Member):

I'm not sure how I missed this. I'd prefer not to point users to pull requests in the developer guide as it's not straightforward and can quickly get out of date as code is updated.

Reply (Member, Author):

What do you recommend instead?

Reply (Member):

IMO, the link to CUDA Resources and the TIP are good enough. Maybe we could also link to the example in the threading section. What do you think?

Reply (Member, Author):

Done. How about now?

Review thread on wiki/cpp/DEVELOPER_GUIDE.md (resolved)
@cjnolet (Member) left a comment:

LGTM. I'll add an in-place example of the new internal streams API in the updates to the threading section of the developer guide.

@teju85 (Member, Author) commented Sep 20, 2019

rerun tests

@teju85 (Member, Author) commented Sep 23, 2019

rerun tests

@teju85 (Member, Author) commented Sep 25, 2019

@dantegd @cjnolet Any suggestions on how to fix this CI error? `E TypeError: no default __reduce__ due to non-trivial __cinit__`

I need a non-default ctor in handle.pyx so that Python users can specify the number of streams to be created inside `cumlHandle`.

@cjnolet (Member) commented Sep 26, 2019

@teju85, I see what's going on here. The problem is not that you have a non-default `__cinit__()`; it's that the Dask RF code is trying to pickle the handle to send it to the workers, and pickling a Cython class with a non-default `__cinit__()` requires a non-default `__reduce__()` (because there's a well-defined separation between Cython variables, which aren't natively picklable, and Python objects, which are).

Here's the code that's giving the problem:

        if handle is None:
            handle = cuml.Handle(n_streams)

        self.rfs = {
            worker: c.submit(
                RandomForestClassifier._func_build_rf,
                n,
                self.n_estimators_per_worker[n],
                max_depth,
                handle,
                max_features,
                n_bins,
                split_algo,
                split_criterion,
                min_rows_per_node,
                bootstrap,
                bootstrap_features,
                type_model,
                verbose,
                rows_sample,
                max_leaves,
                n_streams,
                quantile_per_tree,
                dtype,
                random.random(),
                workers=[worker],
            )
            for n, worker in enumerate(workers)
        }

The fix is to pass `n_streams` to the workers and have `RandomForestClassifier._func_build_rf` create the handle locally on each worker. I'll fix this for you and push, since we're strapped for time in 0.10.
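
A minimal sketch of that fix, purely for illustration. The simplified `_func_build_rf` signature and the keyword arguments passed to the single-GPU `RandomForestClassifier` are assumptions for readability, not the actual Dask RF code; only `cuml.Handle(n_streams)` is taken from the snippet above.

```python
import cuml
from cuml.ensemble import RandomForestClassifier as cumlRFC


def _func_build_rf(n, n_estimators, n_streams, **rf_params):
    # Build the cumlHandle locally on the worker: only the plain integer
    # n_streams travels over the wire, so nothing unpicklable is shipped.
    handle = cuml.Handle(n_streams)
    return cumlRFC(n_estimators=n_estimators, handle=handle,
                   n_streams=n_streams, **rf_params)


# Client side: submit n_streams instead of a pre-built handle, e.g.
# c.submit(_func_build_rf, n, n_estimators_per_worker[n], n_streams,
#          workers=[worker])
```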

@cjnolet (Member) commented Sep 26, 2019

If we want to enable sharing a `cumlHandle` on the workers across different runs of algorithms (which will get tricky once NCCL is used in the comms), we will want to cache the handle on the workers (look into `CommsContext` to see how I'm doing this). The problem is, this won't be thread-safe, so it might be worth caching based on the ID of the thread, and perhaps using some sort of LRU strategy.
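
A rough sketch of how such worker-local caching could look (entirely hypothetical; the class, the keying by thread ID, and the LRU bound are design assumptions, not code from this PR):

```python
import threading
from collections import OrderedDict

import cuml


class HandleCache:
    """Worker-local cache of cuml.Handle objects, keyed by thread id and
    bounded by an LRU eviction policy."""

    def __init__(self, n_streams=4, max_size=8):
        self.n_streams = n_streams
        self.max_size = max_size
        self._handles = OrderedDict()
        self._lock = threading.Lock()

    def get(self):
        key = threading.get_ident()
        with self._lock:
            if key in self._handles:
                # Mark this thread's handle as most recently used.
                self._handles.move_to_end(key)
            else:
                self._handles[key] = cuml.Handle(self.n_streams)
                if len(self._handles) > self.max_size:
                    # Evict the least recently used handle.
                    self._handles.popitem(last=False)
            return self._handles[key]
```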

@cjnolet merged commit 1eabc38 into rapidsai:branch-0.10 Sep 26, 2019
@teju85 (Member, Author) commented Sep 27, 2019

Awesome. Thanks @cjnolet for finally getting this PR across the finish line!

@teju85 deleted the fea-ext-expose-num-internal-streams branch October 1, 2019
jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Mar 30, 2023