
[REVIEW] add option for per-thread default stream #354

Merged
merged 7 commits into rapidsai:branch-0.14 on May 1, 2020

Conversation

rongou
Contributor

@rongou rongou commented Apr 22, 2020

This change adds an option to build RMM with the per-thread default stream enabled (the equivalent of nvcc's --default-stream per-thread).

Since RMM doesn't use nvcc for non-test code, this is actually done by passing
-DCUDA_API_PER_THREAD_DEFAULT_STREAM to gcc.

This is the alternative solution to #352.

By default the option is disabled. To enable it:

cmake .. -DPER_THREAD_DEFAULT_STREAM=ON ...
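
For context, here's a minimal sketch (the file name and buffer size are illustrative, not part of this PR) of what defining CUDA_API_PER_THREAD_DEFAULT_STREAM changes for host code compiled without nvcc: stream 0 is interpreted as cudaStreamPerThread rather than the legacy global default stream, so default-stream work issued from different host threads can overlap.

```cpp
// ptds_demo.cpp -- compile with:
//   g++ -DCUDA_API_PER_THREAD_DEFAULT_STREAM ptds_demo.cpp -lcudart
// Without the define, stream 0 below is the legacy default stream shared by
// all threads; with it, each host thread gets its own default stream.
#include <cuda_runtime.h>

int main() {
  void* p = nullptr;
  cudaMalloc(&p, 1 << 20);
  // Issued on this thread's default stream (cudaStreamPerThread when the
  // macro is defined, the legacy stream otherwise).
  cudaMemsetAsync(p, 0, 1 << 20, 0);
  // Synchronizes only this thread's default stream under PTDS.
  cudaStreamSynchronize(0);
  cudaFree(p);
  return 0;
}
```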

I've tested this manually on some Spark jobs, which use the CNMeM memory resource. CNMeM should work with PTDS based on this line: https://github.com/NVIDIA/cnmem/blob/master/src/cnmem.cpp#L386.

The corresponding cuDF change is rapidsai/cudf#4995.

@harrism @jrhemstad @revans2 @jlowe

@rongou rongou requested a review from a team as a code owner April 22, 2020 23:34
@GPUtester
Contributor

Can one of the admins verify this patch?

@revans2
Contributor

revans2 commented Apr 23, 2020

Is this going to work? cnmem plays games with grouping allocations per stream and only synchronizes streams when something moves between the pools. With per-thread default streams, will cnmem still be able to maintain consistency when it thinks everything is on stream 0 even though it is not?

@rongou
Contributor Author

rongou commented Apr 23, 2020

@harrism
Member

harrism commented Apr 23, 2020

Wow, I never noticed that... Interesting. But do you really want to synchronize on EVERY allocation?

@rongou
Contributor Author

rongou commented Apr 23, 2020

The current CNMeM implementation is definitely not optimized for PTDS. Once we have it enabled, we can probably try to improve it. Also the current algorithm may not be the best at reducing fragmentation. We might want to take a page from jemalloc (http://jemalloc.net/).

@harrism
Member

harrism commented Apr 23, 2020

I think we'd prefer not to put energy into cnmem. The plan is to keep improving RMM's device_memory_resource classes and remove cnmem. E.g. pool_memory_resource is already better than cnmem (except it doesn't have the hack to enable PTDS).

jemalloc has a lot of pages. Which one are you referring to?

@revans2
Contributor

revans2 commented Apr 23, 2020

I think we can make PTDS work if we use events and event synchronization instead of stream synchronization in the allocator.

The issue we are trying to protect against is:

  1. Stream A allocates some memory.
  2. Stream A launches an async kernel to put a result in that memory.
  3. Stream A launches another async op to copy that data out of that memory.
  4. Stream A frees the memory.
  5. Stream B allocates some memory (and is handed back part of A's allocation).
  6. Stream B writes something into that memory.
  7. Stream A's operations complete and the data is corrupted.

CUDA protects against this by essentially synchronizing on the device when a free happens, before the memory is allocated again.

If we know the memory was intended to be used on stream A, then when we free it we can insert an event into stream A and not hand that memory to another stream until the event has completed.

The hack for PTDS would be to treat all allocations as being on different streams, because we just don't know. With Mark's pooling allocator this opens things up: we can walk the free list and poll for a block whose event has completed instead of blocking, and only block if none are ready. This should make the common case very fast, even with PTDS.
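
To make that concrete, here is a rough sketch (illustrative names and data structures, not RMM's actual interface) of recording an event on the freeing stream at deallocation time and polling it before handing the block to another thread:

```cpp
// Illustrative sketch of the event-per-freed-block idea (not RMM's API).
#include <cuda_runtime.h>
#include <cstddef>
#include <deque>

struct free_block {
  void*       ptr;
  std::size_t size;
  cudaEvent_t ready;  // recorded on the freeing stream at deallocation time
};

static std::deque<free_block> free_list;  // assume the caller holds a mutex

// Free: enqueue an event on the stream that last used the block, so we know
// when outstanding work on that memory has finished.
void deallocate(void* ptr, std::size_t size, cudaStream_t stream) {
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
  cudaEventRecord(ev, stream);
  free_list.push_back({ptr, size, ev});
}

// Allocate: poll for a block whose event has completed instead of blocking;
// the caller only blocks (or grows the pool) if nothing is ready.
void* try_reuse(std::size_t size) {
  for (auto it = free_list.begin(); it != free_list.end(); ++it) {
    if (it->size < size) continue;
    if (cudaEventQuery(it->ready) == cudaSuccess) {  // non-blocking check
      void* p = it->ptr;
      cudaEventDestroy(it->ready);
      free_list.erase(it);
      return p;
    }
  }
  return nullptr;  // nothing ready: block on an event or allocate fresh memory
}
```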

@jrhemstad
Contributor

I think we can make PTDS work if we use events and event synchronization instead of stream synchronization in the allocator.

Yep, this is what @harrism and I have talked about. You create and enqueue an event every time a block of memory is freed and associate it with that block. Then you need to wait on that event before reclaiming it.

There are some additional warts: any time you want to coalesce a block with other blocks, you need to wait on their events, which introduces more synchronization and can cause freeing to no longer be asynchronous.

It's certainly possible, just haven't done it yet.
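
A small sketch of that coalescing wart, using an illustrative block/event representation (not actual RMM code):

```cpp
// Merging two adjacent free blocks is only safe once BOTH of their recorded
// events have completed, so a free that triggers coalescing may block.
#include <cuda_runtime.h>
#include <cstddef>

struct tracked_block {
  void*       ptr;
  std::size_t size;
  cudaEvent_t ready;  // recorded on the stream that last touched the block
};

tracked_block coalesce(tracked_block left, tracked_block right) {
  cudaEventSynchronize(left.ready);   // host blocks here if work is pending
  cudaEventSynchronize(right.ready);
  cudaEventDestroy(right.ready);
  // Both events are complete, so the merged block is immediately reusable.
  return {left.ptr, left.size + right.size, left.ready};
}
```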

@leofang
Member

leofang commented Apr 23, 2020

Just a drive-by question, perhaps @jakirkham has given some thoughts: what happens if rmm is PTDS-enabled but other libraries aren't? This could happen in the Python world when, say, coupling rmm to CuPy.

@kkraus14
Contributor

Just a drive-by question, perhaps @jakirkham has given some thoughts: what happens if rmm is PTDS-enabled but other libraries aren't? This could happen in the Python world when, say, coupling rmm to CuPy.

We haven't crossed that road quite yet, but I imagine it will cause issues 😅

@rongou
Contributor Author

rongou commented Apr 23, 2020

I'm not sure how expensive it is to create and destroy events. Maybe we need an event pool. :)

@jrhemstad
Contributor

I'm not sure how expensive it is to create and destroy events. Maybe we need an event pool. :)

Creating and enqueuing events is "free". Waiting on them is not since it's a synchronization.

@jakirkham
Member

Thanks Leo! Yeah, I'm generally aware this is happening, but I don't think we are planning on using this for Python yet (as Keith said).

@rongou rongou changed the title add option for per-thread default stream [REVIEW] add option for per-thread default stream Apr 24, 2020
@rongou
Contributor Author

rongou commented Apr 24, 2020

@harrism I've added the option to tests and benchmarks. Please take another look. This PR is pretty innocuous. Do you think we can get it merged before attempting more sophisticated PTDS support?

As for jemalloc, a couple of things we can borrow are per-thread arenas that allocate small blocks so they don't have to lock, and perhaps first-fit for larger blocks.
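
For illustration only, a sketch of the jemalloc-style per-thread arena idea (all names and the size threshold are hypothetical, and the shared pool is stood in by operator new): small requests are served from a thread_local cache without taking a lock, while larger ones fall through to a shared, locked path.

```cpp
#include <cstddef>
#include <mutex>
#include <new>
#include <utility>
#include <vector>

constexpr std::size_t small_threshold = 1 << 16;  // 64 KiB, arbitrary cutoff

std::mutex pool_mutex;

// Stand-in for the shared (locked) pool; a real version would carve device
// memory out of a large pre-allocated region.
void* allocate_from_shared_pool(std::size_t size) {
  std::lock_guard<std::mutex> lock(pool_mutex);
  return ::operator new(size);
}

// Thread-private cache of small blocks: first-fit search, no locking.
struct thread_arena {
  std::vector<std::pair<void*, std::size_t>> free_small;

  void* allocate(std::size_t size) {
    for (auto it = free_small.begin(); it != free_small.end(); ++it) {
      if (it->second >= size) {
        void* p = it->first;
        free_small.erase(it);
        return p;
      }
    }
    return allocate_from_shared_pool(size);  // refill path takes the lock
  }

  void deallocate(void* p, std::size_t size) { free_small.emplace_back(p, size); }
};

void* allocate(std::size_t size) {
  thread_local thread_arena arena;  // one arena per thread
  if (size <= small_threshold) { return arena.allocate(size); }
  return allocate_from_shared_pool(size);
}
```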

@jakirkham
Member

Does CNMeM have any global state? If not, would it be possible to just use a different CNMeM pool per thread?

@rongou
Contributor Author

rongou commented Apr 25, 2020

Yes, if 1/n of the memory is large enough. Stealing memory from other threads becomes tricky, though.

@jakirkham
Member

Does the pointer to a memory allocation remain the same if it crosses threads when PTDS is enabled? Or does PTDS affect how memory is addressed per thread?

@revans2
Contributor

revans2 commented Apr 25, 2020

Allocating 1/n of the memory per thread is not acceptable. First of all, there is data skew, and we don't want to hard-partition the memory like that. Java uses a huge number of threads. For Spark in particular, we schedule more threads than can fit on the GPU so that we can overlap I/O on the CPU with computation on the GPU. It would be a massive undertaking to try to shoehorn Spark's threading model into this scheme.

@harrism
Member

harrism commented Apr 27, 2020

Please add a changelog entry if you want CI to run. Otherwise we'll get nowhere. :)

@jlowe
Contributor

jlowe commented Apr 27, 2020

Does the pointer to a memory allocation remain the same if it crosses threads when PTDS is enabled?

Yes, PTDS does not create a separate address space per thread, just a separate, asynchronous CUDA stream per thread. Device address mappings remain the same for the entire process.

However, exchanging device memory addresses between threads will require application-level event or stream synchronization to be safe, since separate threads will no longer be issuing to the same stream as they do today without PTDS.
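
A minimal sketch of that application-level synchronization (the function names and event setup are illustrative): the producing thread records an event on its per-thread default stream, and the consuming thread makes its own default stream wait on that event before touching the pointer.

```cpp
// Compile with -DCUDA_API_PER_THREAD_DEFAULT_STREAM so stream 0 below means
// "this thread's default stream".
#include <cuda_runtime.h>

cudaEvent_t handoff_ready;  // shared by both threads

void init() { cudaEventCreate(&handoff_ready); }

// Thread A: produce data on its default stream, then record the event.
void producer(float* d_out) {
  // ... kernels writing d_out launched on stream 0 (thread A's default stream) ...
  cudaEventRecord(handoff_ready, 0);
}

// Thread B: make its default stream wait for A's work; no host blocking.
void consumer(float* d_in) {
  cudaStreamWaitEvent(0, handoff_ready, 0);
  // ... kernels reading d_in launched on stream 0 (thread B's default stream) ...
}
```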

@rongou
Contributor Author

rongou commented Apr 27, 2020

@harrism I added this PR to the changelog, but it looks like CI is still not running. Do you need to whitelist it?

@jrhemstad
Contributor

add to whitelist

@kkraus14
Contributor

rerun tests

@kkraus14
Contributor

add to whitelist

@harrism harrism merged commit fdb364a into rapidsai:branch-0.14 May 1, 2020
@rongou rongou deleted the per-thread-default-stream branch May 21, 2020 21:10