
[REVIEW] add option for per-thread default stream #354

Merged
merged 7 commits into rapidsai:branch-0.14 on May 1, 2020

Conversation

rongou
Contributor

@rongou rongou commented Apr 22, 2020

This change adds an option to build RMM with the per-thread default stream enabled (the equivalent of nvcc's --default-stream per-thread).

Since RMM doesn't use nvcc for non-test code, this is actually done by passing
-DCUDA_API_PER_THREAD_DEFAULT_STREAM to gcc.

This is the alternative solution to #352.

By default the option is disabled. To enable it:

cmake .. -DPER_THREAD_DEFAULT_STREAM=ON ...
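
For context, here's a minimal sketch (the file name and buffer size are illustrative, not part of this PR) of what defining CUDA_API_PER_THREAD_DEFAULT_STREAM changes for host code compiled without nvcc: stream 0 is interpreted as cudaStreamPerThread rather than the legacy global default stream, so default-stream work issued from different host threads can overlap.

```cpp
// ptds_demo.cpp -- compile with:
//   g++ -DCUDA_API_PER_THREAD_DEFAULT_STREAM ptds_demo.cpp -lcudart
// Without the define, stream 0 below is the legacy default stream shared by
// all threads; with it, each host thread gets its own default stream.
#include <cuda_runtime.h>

int main() {
  void* p = nullptr;
  cudaMalloc(&p, 1 << 20);
  // Issued on this thread's default stream (cudaStreamPerThread when the
  // macro is defined, the legacy stream otherwise).
  cudaMemsetAsync(p, 0, 1 << 20, 0);
  // Synchronizes only this thread's default stream under PTDS.
  cudaStreamSynchronize(0);
  cudaFree(p);
  return 0;
}
```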

I've tested this manually on some Spark jobs, which use the CNMeM memory resource. CNMeM should work with PTDS based on this line: https://github.com/NVIDIA/cnmem/blob/master/src/cnmem.cpp#L386.

The corresponding cuDF change is rapidsai/cudf#4995.

@harrism @jrhemstad @revans2 @jlowe

@rongou rongou requested a review from a team as a code owner April 22, 2020 23:34
@GPUtester
Contributor

Can one of the admins verify this patch?

@revans2
Contributor

revans2 commented Apr 23, 2020

Is this going to work? cnmem plays games with grouping allocations per stream and only synchronizes streams when something moves between the pools. With per-thread default streams, will cnmem still be able to maintain consistency when it thinks everything is on stream 0 even though it is not?

@rongou
Contributor Author

rongou commented Apr 23, 2020

@harrism
Member

harrism commented Apr 23, 2020

Wow, I never noticed that... Interesting. But do you really want to synchronize on EVERY allocation?

@rongou
Contributor Author

rongou commented Apr 23, 2020

The current CNMeM implementation is definitely not optimized for PTDS. Once we have it enabled, we can probably try to improve it. Also the current algorithm may not be the best at reducing fragmentation. We might want to take a page from jemalloc (http://jemalloc.net/).

@harrism
Member

harrism commented Apr 23, 2020

I think we'd prefer not to put energy into cnmem. The plan is to keep improving RMM's device_memory_resource classes and remove cnmem. E.g. pool_memory_resource is already better than cnmem (except it doesn't have the hack to enable PTDS).

jemalloc has a lot of pages. Which one are you referring to?

@revans2
Contributor

revans2 commented Apr 23, 2020

I think we can make PTDS work if we use events and event synchronization instead of stream synchronization in the allocator.

The issue we are trying to protect against is:

  1. Stream A allocates some memory.
  2. Stream A launches an async kernel to put a result in that memory.
  3. Stream A launches another async op to copy that data out of that memory.
  4. Stream A frees the memory.
  5. Stream B allocates some memory (and is handed back part of A's allocation).
  6. Stream B writes something into that memory.
  7. Stream A's operations complete and the data is corrupted.

CUDA protects against this by essentially synchronizing on the device when a free happens, before the memory is allocated again.

If we know the memory was intended to be used on stream A, then when we free it we can insert an event into stream A and not hand that memory to another stream until the event has completed.

The hack for PTDS would be to treat all allocations as being on different streams, because we just don't know. With Mark's pooling allocator this opens things up: we can walk the free list and poll for a block whose event has completed instead of blocking, and only block if none are ready. This should make the common case very fast, even with PTDS.
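
To make that concrete, here is a rough sketch (illustrative names and data structures, not RMM's actual interface) of recording an event on the freeing stream at deallocation time and polling it before handing the block to another thread:

```cpp
// Illustrative sketch of the event-per-freed-block idea (not RMM's API).
#include <cuda_runtime.h>
#include <cstddef>
#include <deque>

struct free_block {
  void*       ptr;
  std::size_t size;
  cudaEvent_t ready;  // recorded on the freeing stream at deallocation time
};

static std::deque<free_block> free_list;  // assume the caller holds a mutex

// Free: enqueue an event on the stream that last used the block, so we know
// when outstanding work on that memory has finished.
void deallocate(void* ptr, std::size_t size, cudaStream_t stream) {
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
  cudaEventRecord(ev, stream);
  free_list.push_back({ptr, size, ev});
}

// Allocate: poll for a block whose event has completed instead of blocking;
// the caller only blocks (or grows the pool) if nothing is ready.
void* try_reuse(std::size_t size) {
  for (auto it = free_list.begin(); it != free_list.end(); ++it) {
    if (it->size < size) continue;
    if (cudaEventQuery(it->ready) == cudaSuccess) {  // non-blocking check
      void* p = it->ptr;
      cudaEventDestroy(it->ready);
      free_list.erase(it);
      return p;
    }
  }
  return nullptr;  // nothing ready: block on an event or allocate fresh memory
}
```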

@jrhemstad
Contributor

I think we can make PTDS work if we use events and event synchronization instead of stream synchronization in the allocator.

Yep, this is what @harrism and I have talked about. You create and enqueue an event every time a block of memory is freed and associate it with that block. Then you need to wait on that event before reclaiming it.

There are some additional warts: any time you want to coalesce a block with other blocks, you need to wait on their events, which introduces more synchronization and can cause freeing to no longer be asynchronous.

It's certainly possible, just haven't done it yet.
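
A small sketch of that coalescing wart, using an illustrative block/event representation (not actual RMM code):

```cpp
// Merging two adjacent free blocks is only safe once BOTH of their recorded
// events have completed, so a free that triggers coalescing may block.
#include <cuda_runtime.h>
#include <cstddef>

struct tracked_block {
  void*       ptr;
  std::size_t size;
  cudaEvent_t ready;  // recorded on the stream that last touched the block
};

tracked_block coalesce(tracked_block left, tracked_block right) {
  cudaEventSynchronize(left.ready);   // host blocks here if work is pending
  cudaEventSynchronize(right.ready);
  cudaEventDestroy(right.ready);
  // Both events are complete, so the merged block is immediately reusable.
  return {left.ptr, left.size + right.size, left.ready};
}
```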

@leofang
Member

leofang commented Apr 23, 2020

Just a drive-by question, perhaps @jakirkham has given some thoughts: what happens if rmm is PTDS-enabled but other libraries aren't? This could happen in the Python world when, say, coupling rmm to CuPy.

@kkraus14
Contributor

Just a drive-by question, perhaps @jakirkham has given some thoughts: what happens if rmm is PTDS-enabled but other libraries aren't? This could happen in the Python world when, say, coupling rmm to CuPy.

We haven't crossed that road quite yet, but I imagine it will cause issues 😅

@rongou
Contributor Author

rongou commented Apr 23, 2020

I'm not sure how expensive it is to create and destroy events. Maybe we need an event pool. :)

@jrhemstad
Contributor

I'm not sure how expensive it is to create and destroy events. Maybe we need an event pool. :)

Creating and enqueuing events is "free". Waiting on them is not since it's a synchronization.

@jakirkham
Member

Thanks Leo! Yeah, I'm generally aware this is happening, but I don't think we are planning on using this for Python yet (as Keith said).

@rongou rongou changed the title add option for per-thread default stream [REVIEW] add option for per-thread default stream Apr 24, 2020
@rongou
Contributor Author

rongou commented Apr 24, 2020

@harrism I've added the option to tests and benchmarks. Please take another look. This PR is pretty innocuous. Do you think we can get it merged before attempting more sophisticated PTDS support?

As for jemalloc, a couple of things we can borrow are per-thread arenas that allocate small blocks so they don't have to lock, and perhaps first-fit for larger blocks.
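
For illustration only, a sketch of the jemalloc-style per-thread arena idea (all names and the size threshold are hypothetical, and the shared pool is stood in by operator new): small requests are served from a thread_local cache without taking a lock, while larger ones fall through to a shared, locked path.

```cpp
#include <cstddef>
#include <mutex>
#include <new>
#include <utility>
#include <vector>

constexpr std::size_t small_threshold = 1 << 16;  // 64 KiB, arbitrary cutoff

std::mutex pool_mutex;

// Stand-in for the shared (locked) pool; a real version would carve device
// memory out of a large pre-allocated region.
void* allocate_from_shared_pool(std::size_t size) {
  std::lock_guard<std::mutex> lock(pool_mutex);
  return ::operator new(size);
}

// Thread-private cache of small blocks: first-fit search, no locking.
struct thread_arena {
  std::vector<std::pair<void*, std::size_t>> free_small;

  void* allocate(std::size_t size) {
    for (auto it = free_small.begin(); it != free_small.end(); ++it) {
      if (it->second >= size) {
        void* p = it->first;
        free_small.erase(it);
        return p;
      }
    }
    return allocate_from_shared_pool(size);  // refill path takes the lock
  }

  void deallocate(void* p, std::size_t size) { free_small.emplace_back(p, size); }
};

void* allocate(std::size_t size) {
  thread_local thread_arena arena;  // one arena per thread
  if (size <= small_threshold) { return arena.allocate(size); }
  return allocate_from_shared_pool(size);
}
```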

@jakirkham
Member

Does CNMeM have any global state? If not, would it be possible to just use a different CNMeM pool per thread?

@rongou
Contributor Author

rongou commented Apr 25, 2020

Yes, if 1/n of the memory is large enough. Stealing memory from other threads becomes tricky, though.

@jakirkham
Member

Does the pointer to a memory allocation remain the same if it crosses threads when PTDS is enabled? Or does PTDS affect how memory is addressed per thread?

@revans2
Contributor

revans2 commented Apr 25, 2020

Allocating 1/n of the memory per thread is not acceptable. First of all, there is data skew, and we don't want to hard-partition the memory like that. Java uses a huge number of threads. For Spark in particular, we schedule more threads than can fit on the GPU so that we can overlap I/O on the CPU with computation on the GPU. It would be a massive undertaking to try to shoehorn Spark's threading model into this scheme.

@harrism
Member

harrism commented Apr 27, 2020

Please add a changelog entry if you want CI to run. Otherwise we'll get nowhere. :)

@jlowe
Contributor

jlowe commented Apr 27, 2020

Does the pointer to a memory allocation remain the same if it crosses threads when PTDS is enabled?

Yes, PTDS does not create a separate address space per thread, just a separate, asynchronous CUDA stream per thread. Device address mappings remain the same for the entire process.

However, exchanging device memory addresses between threads will require application-level event or stream synchronization to be safe, since separate threads will no longer be issuing to the same stream as they do today without PTDS.
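
A minimal sketch of that application-level synchronization (the function names and event setup are illustrative): the producing thread records an event on its per-thread default stream, and the consuming thread makes its own default stream wait on that event before touching the pointer.

```cpp
// Compile with -DCUDA_API_PER_THREAD_DEFAULT_STREAM so stream 0 below means
// "this thread's default stream".
#include <cuda_runtime.h>

cudaEvent_t handoff_ready;  // shared by both threads

void init() { cudaEventCreate(&handoff_ready); }

// Thread A: produce data on its default stream, then record the event.
void producer(float* d_out) {
  // ... kernels writing d_out launched on stream 0 (thread A's default stream) ...
  cudaEventRecord(handoff_ready, 0);
}

// Thread B: make its default stream wait for A's work; no host blocking.
void consumer(float* d_in) {
  cudaStreamWaitEvent(0, handoff_ready, 0);
  // ... kernels reading d_in launched on stream 0 (thread B's default stream) ...
}
```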

@rongou
Contributor Author

rongou commented Apr 27, 2020

@harrism I added this PR to the changelog, but it looks like CI is still not running. Do you need to whitelist it?

@jrhemstad
Contributor

add to whitelist

@kkraus14
Contributor

rerun tests

@kkraus14
Contributor

add to whitelist

@harrism harrism merged commit fdb364a into rapidsai:branch-0.14 May 1, 2020
@rongou rongou deleted the per-thread-default-stream branch May 21, 2020 21:10