add cuStreamSync for async cusolver functs #215

Merged: 8 commits into uxlfoundation:develop on Sep 2, 2022

Conversation

@JackAKirk (Contributor) commented Jul 29, 2022

Signed-off-by: JackAKirk [email protected]

Description

This is a bug fix for failures first identified after the multi-stream implementation of the CUDA backend in intel/llvm (failures identified here: #209 (comment)).
The failed tests are due to the lack of a stream synchronization after some cuSOLVER interop functions, such as cusolverDnSgetrf, are called from lapack::cusolver::getrf. Before the multi-stream implementation, all queues were effectively in-order in the CUDA backend of intel/llvm, so syncing streams returned from a queue without the in_order queue property was not necessary.
The fix is to call:

    cudaStream_t currentStreamId;
    CUSOLVER_ERROR_FUNC(cusolverDnGetStream, err, handle, &currentStreamId);
    cuStreamSynchronize(currentStreamId);

after the cusolver functions. Since some cusolver functions are apparently asynchronous (and we can't know for sure from the docs which, if any, are not), we have to synchronize the stream after it is used in cusolver calls.
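For context, here is a minimal sketch of where that synchronization sits, assuming a DPC++-style interop host task; `queue`, `handle`, and the lambda shape are illustrative (CUSOLVER_ERROR_FUNC is the error-check macro from the snippet above), not the exact oneMKL source:

    queue.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](sycl::interop_handle ih) {
            cusolverStatus_t err;
            // ... acquire the cusolverDnHandle_t `handle` bound to this
            //     queue's native stream and enqueue the cuSOLVER call,
            //     e.g. cusolverDnSgetrf(...) ...

            // The cuSOLVER call may return before the GPU work finishes,
            // so block on the handle's stream before the host task ends.
            cudaStream_t currentStreamId;
            CUSOLVER_ERROR_FUNC(cusolverDnGetStream, err, handle, &currentStreamId);
            cuStreamSynchronize(currentStreamId);
        });
    });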

@ericlars (Contributor) commented Jul 30, 2022

> One solution would be to only sync the stream used by the interop onemkl_cusolver_host_task by calling:
>
>     cudaStream_t currentStreamId;
>     CUSOLVER_ERROR_FUNC(cusolverDnGetStream, err, handle, &currentStreamId);
>     cuStreamSynchronize(currentStreamId);
>
> at the end of the onemkl_cusolver_host_task.

Would this achieve asynchronous submissions?

> However, I find that in all cases so far it is faster (or the same) to simply call queue.wait() (which will sync all streams associated with the queue) instead of using the above interop route to select the single offending stream.
>
> I have only fixed the cases that led to the failed tests so far. However, I think the test coverage is probably just not using large enough problems in some other cases to trigger the failure; therefore it may be a good idea to always call queue.wait() after onemkl_cusolver_host_task is used.
>
> If we go this route it may be best to roll the wait call into onemkl_cusolver_host_task. I made a temporary PR (#217) illustrating this.
>
> Note that it seems you can't rely on depends_on, since this will not sync the stream used in the interop.

Interesting. Can you provide a reproducer?

@JackAKirk (Contributor, Author) commented Aug 1, 2022

> Would this achieve asynchronous submissions?

Actually, just from the observation that it fixes the test failures, I think it must be effectively blocking future submissions (at least those touching the same memory as the cuSOLVER function) until the native stream used by the cuSOLVER functions has finished its work. (We may be observing this blocking behaviour because the context could be created with the CU_CTX_BLOCKING_SYNC flag: see https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html#group__CUDA__TYPES_1gg9f889e28a45a295b5c8ce13aa05f6cd462aebfe6432ade3feb32f1a409027852.) For the use cases we are interested in, where the stream touches memory that other streams could touch, I think we actually want to call queue.wait() anyway, because we don't want any stream to touch that memory until the native stream is finished.

> Interesting. Can you provide a reproducer?

The reproducer is the getrf tests that call the getrf function using depends_on here: https://github.com/oneapi-src/oneMKL/blob/61312ed98b8208999f99474778d46919c30ef15b/src/lapack/backends/cusolver/cusolver_lapack.cpp#L1350

If depends_on were syncing the stream, the corresponding tests wouldn't fail.
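To spell out the failure mode as a sketch (hypothetical event name `e1`, host-task body elided): the event returned for a host task completes when the host lambda returns, which can be before the asynchronous cuSOLVER kernel on the native stream has finished, so depends_on orders against the lambda, not against the stream's work.

    sycl::event e1 = queue.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](sycl::interop_handle ih) {
            // cusolverDnSgetrf(...) may return before the GPU work is done;
            // without a stream sync here, e1 completes too early.
        });
    });
    queue.submit([&](sycl::handler &cgh) {
        cgh.depends_on(e1); // waits for the lambda, not the native stream
        // ... work that reads getrf's output can race with the cuSOLVER kernel
    });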

@JackAKirk (Contributor, Author) commented Aug 9, 2022

I've now added blocking waits (using queue.wait()) to all places where they were missing. I think this is required for generally correct behaviour: we should block all streams in a queue from touching memory in a subsequent command group until an initially submitted command group that touches the same memory is complete.
In the cases where depends_on was used, I think the expectation was that it would achieve a blocking wait, so I'll set up an internal issue to investigate further why depends_on is apparently not working when used with a host task.
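As a sketch, the queue.wait() variant looks like this (host-task body elided; illustrative, not the oneMKL source):

    sycl::event e = queue.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](sycl::interop_handle ih) {
            // ... cuSOLVER call on the handle's stream ...
        });
    });
    // Block the host until every native stream associated with the queue
    // is idle, not just the one used by the interop host task.
    queue.wait();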

@JackAKirk marked this pull request as draft August 9, 2022 11:25
@JackAKirk (Contributor, Author) commented:

I've found out that these cuSOLVER functions are apparently asynchronous, even though the Nvidia documentation implies that they are synchronous; therefore I think depends_on is behaving correctly. Also, I've been told that the correct procedure is to always synchronize the native stream directly within the host task if it has been used for asynchronous execution: it is up to a SYCL implementation to decide what it returns for a native stream, so it is not guaranteed that a call to queue.wait() will actually synchronize the native stream.

Therefore I will update the changes to synchronize the native stream within the host_task and then use depends_on instead of queue.wait().
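Sketched under the same assumptions as the snippets above (`dependencies` is an illustrative list of sycl::event), the final pattern keeps depends_on for SYCL-level ordering and synchronizes the native stream inside the host task, so the host-task event only completes once the GPU work is done:

    queue.submit([&](sycl::handler &cgh) {
        cgh.depends_on(dependencies); // SYCL-level ordering between command groups
        cgh.host_task([=](sycl::interop_handle ih) {
            cusolverStatus_t err;
            // ... cuSOLVER call on `handle` ...
            cudaStream_t currentStreamId;
            CUSOLVER_ERROR_FUNC(cusolverDnGetStream, err, handle, &currentStreamId);
            // The event for this host task now really means "result ready",
            // so downstream depends_on works as expected.
            cuStreamSynchronize(currentStreamId);
        });
    });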

@JackAKirk marked this pull request as ready for review August 9, 2022 14:13
@JackAKirk (Contributor, Author) commented:

> I've found out that these cuSOLVER functions are apparently asynchronous, even though the Nvidia documentation implies that they are synchronous; therefore I think depends_on is behaving correctly. Also, I've been told that the correct procedure is to always synchronize the native stream directly within the host task if it has been used for asynchronous execution: it is up to a SYCL implementation to decide what it returns for a native stream, so it is not guaranteed that a call to queue.wait() will actually synchronize the native stream.
>
> Therefore I will update the changes to synchronize the native stream within the host_task and then use depends_on instead of queue.wait().

I've done this now.

@JackAKirk marked this pull request as draft August 18, 2022 10:26
@JackAKirk changed the title from "add queue.wait() to sync interop stream" to "add cuStreamSync for async cusolver functs" Aug 19, 2022
@JackAKirk marked this pull request as ready for review August 19, 2022 14:17
@JackAKirk (Contributor, Author) commented:

@AidanBeltonS could you check this is all OK? Thanks

@AidanBeltonS (Contributor) commented:

> @AidanBeltonS could you check this is all OK? Thanks

LGTM.
I think this approach is the simplest way to handle the asynchronous behaviour of cuSOLVER functions.

@AidanBeltonS (Contributor) commented:

@ericlars what do you think of this solution?
I believe this is necessary: reading through the cuSOLVER docs, it seems these operations execute asynchronously, although in most cases they don't. Docs: https://docs.nvidia.com/cuda/cusolver/index.html#asynchronous-execution

This means the host_task can finish before the cuSOLVER operation has completed. Calling a synchronizing step after the cuSOLVER call keeps the behaviour as expected. While we could synchronize after the host_task, this causes issues if additional CUDA code that depends on the cuSOLVER output needs to be placed after the cuSOLVER call.

@ericlars (Contributor) left a comment:

Apologies for the delayed response; I've been on an extended vacation. This looks like a really elegant solution, thanks for working on it. I have a better appreciation for the difficulties of asynchronicity and CUDA now.

@ericlars (Contributor) commented Sep 2, 2022

attaching log: log_llvm_cusolver_.txt

@ericlars merged commit 634d7d2 into uxlfoundation:develop on Sep 2, 2022