add cuStreamSync for async cusolver functs #215

Merged: 8 commits into uxlfoundation:develop on Sep 2, 2022

Conversation

@JackAKirk (Contributor) commented Jul 29, 2022

Signed-off-by: JackAKirk [email protected]

Description

This is a bug fix for failures first identified after the multi-stream implementation of the CUDA backend in intel/llvm (failures identified here: #209 (comment)).
The failed tests are due to the lack of a stream synchronization after some cuSOLVER interop functions, such as cusolverDnSgetrf, are called from lapack::cusolver::getrf. Before the multi-stream implementation, all queues were effectively in-order in the CUDA backend of intel/llvm, so syncing streams returned from a queue without the in_order queue property was not necessary.
The fix is to call:

    cudaStream_t currentStreamId;
    CUSOLVER_ERROR_FUNC(cusolverDnGetStream, err, handle, &currentStreamId);
    cuStreamSynchronize(currentStreamId);

after the cusolver functions. Since some cusolver functions are apparently asynchronous (and we can't know for sure from the docs which, if any, are not), we have to synchronize the stream after it is used in cusolver calls.
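For context, here is a minimal sketch of where that synchronization sits, assuming a DPC++-style interop host task; `queue`, `handle`, and the lambda shape are illustrative (CUSOLVER_ERROR_FUNC is the error-check macro from the snippet above), not the exact oneMKL source:

    queue.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](sycl::interop_handle ih) {
            cusolverStatus_t err;
            // ... acquire the cusolverDnHandle_t `handle` bound to this
            //     queue's native stream and enqueue the cuSOLVER call,
            //     e.g. cusolverDnSgetrf(...) ...

            // The cuSOLVER call may return before the GPU work finishes,
            // so block on the handle's stream before the host task ends.
            cudaStream_t currentStreamId;
            CUSOLVER_ERROR_FUNC(cusolverDnGetStream, err, handle, &currentStreamId);
            cuStreamSynchronize(currentStreamId);
        });
    });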

@ericlars (Contributor) commented Jul 30, 2022

> One solution would be to only sync the stream used by the interop onemkl_cusolver_host_task by calling:
>
>     cudaStream_t currentStreamId;
>     CUSOLVER_ERROR_FUNC(cusolverDnGetStream, err, handle, &currentStreamId);
>     cuStreamSynchronize(currentStreamId);
>
> at the end of the onemkl_cusolver_host_task.

Would this achieve asynchronous submissions?

> However, I find that in all cases so far it is faster (or the same) to simply call queue.wait() (which will sync all streams associated with the queue) instead of using the above interop route to select the single offending stream.
>
> I have only fixed the cases that led to the failed tests so far. However, I think the test coverage is probably just not using large enough problems in some other cases to trigger the failure; therefore it may be a good idea to always call queue.wait() after onemkl_cusolver_host_task is used.
>
> If we go this route it may be best to roll the wait call into onemkl_cusolver_host_task. I made a temporary PR (#217) illustrating this.
>
> Note that it seems you can't rely on depends_on, since this will not sync the stream used in the interop.

Interesting. Can you provide a reproducer?

@JackAKirk (Contributor, Author) commented Aug 1, 2022

> Would this achieve asynchronous submissions?

Actually, just from the observation that it fixes the test failures, I think it must be effectively blocking future submissions (at least those touching the same memory as the cuSOLVER function) until the native stream used by the cuSOLVER functions has finished its work. (We may be observing this blocking behaviour because the context could be created with the CU_CTX_BLOCKING_SYNC flag: see https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html#group__CUDA__TYPES_1gg9f889e28a45a295b5c8ce13aa05f6cd462aebfe6432ade3feb32f1a409027852.) For the use cases we are interested in, where the stream touches memory that other streams could touch, I think we actually want to call queue.wait() anyway, because we don't want any stream to touch that memory until the native stream is finished.

> Interesting. Can you provide a reproducer?

The reproducer is the getrf tests that call the getrf function using depends_on here: https://github.com/oneapi-src/oneMKL/blob/61312ed98b8208999f99474778d46919c30ef15b/src/lapack/backends/cusolver/cusolver_lapack.cpp#L1350

If depends_on were syncing the stream, the corresponding tests wouldn't fail.
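To spell out the failure mode as a sketch (hypothetical event name `e1`, host-task body elided): the event returned for a host task completes when the host lambda returns, which can be before the asynchronous cuSOLVER kernel on the native stream has finished, so depends_on orders against the lambda, not against the stream's work.

    sycl::event e1 = queue.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](sycl::interop_handle ih) {
            // cusolverDnSgetrf(...) may return before the GPU work is done;
            // without a stream sync here, e1 completes too early.
        });
    });
    queue.submit([&](sycl::handler &cgh) {
        cgh.depends_on(e1); // waits for the lambda, not the native stream
        // ... work that reads getrf's output can race with the cuSOLVER kernel
    });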

@JackAKirk (Contributor, Author) commented Aug 9, 2022

I've now added blocking waits (using queue.wait()) to all places where they were missing. I think this is required for generally correct behaviour: we should block all streams in a queue from touching memory in a subsequent command group until an initially submitted command group that touches the same memory is complete.
In the cases where depends_on was used, I think the expectation was that it would achieve a blocking wait, so I'll set up an internal issue to investigate further why depends_on is apparently not working when used with a host task.
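As a sketch, the queue.wait() variant looks like this (host-task body elided; illustrative, not the oneMKL source):

    sycl::event e = queue.submit([&](sycl::handler &cgh) {
        cgh.host_task([=](sycl::interop_handle ih) {
            // ... cuSOLVER call on the handle's stream ...
        });
    });
    // Block the host until every native stream associated with the queue
    // is idle, not just the one used by the interop host task.
    queue.wait();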

@JackAKirk marked this pull request as draft August 9, 2022 11:25
@JackAKirk (Contributor, Author) commented:

I've found out that these cuSOLVER functions are apparently asynchronous, even though the Nvidia documentation implies that they are synchronous; therefore I think depends_on is behaving correctly. Also, I've been told that the correct procedure is to always synchronize the native stream directly within the host task if it has been used for asynchronous execution: it is up to a SYCL implementation to decide what it returns for a native stream, so it is not guaranteed that a call to queue.wait() will actually synchronize the native stream.

Therefore I will update the changes to synchronize the native stream within the host_task and then use depends_on instead of queue.wait().
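Sketched under the same assumptions as the snippets above (`dependencies` is an illustrative list of sycl::event), the final pattern keeps depends_on for SYCL-level ordering and synchronizes the native stream inside the host task, so the host-task event only completes once the GPU work is done:

    queue.submit([&](sycl::handler &cgh) {
        cgh.depends_on(dependencies); // SYCL-level ordering between command groups
        cgh.host_task([=](sycl::interop_handle ih) {
            cusolverStatus_t err;
            // ... cuSOLVER call on `handle` ...
            cudaStream_t currentStreamId;
            CUSOLVER_ERROR_FUNC(cusolverDnGetStream, err, handle, &currentStreamId);
            // The event for this host task now really means "result ready",
            // so downstream depends_on works as expected.
            cuStreamSynchronize(currentStreamId);
        });
    });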

@JackAKirk marked this pull request as ready for review August 9, 2022 14:13
@JackAKirk (Contributor, Author) commented:

> I've found out that these cuSOLVER functions are apparently asynchronous, even though the Nvidia documentation implies that they are synchronous; therefore I think depends_on is behaving correctly. Also, I've been told that the correct procedure is to always synchronize the native stream directly within the host task if it has been used for asynchronous execution: it is up to a SYCL implementation to decide what it returns for a native stream, so it is not guaranteed that a call to queue.wait() will actually synchronize the native stream.
>
> Therefore I will update the changes to synchronize the native stream within the host_task and then use depends_on instead of queue.wait().

I've done this now.

@JackAKirk marked this pull request as draft August 18, 2022 10:26
@JackAKirk changed the title from "add queue.wait() to sync interop stream" to "add cuStreamSync for async cusolver functs" Aug 19, 2022
@JackAKirk marked this pull request as ready for review August 19, 2022 14:17
@JackAKirk (Contributor, Author) commented:

@AidanBeltonS could you check this is all OK? Thanks

@AidanBeltonS (Contributor) commented:

> @AidanBeltonS could you check this is all OK? Thanks

LGTM.
I think this approach is the simplest way to handle the asynchronous behaviour of cuSOLVER functions.

@AidanBeltonS (Contributor) commented:

@ericlars what do you think of this solution?
I believe this is necessary: reading through the cuSOLVER docs, it seems these operations execute asynchronously, although in most cases they don't. Docs: https://docs.nvidia.com/cuda/cusolver/index.html#asynchronous-execution

This means the host_task can finish before the cuSOLVER operation has completed. Calling a synchronizing step after the cuSOLVER call keeps the behaviour as expected. While we could synchronize after the host_task, this causes issues if additional CUDA code that depends on the cuSOLVER output needs to be placed after the cuSOLVER call.

@ericlars (Contributor) left a comment:

Apologies for the delayed response; I've been on an extended vacation. This looks like a really elegant solution, thanks for working on it. I have a better appreciation for the difficulties of asynchronicity and CUDA now.

@ericlars (Contributor) commented Sep 2, 2022

attaching log: log_llvm_cusolver_.txt

@ericlars merged commit 634d7d2 into uxlfoundation:develop on Sep 2, 2022