
(WIP) Overlap the CUDA data transfers with computations with cudaMemcpyAsync #884


Closed
wants to merge 1 commit into from

Conversation

kloudkl
Contributor

@kloudkl kloudkl commented Aug 8, 2014

This is a first step toward enabling data copies between the CPU and multiple GPUs, which is required by #876. Even with only a single GPU, non-blocking memory copies can still improve speed.

The theory behind this is explained in "How to Overlap Data Transfers in CUDA C/C++".

Not tested on GPU yet.
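For reference, here is a minimal sketch of the basic technique from that post, not code from this PR: a non-blocking copy needs page-locked (pinned) host memory, cudaMemcpyAsync, and a non-default stream, with an explicit synchronization before the host touches the result. The kernel and variable names are illustrative only.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel standing in for real GPU work.
__global__ void scale_kernel(float* d, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] *= a;
}

int main() {
  const int n = 1 << 20;
  float* h_data;
  float* d_data;
  // cudaMemcpyAsync only runs asynchronously with respect to the host
  // when the host buffer is pinned, hence cudaMallocHost instead of malloc.
  cudaMallocHost((void**)&h_data, n * sizeof(float));
  cudaMalloc((void**)&d_data, n * sizeof(float));
  for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Issue the copies and the kernel on the same non-default stream: the host
  // is not blocked, and work in other streams may overlap with the transfers.
  cudaMemcpyAsync(d_data, h_data, n * sizeof(float),
                  cudaMemcpyHostToDevice, stream);
  scale_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, 2.0f);
  cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);

  // Synchronize before the host reads the result.
  cudaStreamSynchronize(stream);
  printf("h_data[0] = %f\n", h_data[0]);

  cudaStreamDestroy(stream);
  cudaFreeHost(h_data);
  cudaFree(d_data);
  return 0;
}
```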

@amsword

amsword commented Aug 8, 2014

Could you please run experiments on a GPU? I once tried to implement asynchronous data transfer from CPU to GPU, but eventually gave up.

The reason is that all the other GPU calls use the default stream. As the linked post on overlapping data transfers states:
No operation in the default stream will begin until all previously issued operations in any stream on the device have completed, and an operation in the default stream must complete before any other operation (in any stream on the device) will begin.

Thus, the non-default (data transfer) stream cannot run in parallel with the default stream. In my implementation, I found that although asynchronous transfer reduced the time spent copying data, the time spent in the other routines increased, so the overall time cost was almost the same.
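To make the problem concrete, here is a hypothetical illustration, not code from this PR, of why the overlap fails under the legacy default-stream behavior: a kernel launched on the default stream (as the existing layers do) will not start until the async copy issued on the other stream has completed.

```cpp
#include <cuda_runtime.h>

// Placeholder for a layer's kernel launched on the default stream.
__global__ void dummy_layer_kernel(float* d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] += 1.0f;
}

// h_pinned_next must be pinned host memory; d_next / d_current are device buffers.
void no_overlap_example(const float* h_pinned_next, float* d_next,
                        float* d_current, int n) {
  cudaStream_t copy_stream;
  cudaStreamCreate(&copy_stream);

  // Prefetch the next batch on a non-default stream...
  cudaMemcpyAsync(d_next, h_pinned_next, n * sizeof(float),
                  cudaMemcpyHostToDevice, copy_stream);

  // ...but this launch uses the default stream, so the device serializes it
  // after the copy above: no overlap is achieved.
  dummy_layer_kernel<<<(n + 255) / 256, 256>>>(d_current, n);

  cudaDeviceSynchronize();
  cudaStreamDestroy(copy_stream);
}
```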

To implement the asynchronous data transfer, the following three steps may be required (a sketch follows the list).

  1. Allocate a non-default stream for data transfer.
  2. Assign a different non-default stream to all GPU compute kernels. This may require modifying all data layers.
  3. Synchronize the asynchronous data transfer before the data is accessed by a compute kernel.
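A minimal sketch of those three steps, assuming a dedicated transfer stream, a separate compute stream, and an event for synchronization (function and kernel names are illustrative, not Caffe's API):

```cpp
#include <cuda_runtime.h>

// Stand-in for a layer's forward kernel.
__global__ void forward_kernel(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;
}

// h_pinned must be pinned host memory; d_in / d_out are device buffers.
void prefetch_and_compute(const float* h_pinned, float* d_in, float* d_out, int n) {
  // Step 1: a non-default stream dedicated to data transfer.
  cudaStream_t copy_stream;
  cudaStreamCreate(&copy_stream);
  // Step 2: a different non-default stream for the compute kernels
  // (in Caffe this would mean routing the layers' launches onto it).
  cudaStream_t compute_stream;
  cudaStreamCreate(&compute_stream);

  cudaEvent_t input_ready;
  cudaEventCreateWithFlags(&input_ready, cudaEventDisableTiming);

  // The copy does not block the host, and work already queued on
  // compute_stream can overlap with this transfer.
  cudaMemcpyAsync(d_in, h_pinned, n * sizeof(float),
                  cudaMemcpyHostToDevice, copy_stream);
  cudaEventRecord(input_ready, copy_stream);

  // Step 3: synchronize before the data is consumed. The compute stream
  // waits only for this transfer; the host thread stays free.
  cudaStreamWaitEvent(compute_stream, input_ready, 0);
  forward_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_in, d_out, n);

  cudaStreamSynchronize(compute_stream);
  cudaEventDestroy(input_ready);
  cudaStreamDestroy(copy_stream);
  cudaStreamDestroy(compute_stream);
}
```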

@Yangqing Yangqing changed the title from "Overlap the CUDA data transfers with computations with cudaMemcpyAsync" to "(WIP) Overlap the CUDA data transfers with computations with cudaMemcpyAsync" on Aug 8, 2014
@shelhamer shelhamer force-pushed the dev branch 3 times, most recently from 4278286 to c01f07a on August 28, 2014 07:00
@kloudkl
Contributor Author

kloudkl commented Oct 13, 2014

Not sure if #1148 does this.

@shelhamer
Member

This has been handled separately by parallel branch(es). Expect PRs when everything is tuned and checks out.

@shelhamer shelhamer closed this Mar 7, 2015