(WIP) Overlap the CUDA data transfers with computations with cudaMemcpyAsync #884

kloudkl · 2014-08-08T07:21:20Z

This is a first step toward copying data between the CPU and multiple GPUs, which is required by #876. Even with only a single GPU, non-blocking memory copies can still speed things up.

The underlying theory is explained in "How to Overlap Data Transfers in CUDA C/C++".

Not tested on GPU yet.

amsword · 2014-08-08T11:08:50Z

Could you please run experiments on a GPU? I once tried to implement asynchronous data transfer from CPU to GPU, but eventually gave up.

The reason is that all the other GPU calls run on the default stream. The linked article on overlapping data transfers specifies:

> No operation in the default stream will begin until all previously issued operations in any stream on the device have completed, and an operation in the default stream must complete before any other operation (in any stream on the device) will begin.

Thus, the non-default stream (the data transfer stream) cannot run in parallel with the default stream. In my implementation, I found that although asynchronous transfer reduced the time spent on data copies, the time spent in other routines increased. Overall, the total time was almost the same.
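A minimal sketch of the situation described above, not code from this PR: a copy is issued on a side stream while a kernel runs on the default stream. The kernel `scale`, the helper `run_default_stream_demo`, and the buffer names are hypothetical. With a stream created using `cudaStreamDefault` flags, the legacy default stream synchronizes with it implicitly, so the two operations serialize. A stream created with `cudaStreamNonBlocking` is exempt from that implicit synchronization, though whether the copy and kernel actually overlap still depends on the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the enclosing function with status 1 on any CUDA error.
#define CUDA_OK(call) do { if ((call) != cudaSuccess) return 1; } while (0)

// Hypothetical kernel standing in for any layer's computation.
__global__ void scale(float* x, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

// stream_flags: cudaStreamDefault (blocking) or cudaStreamNonBlocking.
// Returns 0 if both the copy and the kernel produced the expected data.
int run_default_stream_demo(unsigned int stream_flags) {
  const int n = 1 << 20;
  const size_t bytes = n * sizeof(float);
  float *h_next = nullptr, *h_out = nullptr, *d_next = nullptr, *d_cur = nullptr;
  // Pinned host memory; pageable memory would make the "async" copy stage through a buffer.
  CUDA_OK(cudaMallocHost((void**)&h_next, bytes));
  CUDA_OK(cudaMallocHost((void**)&h_out, bytes));
  CUDA_OK(cudaMalloc((void**)&d_next, bytes));
  CUDA_OK(cudaMalloc((void**)&d_cur, bytes));
  for (int i = 0; i < n; ++i) h_next[i] = 1.0f;
  CUDA_OK(cudaMemcpy(d_cur, h_next, bytes, cudaMemcpyHostToDevice));

  cudaStream_t copy_stream;
  CUDA_OK(cudaStreamCreateWithFlags(&copy_stream, stream_flags));

  // "Next batch" transfer on the side stream.
  CUDA_OK(cudaMemcpyAsync(d_next, h_next, bytes, cudaMemcpyHostToDevice, copy_stream));
  // "Current batch" computation on the default stream. With a blocking
  // copy_stream and legacy default-stream semantics, this waits for the copy.
  scale<<<(n + 255) / 256, 256>>>(d_cur, n, 2.0f);
  CUDA_OK(cudaGetLastError());
  CUDA_OK(cudaDeviceSynchronize());  // wait for both the copy and the kernel

  bool ok = true;
  CUDA_OK(cudaMemcpy(h_out, d_cur, bytes, cudaMemcpyDeviceToHost));
  for (int i = 0; i < n; ++i) ok = ok && (h_out[i] == 2.0f);
  CUDA_OK(cudaMemcpy(h_out, d_next, bytes, cudaMemcpyDeviceToHost));
  for (int i = 0; i < n; ++i) ok = ok && (h_out[i] == 1.0f);

  CUDA_OK(cudaStreamDestroy(copy_stream));
  CUDA_OK(cudaFree(d_next));
  CUDA_OK(cudaFree(d_cur));
  CUDA_OK(cudaFreeHost(h_next));
  CUDA_OK(cudaFreeHost(h_out));
  return ok ? 0 : 1;
}

int main() {
  int blocking = run_default_stream_demo(cudaStreamDefault);
  int non_blocking = run_default_stream_demo(cudaStreamNonBlocking);
  printf("blocking status: %d, non-blocking status: %d\n", blocking, non_blocking);
  return blocking | non_blocking;
}
```

Both variants compute the same results; the difference is only in scheduling, which a profiler timeline would show. Since the rest of the GPU code in question launches on the default stream, the non-blocking flag is one possible way to avoid touching every kernel launch, but that is an assumption to be verified by measurement.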

To implement asynchronous data transfer, the following three steps may be required:

1. Allocate a non-default stream for data transfer.
2. Allocate a different non-default stream for all GPU computation. This may require modifying all data layers.
3. Synchronize with the asynchronous data transfer before the computation accesses the data.
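The three steps above can be sketched roughly as follows. This is an illustrative, standalone example under assumptions, not the implementation in this PR: the kernel `add_one`, the helper `run_two_stream_pipeline`, the double-buffering scheme, and the event names are all hypothetical. It uses one stream for transfers, a second stream for computation, and events so that computation waits for its input and a buffer is not overwritten before it has been consumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the enclosing function with status 1 on any CUDA error.
#define CUDA_OK(call) do { if ((call) != cudaSuccess) return 1; } while (0)

// Hypothetical kernel standing in for the network's computation on one batch.
__global__ void add_one(float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += 1.0f;
}

// Returns 0 if every batch was transferred and processed correctly.
int run_two_stream_pipeline() {
  const int n = 1 << 18;
  const int kBatches = 4;
  const size_t bytes = n * sizeof(float);

  // Pinned host buffers; batch k is filled with the value k.
  float *h_in = nullptr, *h_out = nullptr;
  CUDA_OK(cudaMallocHost((void**)&h_in, kBatches * bytes));
  CUDA_OK(cudaMallocHost((void**)&h_out, kBatches * bytes));
  for (int i = 0; i < kBatches * n; ++i) h_in[i] = (float)(i / n);

  // Two device buffers so batch k+1 can be copied while batch k is computed.
  float* d_buf[2];
  cudaEvent_t copied[2], consumed[2];
  for (int b = 0; b < 2; ++b) {
    CUDA_OK(cudaMalloc((void**)&d_buf[b], bytes));
    CUDA_OK(cudaEventCreateWithFlags(&copied[b], cudaEventDisableTiming));
    CUDA_OK(cudaEventCreateWithFlags(&consumed[b], cudaEventDisableTiming));
  }

  // Step 1: a non-default stream for data transfer.
  // Step 2: a different non-default stream for computation.
  cudaStream_t copy_stream, compute_stream;
  CUDA_OK(cudaStreamCreateWithFlags(&copy_stream, cudaStreamNonBlocking));
  CUDA_OK(cudaStreamCreateWithFlags(&compute_stream, cudaStreamNonBlocking));

  for (int k = 0; k < kBatches; ++k) {
    const int b = k % 2;
    // Do not overwrite buffer b until batch k-2 has been fully consumed.
    if (k >= 2) CUDA_OK(cudaStreamWaitEvent(copy_stream, consumed[b], 0));
    CUDA_OK(cudaMemcpyAsync(d_buf[b], h_in + (size_t)k * n, bytes,
                            cudaMemcpyHostToDevice, copy_stream));
    CUDA_OK(cudaEventRecord(copied[b], copy_stream));

    // Step 3: computation waits until its input has arrived.
    CUDA_OK(cudaStreamWaitEvent(compute_stream, copied[b], 0));
    add_one<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_buf[b], n);
    CUDA_OK(cudaGetLastError());
    CUDA_OK(cudaMemcpyAsync(h_out + (size_t)k * n, d_buf[b], bytes,
                            cudaMemcpyDeviceToHost, compute_stream));
    CUDA_OK(cudaEventRecord(consumed[b], compute_stream));
  }
  CUDA_OK(cudaStreamSynchronize(compute_stream));
  CUDA_OK(cudaStreamSynchronize(copy_stream));

  // Batch k was filled with k, so every element should now be k + 1.
  bool ok = true;
  for (int k = 0; k < kBatches; ++k)
    for (int j = 0; j < n; ++j)
      ok = ok && (h_out[(size_t)k * n + j] == (float)(k + 1));

  for (int b = 0; b < 2; ++b) {
    CUDA_OK(cudaEventDestroy(copied[b]));
    CUDA_OK(cudaEventDestroy(consumed[b]));
    CUDA_OK(cudaFree(d_buf[b]));
  }
  CUDA_OK(cudaStreamDestroy(copy_stream));
  CUDA_OK(cudaStreamDestroy(compute_stream));
  CUDA_OK(cudaFreeHost(h_in));
  CUDA_OK(cudaFreeHost(h_out));
  return ok ? 0 : 1;
}

int main() {
  int status = run_two_stream_pipeline();
  printf("pipeline status: %d\n", status);
  return status;
}
```

In a real framework, step 2 is the expensive part: every kernel launch and library call would need to target the compute stream, which matches the concern above about modifying all layers.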

kloudkl · 2014-10-13T15:42:37Z

Not sure if #1148 does this.

shelhamer · 2015-03-07T07:20:57Z

This has been handled separately by parallel branch(es). Expect PRs when everything is tuned and checks out.

Overlap the CUDA data transfers with computations with cudaMemcpyAsync

9c92b41

Yangqing changed the title ~~Overlap the CUDA data transfers with computations with cudaMemcpyAsync~~ (WIP) Overlap the CUDA data transfers with computations with cudaMemcpyAsync Aug 8, 2014

shelhamer force-pushed the dev branch 3 times, most recently from 4278286 to c01f07a Compare August 28, 2014 07:00

shelhamer force-pushed the dev branch from 64258b6 to 403b56b Compare September 19, 2014 04:38

shelhamer force-pushed the dev branch from d8eb4df to 914da95 Compare October 8, 2014 16:36

cypof mentioned this pull request Oct 17, 2014

Parallel / distributed training #1148

Closed

6 tasks

sergeyk force-pushed the dev branch from 2fb4c97 to 1718903 Compare October 17, 2014 18:44

shelhamer closed this Mar 7, 2015
