
A problem in distributed training #6975

Closed
solin319 opened this issue Jul 10, 2017 · 5 comments

Comments

@solin319
Contributor

Environment info

Operating System: Ubuntu 14.04

Package used (Python/R/Scala/Julia): Python

MXNet version: 0.10.1

Or if installed from source: installed from source

If you are using python package, please provide

Python version and distribution: Python 2.7

Problem Message:

  1. When I train a model on multiple machines with MXNet-0.10.1, I find it slower than training with MXNet-0.9.4.
  2. I used the profiler to analyze the program. In MXNet-0.10.1, I find the grad_arrays are pushed to the kvstore only after the backward computation of all layers has finished, so the computation cannot overlap with the communication time.
  3. In MXNet-0.9.4, the grad_array of each convolution layer is pushed to the kvstore directly after that layer's backward computation.
    Why do I meet this problem, and how can I resolve it?
@solin319
Contributor Author

Delete "send_buf.WaitToRead();" in line 217 of the file 'src/kvstore/kvstore_dist.h' can solve the problem.
The compute can't cover the time of data communicate in backward.

@eric-haibin-lin
Member

eric-haibin-lin commented Jul 11, 2017

I don't know why the following chunk of code is not moved inside PushAsync(lambda, ...)

    // push to servers
    size_t size = send_buf.shape().Size();
    #if MKL_EXPERIMENTAL == 1
    mkl_set_tblob_eager_mode(send_buf.data());
    #endif
    real_t* data = static_cast<real_t*>(send_buf.data().dptr_);

If it is moved inside the lambda, the WaitToRead() call is not necessary.
Otherwise you need to wait to read, since you're accessing send_buf.data().
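
For illustration, a minimal sketch of what moving that chunk inside the lambda could look like (this is not the actual kvstore_dist.h code; the lambda signature and the ps_worker_ call are assumed from the surrounding discussion):

    // Hedged sketch: the pointer is taken inside the lambda, so the host thread
    // no longer needs an explicit send_buf.WaitToRead() before enqueueing.
    auto push_to_servers =
        [this, send_buf](RunContext rctx, Engine::CallbackOnComplete cb) {
          size_t size = send_buf.shape().Size();
    #if MKL_EXPERIMENTAL == 1
          mkl_set_tblob_eager_mode(send_buf.data());
    #endif
          real_t* data = static_cast<real_t*>(send_buf.data().dptr_);
          // ... build the ps keys/lengths and call ps_worker_->ZPush(...),
          // invoking cb() when the send completes.
        };

Since send_buf is declared as a read dependency of the async operation, the engine only runs the lambda once send_buf is ready to read, which is what makes the explicit WaitToRead() redundant.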

@solin319
Contributor Author

  1. Why does adding WaitToRead() block all the param_arrays until all the backward convolution layers have finished? The parameters of one layer cannot be pushed to the kvstore directly after that layer's backward computation.
    We can observe this behavior with the profiler when training an image classification model such as ResNet.
  2. The 'data' used in the lambda is a real_t pointer. In PushAsync, 'send_buf' is passed as a const var, so the engine waits for send_buf to be readable before processing the lambda (see the sketch below). Therefore, when the program processes the lambda, the pointer 'data' can point to send_buf.data(), which can safely be read. Why do we need to add 'send_buf.WaitToRead();' beforehand?
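
For reference, a sketch of the enqueue call implied above, assuming the engine's PushAsync(fn, ctx, const_vars, mutable_vars, ...) signature and the kvstore's pinned_ctx_ member; the const-vars list is the relevant part:

    // Hedged sketch of the call site: send_buf.var() is declared as a read
    // (const) dependency, so the engine will not execute push_to_servers
    // until all pending writes to send_buf have completed.
    Engine::Get()->PushAsync(
        push_to_servers,      // the lambda sketched earlier
        pinned_ctx_,          // execution context (assumed)
        {send_buf.var()},     // const vars: read-only dependencies
        {},                   // mutable vars: none
        FnProperty::kNormal, priority);

This read dependency is what point 2 relies on: the engine already serializes the lambda after any writes to send_buf, while an additional WaitToRead() blocks the calling thread itself.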

@idealboy

Sir, will you instruct me on distributed training on two machines? Thank you very much. I did it according to the official document, but it did not work on two machines.

@eric-haibin-lin
Member

@idealboy There's an example for running distributed training here: https://mxnet.incubator.apache.org/how_to/multi_devices.html
Please post your code / error message if it doesn't work for you.

The original issue should be resolved now with #7489, so I'm closing it for now.

For further discussions/questions, we're moving to https://discuss.mxnet.io/
