
A problem in distributed training #6975

Closed
solin319 opened this issue Jul 10, 2017 · 5 comments

Comments

@solin319
Contributor

Environment info

Operating System: Ubuntu 14.04

Package used (Python/R/Scala/Julia): Python

MXNet version: 0.10.1

Or if installed from source: installed from source

If you are using python package, please provide

Python version and distribution: Python 2.7

Problem Message:

  1. When I train a model on multiple machines with MXNet-0.10.1, I find it slower than training with MXNet-0.9.4.
  2. I used the profiler to analyze the program. In MXNet-0.10.1, I find the grad_arrays are pushed to the kvstore only after the backward computation of all layers has finished, so the computation cannot overlap with the communication time.
  3. In MXNet-0.9.4, the grad_array of each convolution layer is pushed to the kvstore directly after that layer's backward computation.
    Why do I meet this problem, and how can I resolve it?
@solin319
Contributor Author

Delete "send_buf.WaitToRead();" in line 217 of the file 'src/kvstore/kvstore_dist.h' can solve the problem.
The compute can't cover the time of data communicate in backward.

@eric-haibin-lin
Member

eric-haibin-lin commented Jul 11, 2017

I don't know why the following chunk of code is not moved inside PushAsync(lambda, ...)

    // push to servers
    size_t size = send_buf.shape().Size();
    #if MKL_EXPERIMENTAL == 1
    mkl_set_tblob_eager_mode(send_buf.data());
    #endif
    real_t* data = static_cast<real_t*>(send_buf.data().dptr_);

If it is moved inside the lambda, the WaitToRead() call is not necessary.
Otherwise you need to wait to read, since you're accessing send_buf.data().
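
For illustration, a minimal sketch of what moving that chunk inside the lambda could look like (this is not the actual kvstore_dist.h code; the lambda signature and the ps_worker_ call are assumed from the surrounding discussion):

    // Hedged sketch: the pointer is taken inside the lambda, so the host thread
    // no longer needs an explicit send_buf.WaitToRead() before enqueueing.
    auto push_to_servers =
        [this, send_buf](RunContext rctx, Engine::CallbackOnComplete cb) {
          size_t size = send_buf.shape().Size();
    #if MKL_EXPERIMENTAL == 1
          mkl_set_tblob_eager_mode(send_buf.data());
    #endif
          real_t* data = static_cast<real_t*>(send_buf.data().dptr_);
          // ... build the ps keys/lengths and call ps_worker_->ZPush(...),
          // invoking cb() when the send completes.
        };

Since send_buf is declared as a read dependency of the async operation, the engine only runs the lambda once send_buf is ready to read, which is what makes the explicit WaitToRead() redundant.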

@solin319
Contributor Author

  1. Why does adding WaitToRead() block all the param_arrays until all the backward convolution layers have finished? The parameters of one layer cannot be pushed to the kvstore directly after that layer's backward computation.
    We can observe this behavior with the profiler when training an image classification model such as ResNet.
  2. The 'data' used in the lambda is a real_t pointer. In PushAsync, 'send_buf' is passed as a const var, so the engine waits for send_buf to be readable before processing the lambda (see the sketch below). Therefore, when the program processes the lambda, the pointer 'data' can point to send_buf.data(), which can safely be read. Why do we need to add 'send_buf.WaitToRead();' beforehand?
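
For reference, a sketch of the enqueue call implied above, assuming the engine's PushAsync(fn, ctx, const_vars, mutable_vars, ...) signature and the kvstore's pinned_ctx_ member; the const-vars list is the relevant part:

    // Hedged sketch of the call site: send_buf.var() is declared as a read
    // (const) dependency, so the engine will not execute push_to_servers
    // until all pending writes to send_buf have completed.
    Engine::Get()->PushAsync(
        push_to_servers,      // the lambda sketched earlier
        pinned_ctx_,          // execution context (assumed)
        {send_buf.var()},     // const vars: read-only dependencies
        {},                   // mutable vars: none
        FnProperty::kNormal, priority);

This read dependency is what point 2 relies on: the engine already serializes the lambda after any writes to send_buf, while an additional WaitToRead() blocks the calling thread itself.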

@idealboy

Sir, will you instruct me on distributed training on two machines? Thank you very much. I did it according to the official document, but it did not work on two machines.

@eric-haibin-lin
Member

@idealboy There's an example for running distributed training here: https://mxnet.incubator.apache.org/how_to/multi_devices.html
Please post your code / error message if it doesn't work for you.

The original issue should be resolved now with #7489, so I'm closing it for now.

For further discussions/questions, we're moving to https://discuss.mxnet.io/
