Distributed training is slow #7455
Your network seems too small. Is it MNIST? It doesn't make sense to train MNIST on multiple machines.
@piiswrong, I am training Inception-v3 on the ImageNet dataset, so it's not a small network.
How does it look for dist_sync_device and dist_async?
@szha, I have tried dist_sync_device, and I got almost the same result. dist_async uses asynchronous SGD, so I don't think it is directly comparable.
Actually, in "sync" mode the speed is determined by the slowest machine. You can use "async" mode instead; it is an approximation of synchronous SGD, but it can still give a usable result.
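For reference, here is a minimal sketch of how the kvstore type is chosen through the Python API; the key, shape, and values are placeholders, and the distributed types only work inside a job started with tools/launch.py (the image-classification example exposes the choice through its --kv-store flag):

```python
import mxnet as mx

# 'local' runs standalone; in a launched job this would be 'dist_sync',
# 'dist_sync_device' (aggregates gradients on the GPUs), or 'dist_async'.
kv = mx.kvstore.create('local')

shape = (1000, 512)
kv.init(0, mx.nd.zeros(shape))      # register a parameter under key 0

grad = mx.nd.ones(shape)
kv.push(0, grad)                    # send a gradient to the store
weight = mx.nd.zeros(shape)
kv.pull(0, out=weight)              # fetch the merged/updated value
```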
Mind sharing a bit more about the machines (e.g. GPU types, homogeneous or heterogeneous hardware, locality, NVLink, disk I/O speed, etc.)?
@szha, every server has 8 Titan Xp GPUs and 2 Intel Xeon E5-2650 v2 CPUs @ 2.60 GHz.
@starimpact, I have tried using 4 server processes per machine, and I got almost the same result.
I am using MXNet 0.8.0, haha...
Delete "send_buf.WaitToRead();" in line 217 of the file 'src/kvstore/kvstore_dist.h' can solve the problem. |
Following @solin319's suggestion, I changed the code, and now the speed seems normal. Thanks @solin319. @piiswrong @mli, is that a bug in MXNet?
@leoxiaobin the speed-up can be attributed to the deletion of the barrier before the send in kvstore, so it's unlikely to be a symptom of a bug. If you have to stay in the synchronous land, a further increase in batch size could help. Since switching to dist_sync_device doesn't help the speed, I guess the locality assumption for the GPUs doesn't apply, and the bandwidth among the GPUs may even be lower than from your CPU to the other machines (which is through IB). In that case the bottleneck is GPU communication, IB won't help much, and you need a higher ratio of compute to communication.
@szha That is not entirely true. I agree with @solin319 that this WaitToRead should not be necessary (the actual communication is done in the lambda pushed to the engine, which has send_buf as a read dependency, so it will wait for it to be ready). What is more, this barrier basically delays scheduling other copies from GPU to CPU for subsequent communications, thus limiting scaling.
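The same dependency tracking is visible from the Python API. A minimal sketch of the point being made (the arrays are placeholders, not the kvstore internals):

```python
import mxnet as mx

a = mx.nd.ones((2048, 2048))
b = mx.nd.zeros((2048, 2048))

# Both statements return immediately: the engine records that the second
# operation reads `b`, so it is automatically ordered after the write to `b`.
b[:] = a * 2
c = b + 1

# An explicit barrier here, analogous to send_buf.WaitToRead() in
# kvstore_dist.h, would not change the result; it would only block the
# scheduling thread and delay queueing subsequent work.
# b.wait_to_read()

c.wait_to_read()   # block only when the result is actually needed
```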
For context, that barrier was added since an operation such as:
would access memory that is not fully initialized and therefore cause a segfault.
Regarding @madjam's test case: I agree with @ptrendx that we should remove this WaitToRead. One solution is moving https://github.com/madjam/mxnet/blob/0012f7722d97238a84c33f1bee8cd2926707a7e9/src/kvstore/kvstore_dist.h#L221 into the captured function. Can someone help contribute a PR for it?
In MXNet 0.8.0 there is no send_buf.WaitToRead().
Environment info
Operating System: Ubuntu 16.04
Compiler: gcc 5.4
Package used (Python/R/Scala/Julia): Python
MXNet version: latest code, installed from source
MXNet commit hash (git rev-parse HEAD): 1a3faa
Python version and distribution: Python 2.7.13 :: Anaconda custom (64-bit)
I tried to train an image classification model using two servers with InfiniBand cards, but the speed is a bit slow, about the same as using one server. I used the code in example/image-classification.
When training on one server, the training command is
the speed is
When training with 2 servers, the command is
And the speed is
It seems that distributed training with 2 servers is only a little faster than standalone training (600 x 2 = 1200 samples/sec vs. 1000 samples/sec, i.e. about 60% of the ideal 2000 samples/sec).
I tested the IB bandwidth using iperf; it reaches 24.0 Gbit/sec with a single thread, so I think the IB bandwidth is not the bottleneck.
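As a rough sanity check of that claim, here is a back-of-envelope sketch; the ~24M parameter count for Inception-v3 and the one-push-plus-one-pull-per-batch model are assumptions, and aggregation from multiple workers at the server is ignored:

```python
# Estimate per-batch communication time over the measured link.
params = 24e6                        # assumed Inception-v3 parameter count
bytes_per_batch = 2 * params * 4     # push + pull of float32 gradients/weights
link_bytes_per_sec = 24.0e9 / 8      # 24 Gbit/s measured with iperf

comm_ms = 1e3 * bytes_per_batch / link_bytes_per_sec
print("communication per batch: about %.0f ms" % comm_ms)   # ~64 ms
```

At typical ImageNet batch sizes this is only a fraction of the per-batch compute time, which is consistent with the link itself not being the cap on throughput.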
Can anyone give any suggestions about distributed training with MXNet?