
Distributed training is slow #7455

Closed
leoxiaobin opened this issue Aug 14, 2017 · 18 comments · Fixed by #7489

Comments

@leoxiaobin

leoxiaobin commented Aug 14, 2017

Environment info

Operating System: Ubuntu 16.04

Compiler: gcc 5.4

Package used (Python/R/Scala/Julia): Python

MXNet version: latest code

Or if installed from source: installed from source

MXNet commit hash (git rev-parse HEAD): 1a3faa

If you are using python package, please provide

Python version and distribution: Python 2.7.13 :: Anaconda custom (64-bit)

I tried to train an image classification model using two servers with InfiniBand cards, but the speed is a bit slow, about the same as using one server. I used the code from example/image-classification.

When training on one server, the command is

python train_imagenet.py --benchmark 1 --gpus 0,1,2,3,4,5,6,7 --kv-store device --network inception-v3 --batch-size 256   --image-shape 3,299,299

The speed is

INFO:root:start with arguments Namespace(batch_size=256, benchmark=1, data_nthreads=4, data_train=None, data_val=None, disp_batches=20, dtype='float32', gpus='0,1,2,3,4,5,6,7', image_shape='3,299,299', kv_store='device', load_epoch=None, lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='inception-v3', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[22:35:19] src/operator/././cudnn_algoreg-inl.h:112: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[22:35:40] src/kvstore/././comm.h:327: only 24 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[22:35:40] src/kvstore/././comm.h:336: .vvv....
[22:35:40] src/kvstore/././comm.h:336: v.vv....
[22:35:40] src/kvstore/././comm.h:336: vv.v....
[22:35:40] src/kvstore/././comm.h:336: vvv.....
[22:35:40] src/kvstore/././comm.h:336: .....vvv
[22:35:40] src/kvstore/././comm.h:336: ....v.vv
[22:35:40] src/kvstore/././comm.h:336: ....vv.v
[22:35:40] src/kvstore/././comm.h:336: ....vvv.
INFO:root:Epoch[0] Batch [20]   Speed: 1065.93 samples/sec      accuracy=0.165365
INFO:root:Epoch[0] Batch [40]   Speed: 1033.22 samples/sec      accuracy=0.989648
INFO:root:Epoch[0] Batch [60]   Speed: 1029.90 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [80]   Speed: 1029.80 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [100]  Speed: 1028.05 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [120]  Speed: 1019.75 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [140]  Speed: 1025.79 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [160]  Speed: 1027.82 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [180]  Speed: 1021.11 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [200]  Speed: 1025.14 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [220]  Speed: 1017.72 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [240]  Speed: 1021.09 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [260]  Speed: 1024.25 samples/sec      accuracy=1.000000

When training with 2 servers, the command is

 python ../../tools/launch.py -n 2 --launcher ssh -H hosts python train_imagenet.py --benchmark 1 --gpus 0,1,2,3,4,5,6,7 --kv-store dist_sync --network inception-v3 --num-layers 50 --batch-size 256 --sync-dst-dir /tmp/mxnet  --image-shape 3,299,299

And the speed is

INFO:root:Epoch[0] Batch [20]   Speed: 609.31 samples/sec       accuracy=0.056920
INFO:root:Epoch[0] Batch [20]   Speed: 610.12 samples/sec       accuracy=0.050967
INFO:root:Epoch[0] Batch [40]   Speed: 608.68 samples/sec       accuracy=0.854883
INFO:root:Epoch[0] Batch [40]   Speed: 608.19 samples/sec       accuracy=0.868164
INFO:root:Epoch[0] Batch [60]   Speed: 602.48 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [60]   Speed: 603.86 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [80]   Speed: 603.11 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [80]   Speed: 603.87 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [100]  Speed: 607.30 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [100]  Speed: 606.54 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [120]  Speed: 604.53 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [120]  Speed: 602.63 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [140]  Speed: 601.27 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [140]  Speed: 603.67 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [160]  Speed: 603.64 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [160]  Speed: 602.81 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [180]  Speed: 606.20 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [180]  Speed: 606.28 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [200]  Speed: 604.40 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [200]  Speed: 604.28 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [220]  Speed: 605.54 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [220]  Speed: 605.21 samples/sec       accuracy=1.000000

It seems that with distributed training on 2 servers the speed is only a little better than standalone training (600×2 samples/sec vs. 1000 samples/sec).

I tested the IB bandwidth with iperf; it can reach 24.0 Gbit/s using 1 thread, so I don't think the IB bandwidth is the bottleneck.

Can anyone give any suggestions about distributed training with MXNet?

@piiswrong
Contributor

Your network seems too small. Is it MNIST? It doesn't make sense to train MNIST on multiple machines.

@leoxiaobin
Author

@piiswrong, I am training Inception-v3 on the ImageNet dataset, so it is not a small network.

@szha
Member

szha commented Aug 15, 2017

How does it look for dist_sync_device? Or even dist_async?

@leoxiaobin
Author

@szha, I have tried dist_sync_device and got almost the same result. As for dist_async, it uses asynchronous SGD, so I don't think the results are comparable.

@starimpact
Contributor

starimpact commented Aug 15, 2017

Actually, in "sync" mode the speed is determined by the slowest machine; you can use "async" instead. "async" mode is an approximation of SGD, but it can still give usable results.
And 2 servers per machine is not a good choice; you can try setting more servers on each machine, such as 6. It speeds things up noticeably when the parameters are large.

@szha
Member

szha commented Aug 15, 2017

Mind sharing a bit more about the machines (e.g. GPU types, homogeneous or heterogeneous hardware, locality, NVLink, disk I/O speed, etc.)?

@leoxiaobin
Author

@szha, every server has 8 Titan Xp GPUs and 2 Intel Xeon E5-2650 v2 CPUs @ 2.60 GHz.
The two servers are connected with IB cards.
The test uses the --benchmark 1 configuration, so there is no disk I/O.

@leoxiaobin
Author

@starimpact, I have tried using 4 servers per machine, and I got almost the same result.

@starimpact
Contributor

starimpact commented Aug 15, 2017

I am using MXNet 0.8.0.
I noticed that your "one server" run is actually local, because you set kvstore=device, and that kvstore uses the GPUs to update the parameters.
Your "two server" run is the real distributed mode; in the "dist_*" modes the kvstore uses the CPU to update the parameters.
So the drop in speed you see is normal.

@solin319
Contributor

Delete "send_buf.WaitToRead();" in line 217 of the file 'src/kvstore/kvstore_dist.h' can solve the problem.
The compute can't cover the time of data communicate in backward.

@leoxiaobin
Author

Following @solin319's suggestion, I changed the code, and now the speed looks normal. Thanks @solin319.

@piiswrong @mli, is that a bug in MXNet?

@szha
Member

szha commented Aug 15, 2017

@leoxiaobin the speed-up can be attributed to the deletion of the barrier before the send in kvstore, so it's unlikely to be a symptom of a bug. If you have to stay in synchronous land, further increasing the batch size could help. Since switching to dist_sync_device doesn't help the speed, I guess the locality assumption for the GPUs doesn't apply, and the bandwidth among the GPUs may even be lower than from your CPU to the other machine (which goes through IB). In that case the bottleneck is GPU communication, IB won't help much, and you need a higher ratio of compute to communication.

@ptrendx
Member

ptrendx commented Aug 15, 2017

@szha That is not entirely true. I agree with @solin319 that this WaitToRead should not be necessary (the actual communication is done in the lambda pushed to the engine, which has send_buf as a read dependency, so it will wait for it to be ready). What is more, this call basically delays scheduling other copies from GPU to CPU for subsequent communications, thus limiting scaling.
The PR introducing that line mentions crashes when using kvstore in imperative mode. I'm not really familiar with how much the imperative path differs from the symbolic one as far as the engine is concerned, but I don't think it should differ so much that the dependencies stop working. This is definitely a bug.
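
For readers following along, here is a condensed sketch of the push path being discussed, with simplified and partly hypothetical names (SchedulePush and SendToServers are stand-ins; this is not the exact contents of src/kvstore/kvstore_dist.h). It illustrates why the explicit WaitToRead is redundant for the send itself: the lambda handed to the engine already lists send_buf as a read dependency.

#include <mxnet/base.h>
#include <mxnet/engine.h>
#include <mxnet/ndarray.h>

using namespace mxnet;

// Hypothetical stand-in for the ps-lite ZPush call used in the real code.
void SendToServers(int key, const real_t* data, size_t size);

void SchedulePush(int key, const NDArray& merged, NDArray send_buf, int priority) {
  // Asynchronously copy the merged gradient from the GPU into the pinned CPU buffer.
  CopyFromTo(merged, &send_buf, priority);

  // The call under discussion: it blocks this scheduling thread until the copy
  // above has finished, delaying the GPU->CPU copies for the following gradients.
  send_buf.WaitToRead();

  // Pointer and size are read eagerly here, which is why the blocking call exists.
  const real_t* data = static_cast<const real_t*>(send_buf.data().dptr_);
  const size_t size = send_buf.shape().Size();

  auto push_to_servers = [key, data, size](RunContext rctx,
                                           Engine::CallbackOnComplete on_complete) {
    SendToServers(key, data, size);  // the actual network communication
    on_complete();
  };

  // send_buf.var() is registered as a read dependency, so the engine will not
  // run push_to_servers before the copy into send_buf has completed; for the
  // send itself, the explicit WaitToRead above adds nothing.
  Engine::Get()->PushAsync(push_to_servers, send_buf.ctx(),
                           {send_buf.var()}, {},
                           FnProperty::kNormal, priority);
}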

@szha
Member

szha commented Aug 15, 2017

Thanks, @ptrendx. @madjam for more context.

@madjam
Contributor

madjam commented Aug 15, 2017

For context, that barrier was added since an operation such as:

kv.init(2, mx.nd.zeros((50, 50)))

would access memory that is not fully initialized and therefore cause a segfault.

@mli
Contributor

mli commented Aug 15, 2017

@madjam's test case is that send_buf may not be ready when data() is called.

I agree with @ptrendx that we should remove this WaitToRead. One solution is moving https://github.com/madjam/mxnet/blob/0012f7722d97238a84c33f1bee8cd2926707a7e9/src/kvstore/kvstore_dist.h#L221 into the captured function.

Can someone help contribute a PR for it?
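
Continuing the simplified sketch from the earlier comment (same hypothetical names; this is illustrative only, not the actual patch), the suggested change amounts to capturing the NDArray and reading its pointer inside the lambda, so the blocking WaitToRead on the scheduling thread can be removed:

// send_buf is captured by value, which keeps the buffer alive; its data
// pointer is read only once the engine runs the callback.
auto push_to_servers = [key, send_buf](RunContext rctx,
                                       Engine::CallbackOnComplete on_complete) {
  // The read dependency on send_buf guarantees the GPU->CPU copy has finished
  // by the time this runs, so data() is safe to call here. This should also
  // cover the kv.init case that motivated the original barrier.
  const real_t* data = static_cast<const real_t*>(send_buf.data().dptr_);
  const size_t size = send_buf.shape().Size();
  SendToServers(key, data, size);  // hypothetical stand-in for the ps-lite ZPush
  on_complete();
};

Engine::Get()->PushAsync(push_to_servers, send_buf.ctx(),
                         {send_buf.var()}, {},
                         FnProperty::kNormal, priority);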

mli reopened this Aug 15, 2017
@mli
Contributor

mli commented Aug 15, 2017

#6975

@starimpact
Contributor

starimpact commented Aug 16, 2017

In MXNet 0.8.0 there is no "send_buf.WaitToRead()". Lucky for me. ^_^
https://github.com/starimpact/mxnet_v0.8.0/blob/bProxy_Weight/src/kvstore/kvstore_dist.h#L412
My MXNet fork supports partial parameter updates; you are welcome to use it.
