
Distributed training is slow #7455

Closed
leoxiaobin opened this issue Aug 14, 2017 · 18 comments · Fixed by #7489

Comments

@leoxiaobin

leoxiaobin commented Aug 14, 2017

Environment info

Operating System: Ubuntu 16.04

Compiler: gcc 5.4

Package used (Python/R/Scala/Julia): Python

MXNet version: latest code

Or if installed from source: installed from source

MXNet commit hash (git rev-parse HEAD): 1a3faa

If you are using python package, please provide

Python version and distribution: Python 2.7.13 :: Anaconda custom (64-bit)

I tried to train an image classification model using two servers with InfiniBand cards, but the speed is a bit slow, about the same as using one server. I used the code from example/image-classification.

When training on one server, the command is

python train_imagenet.py --benchmark 1 --gpus 0,1,2,3,4,5,6,7 --kv-store device --network inception-v3 --batch-size 256   --image-shape 3,299,299

The speed is

INFO:root:start with arguments Namespace(batch_size=256, benchmark=1, data_nthreads=4, data_train=None, data_val=None, disp_batches=20, dtype='float32', gpus='0,1,2,3,4,5,6,7', image_shape='3,299,299', kv_store='device', load_epoch=None, lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='inception-v3', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[22:35:19] src/operator/././cudnn_algoreg-inl.h:112: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[22:35:40] src/kvstore/././comm.h:327: only 24 out of 56 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[22:35:40] src/kvstore/././comm.h:336: .vvv....
[22:35:40] src/kvstore/././comm.h:336: v.vv....
[22:35:40] src/kvstore/././comm.h:336: vv.v....
[22:35:40] src/kvstore/././comm.h:336: vvv.....
[22:35:40] src/kvstore/././comm.h:336: .....vvv
[22:35:40] src/kvstore/././comm.h:336: ....v.vv
[22:35:40] src/kvstore/././comm.h:336: ....vv.v
[22:35:40] src/kvstore/././comm.h:336: ....vvv.
INFO:root:Epoch[0] Batch [20]   Speed: 1065.93 samples/sec      accuracy=0.165365
INFO:root:Epoch[0] Batch [40]   Speed: 1033.22 samples/sec      accuracy=0.989648
INFO:root:Epoch[0] Batch [60]   Speed: 1029.90 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [80]   Speed: 1029.80 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [100]  Speed: 1028.05 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [120]  Speed: 1019.75 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [140]  Speed: 1025.79 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [160]  Speed: 1027.82 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [180]  Speed: 1021.11 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [200]  Speed: 1025.14 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [220]  Speed: 1017.72 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [240]  Speed: 1021.09 samples/sec      accuracy=1.000000
INFO:root:Epoch[0] Batch [260]  Speed: 1024.25 samples/sec      accuracy=1.000000

When training with 2 servers, the command is

 python ../../tools/launch.py -n 2 --launcher ssh -H hosts python train_imagenet.py --benchmark 1 --gpus 0,1,2,3,4,5,6,7 --kv-store dist_sync --network inception-v3 --num-layers 50 --batch-size 256 --sync-dst-dir /tmp/mxnet  --image-shape 3,299,299

And the speed is

INFO:root:Epoch[0] Batch [20]   Speed: 609.31 samples/sec       accuracy=0.056920
INFO:root:Epoch[0] Batch [20]   Speed: 610.12 samples/sec       accuracy=0.050967
INFO:root:Epoch[0] Batch [40]   Speed: 608.68 samples/sec       accuracy=0.854883
INFO:root:Epoch[0] Batch [40]   Speed: 608.19 samples/sec       accuracy=0.868164
INFO:root:Epoch[0] Batch [60]   Speed: 602.48 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [60]   Speed: 603.86 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [80]   Speed: 603.11 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [80]   Speed: 603.87 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [100]  Speed: 607.30 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [100]  Speed: 606.54 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [120]  Speed: 604.53 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [120]  Speed: 602.63 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [140]  Speed: 601.27 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [140]  Speed: 603.67 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [160]  Speed: 603.64 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [160]  Speed: 602.81 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [180]  Speed: 606.20 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [180]  Speed: 606.28 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [200]  Speed: 604.40 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [200]  Speed: 604.28 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [220]  Speed: 605.54 samples/sec       accuracy=1.000000
INFO:root:Epoch[0] Batch [220]  Speed: 605.21 samples/sec       accuracy=1.000000

It seems that with distributed training on 2 servers the speed is only a little better than standalone training (600×2 samples/sec vs. 1000 samples/sec).

I tested the IB bandwidth with iperf; it can reach 24.0 Gbit/s using 1 thread, so I don't think the IB bandwidth is the bottleneck.

Can anyone give any suggestions about distributed training with MXNet?

@piiswrong
Contributor

Your network seems too small. Is it MNIST? It doesn't make sense to train MNIST on multiple machines.

@leoxiaobin
Author

@piiswrong, I am training Inception-v3 on the ImageNet dataset, so it is not a small network.

@szha
Member

szha commented Aug 15, 2017

How does it look for dist_sync_device? Or even dist_async?

@leoxiaobin
Author

@szha, I have tried dist_sync_device and got almost the same result. As for dist_async, it uses asynchronous SGD, so I don't think the results are comparable.

@starimpact
Contributor

starimpact commented Aug 15, 2017

Actually, in "sync" mode the speed is determined by the slowest machine; you can use "async" instead. "async" mode is an approximation of SGD, but it can still give usable results.
And 2 servers per machine is not a good choice; you can try setting more servers on each machine, such as 6. It speeds things up noticeably when the parameters are large.

@szha
Member

szha commented Aug 15, 2017

Mind sharing a bit more about the machines (e.g. GPU types, homogeneous or heterogeneous hardware, locality, NVLink, disk I/O speed, etc.)?

@leoxiaobin
Author

@szha, every server has 8 Titan Xp GPUs and 2 Intel Xeon E5-2650 v2 CPUs @ 2.60 GHz.
The two servers are connected with IB cards.
The test uses the --benchmark 1 configuration, so there is no disk I/O.

@leoxiaobin
Author

@starimpact, I have tried using 4 servers per machine, and I got almost the same result.

@starimpact
Contributor

starimpact commented Aug 15, 2017

I am using MXNet 0.8.0.
I noticed that your "one server" run is actually local, because you set kvstore=device, and that kvstore uses the GPUs to update the parameters.
Your "two server" run is the real distributed mode; in the "dist_*" modes the kvstore uses the CPU to update the parameters.
So the drop in speed you see is normal.

@solin319
Contributor

Delete "send_buf.WaitToRead();" in line 217 of the file 'src/kvstore/kvstore_dist.h' can solve the problem.
The compute can't cover the time of data communicate in backward.

@leoxiaobin
Author

Following @solin319's suggestion, I changed the code, and now the speed looks normal. Thanks @solin319.

@piiswrong @mli, is that a bug in MXNet?

@szha
Member

szha commented Aug 15, 2017

@leoxiaobin the speed-up can be attributed to the deletion of the barrier before the send in kvstore, so it's unlikely to be a symptom of a bug. If you have to stay in synchronous land, further increasing the batch size could help. Since switching to dist_sync_device doesn't help the speed, I guess the locality assumption for the GPUs doesn't apply, and the bandwidth among the GPUs may even be lower than from your CPU to the other machine (which goes through IB). In that case the bottleneck is GPU communication, IB won't help much, and you need a higher ratio of compute to communication.

@ptrendx
Member

ptrendx commented Aug 15, 2017

@szha That is not entirely true. I agree with @solin319 that this WaitToRead should not be necessary (the actual communication is done in the lambda pushed to the engine, which has send_buf as a read dependency, so it will wait for it to be ready). What is more, this call basically delays scheduling other copies from GPU to CPU for subsequent communications, thus limiting scaling.
The PR introducing that line mentions crashes when using kvstore in imperative mode. I'm not really familiar with how much the imperative path differs from the symbolic one as far as the engine is concerned, but I don't think it should differ so much that the dependencies stop working. This is definitely a bug.
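
For readers following along, here is a condensed sketch of the push path being discussed, with simplified and partly hypothetical names (SchedulePush and SendToServers are stand-ins; this is not the exact contents of src/kvstore/kvstore_dist.h). It illustrates why the explicit WaitToRead is redundant for the send itself: the lambda handed to the engine already lists send_buf as a read dependency.

#include <mxnet/base.h>
#include <mxnet/engine.h>
#include <mxnet/ndarray.h>

using namespace mxnet;

// Hypothetical stand-in for the ps-lite ZPush call used in the real code.
void SendToServers(int key, const real_t* data, size_t size);

void SchedulePush(int key, const NDArray& merged, NDArray send_buf, int priority) {
  // Asynchronously copy the merged gradient from the GPU into the pinned CPU buffer.
  CopyFromTo(merged, &send_buf, priority);

  // The call under discussion: it blocks this scheduling thread until the copy
  // above has finished, delaying the GPU->CPU copies for the following gradients.
  send_buf.WaitToRead();

  // Pointer and size are read eagerly here, which is why the blocking call exists.
  const real_t* data = static_cast<const real_t*>(send_buf.data().dptr_);
  const size_t size = send_buf.shape().Size();

  auto push_to_servers = [key, data, size](RunContext rctx,
                                           Engine::CallbackOnComplete on_complete) {
    SendToServers(key, data, size);  // the actual network communication
    on_complete();
  };

  // send_buf.var() is registered as a read dependency, so the engine will not
  // run push_to_servers before the copy into send_buf has completed; for the
  // send itself, the explicit WaitToRead above adds nothing.
  Engine::Get()->PushAsync(push_to_servers, send_buf.ctx(),
                           {send_buf.var()}, {},
                           FnProperty::kNormal, priority);
}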

@szha
Member

szha commented Aug 15, 2017

Thanks, @ptrendx. @madjam for more context.

@madjam
Contributor

madjam commented Aug 15, 2017

For context, that barrier was added since an operation such as:

kv.init(2, mx.nd.zeros((50, 50)))

would access memory that is not fully initialized and therefore cause a segfault.

@mli
Contributor

mli commented Aug 15, 2017

@madjam's test case is that send_buf may not be ready when data() is called.

I agree with @ptrendx that we should remove this WaitToRead. One solution is moving https://github.com/madjam/mxnet/blob/0012f7722d97238a84c33f1bee8cd2926707a7e9/src/kvstore/kvstore_dist.h#L221 into the captured function.

Can someone help contribute a PR for it?
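
Continuing the simplified sketch from the earlier comment (same hypothetical names; this is illustrative only, not the actual patch), the suggested change amounts to capturing the NDArray and reading its pointer inside the lambda, so the blocking WaitToRead on the scheduling thread can be removed:

// send_buf is captured by value, which keeps the buffer alive; its data
// pointer is read only once the engine runs the callback.
auto push_to_servers = [key, send_buf](RunContext rctx,
                                       Engine::CallbackOnComplete on_complete) {
  // The read dependency on send_buf guarantees the GPU->CPU copy has finished
  // by the time this runs, so data() is safe to call here. This should also
  // cover the kv.init case that motivated the original barrier.
  const real_t* data = static_cast<const real_t*>(send_buf.data().dptr_);
  const size_t size = send_buf.shape().Size();
  SendToServers(key, data, size);  // hypothetical stand-in for the ps-lite ZPush
  on_complete();
};

Engine::Get()->PushAsync(push_to_servers, send_buf.ctx(),
                         {send_buf.var()}, {},
                         FnProperty::kNormal, priority);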

mli reopened this Aug 15, 2017
@mli
Contributor

mli commented Aug 15, 2017

#6975

@starimpact
Contributor

starimpact commented Aug 16, 2017

In MXNet 0.8.0 there is no "send_buf.WaitToRead()". Lucky for me. ^_^
https://github.com/starimpact/mxnet_v0.8.0/blob/bProxy_Weight/src/kvstore/kvstore_dist.h#L412
My MXNet fork supports partial parameter updates; you are welcome to use it.
