Multi-GPU Training #21

Closed · xiao1228 opened this issue Oct 2, 2018 · 35 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

@xiao1228 commented Oct 2, 2018

Hi,
Have you tried running training on multiple GPUs? I am getting the error below when I try to do that. Thank you.

Traceback (most recent call last):
  File "train.py", line 194, in <module>
    main(opt)
  File "train.py", line 128, in main
    loss = model(imgs, targets, requestPrecision=True)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
    raise output
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
    output = module(*input, **kwargs)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'
@glenn-jocher (Member) commented:

@xiao1228 The requirements clearly state Python 3.6. I'd advise you to follow them.

Multi-GPU training is still a work in progress. If you could help debug this after upgrading your Python that would be great!

glenn-jocher changed the title from "training on multiple gpu" to "Multi-GPU Training" on Oct 2, 2018
@xiao1228 (Author) commented Oct 3, 2018

Hi, I am getting the same error after upgrading to Python 3.6. I will work on it; if I can fix it, I will update you.

@xiao1228 (Author) commented:

I get the error below when trying to move all the variables to CUDA:

utils/utils.py", line 293, in build_targets
TP[b, i] = (pconf > 0.5) & (iou_pred > 0.5) & (pcls == tc)
RuntimeError: Assertion `THCTensor_(checkGPU)(state, 3, self_, src1, src2)' failed. at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:688
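
For reference, this assertion usually means the tensors being combined are not all on the same device. A minimal sketch of that kind of fix, reusing the tensor names from the line above (everything here is illustrative, not the repo's actual build_targets code):

import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Illustrative placeholders standing in for the tensors used in build_targets().
pconf = torch.rand(8, device=device)
iou_pred = torch.rand(8, device=device)
pcls = torch.randint(0, 80, (8,), device=device)
tc = torch.randint(0, 80, (8,))            # accidentally still on the CPU

# Elementwise ops need all operands on the same device, so move stragglers over
# before combining the masks.
tc = tc.to(device)
mask = (pconf > 0.5) & (iou_pred > 0.5) & (pcls == tc)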

@zhaoyang10 commented:

@xiao1228 Have you solved your first problem in this issue? I want to make the code multi-GPU and ran into it too. I'm still confused. Thank you.

@glenn-jocher (Member) commented:

@xiao1228 @zhaoyang10 the code does not support multi-GPU yet unfortunately. I only have a single-GPU machine so I have not been able to debug this issue. If you come up with a solution please advise me, or submit a pull request. Many thanks!!

@glenn-jocher (Member) commented:

I've changed the code to raise an error when multi-GPU operation is attempted, until this is resolved.

yolov3/train.py, lines 60 to 63 in af0033c:

if torch.cuda.device_count() > 1:
    raise Exception('Multi-GPU not currently supported: https://github.com/ultralytics/yolov3/issues/21')
    # print('Using ', torch.cuda.device_count(), ' GPUs')
    # model = nn.DataParallel(model)

@alexpolichroniadis commented Mar 7, 2019

I've added multi-GPU training support to the pipeline.

See here: #121

@door5719 commented Mar 8, 2019

@alexpolichroniadis, I ran your code and it may not work on 4 x 1080 Ti...

@alexpolichroniadis commented:

Tested on 8x1080tis.

What's the trace?

@door5719 commented Mar 8, 2019

Epoch Batch xy wh conf cls total nTargets time
C:\Users\NJ\Anaconda3\lib\site-packages\torch\nn\parallel\_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
C:\Users\NJ\Anaconda3\lib\site-packages\torch\cuda\nccl.py:24: UserWarning: PyTorch is not compiled with NCCL support
warnings.warn('PyTorch is not compiled with NCCL support')

    0/99      0/3643      1.13      5.46       555      13.3       575   1.6e+03      26.3
    0/99      1/3643      1.65       8.2       833        20       863   3.2e+03      3.28
    0/99      2/3643      2.19      11.2  1.11e+03      26.7  1.15e+03   4.8e+03      1.77
    0/99      3/3643      2.72      14.1  1.39e+03      33.3  1.44e+03   6.4e+03      1.77
    0/99      4/3643      3.25      17.1  1.67e+03        40  1.73e+03     8e+03      1.82
    0/99      5/3643      3.78      20.1  1.94e+03      46.6  2.01e+03   9.6e+03      1.64
    0/99      6/3643      4.31      23.1  2.22e+03      53.3   2.3e+03  1.12e+04      1.66
    0/99      7/3643      4.83      26.1   2.5e+03      59.9  2.59e+03  1.28e+04      1.73
    0/99      8/3643      5.35      29.1  2.78e+03      66.6  2.88e+03  1.44e+04      1.81
    0/99      9/3643      5.87      32.1  3.05e+03      73.2  3.16e+03   1.6e+04      1.62
    0/99     10/3643       6.4      35.1  3.33e+03      79.9  3.45e+03  1.76e+04      1.74
    0/99     11/3643      6.92      38.1  3.61e+03      86.6  3.74e+03  1.92e+04      1.65
    0/99     12/3643      7.45      41.1  3.89e+03      93.2  4.03e+03  2.08e+04      1.73
....
....
    0/99    333/3643       176       830  5.25e+04   2.2e+03  5.57e+04  5.34e+05      1.88
    0/99    334/3643       177       831  5.25e+04  2.21e+03  5.57e+04  5.36e+05       1.7
    0/99    335/3643       177       833  5.25e+04  2.21e+03  5.58e+04  5.38e+05      1.79
    0/99    336/3643       178       835  5.26e+04  2.22e+03  5.58e+04  5.39e+05       1.8
    0/99    337/3643       178       836  5.26e+04  2.23e+03  5.59e+04  5.41e+05      1.68
    0/99    338/3643       179       838  5.27e+04  2.23e+03  5.59e+04  5.42e+05      1.85
    0/99    339/3643       179       840  5.27e+04  2.24e+03   5.6e+04  5.44e+05      1.91
    0/99    340/3643       180       841  5.27e+04  2.24e+03   5.6e+04  5.46e+05      1.84
    0/99    341/3643       180       843  5.28e+04  2.25e+03   5.6e+04  5.47e+05      1.83
    0/99    342/3643       181       845  5.28e+04  2.26e+03  5.61e+04  5.49e+05       1.7
    0/99    343/3643       181       846  5.29e+04  2.26e+03  5.61e+04   5.5e+05      1.95
    0/99    344/3643       182       848  5.29e+04  2.27e+03  5.62e+04  5.52e+05      1.89

loss is growing...

@alexpolichroniadis commented:

Looks like it's working. You are training and batches are being pushed through your model. The lack of NCCL support is a problem with your installation of PyTorch, not the code of this repo. The UserWarning can be ignored.

@alexpolichroniadis commented Mar 8, 2019

Also, what batch size did you specify when running train.py (--batch-size)?

@door5719 commented Mar 8, 2019

parser.add_argument('--epochs', type=int, default=100, help='number of epochs')
parser.add_argument('--batch-size', type=int, default=32, help='size of each image batch')
parser.add_argument('--accumulated-batches', type=int, default=1, help='number of batches before optimizer step')
parser.add_argument('--cfg', type=str, default='cfg/yolov3.cfg', help='cfg file path')
parser.add_argument('--data-cfg', type=str, default='cfg/coco.data', help='coco.data file path')
parser.add_argument('--multi-scale', action='store_true', help='random image sizes per batch 320 - 608')
parser.add_argument('--img-size', type=int, default=32 * 13, help='pixels')
parser.add_argument('--resume', action='store_true', help='resume training flag')
parser.add_argument('--num-workers', type=int, default=0, help='number of workers for dataloader')
parser.add_argument('--var', type=float, default=0, help='test variable')

@door5719 commented Mar 8, 2019

@alexpolichroniadis thank you for your reply :)
I use the default params but modified the batch size.

@alexpolichroniadis commented Mar 8, 2019

> @alexpolichroniadis thank you for your reply :)
> I use the default params but modified the batch size.

Looking at your output, it looks like you pulled an earlier commit from that PR, based on the exploding loss report I'm seeing. I fixed that today in the latest commit. Are you sure you are working off the latest commit of that PR?

@door5719 commented Mar 8, 2019

Today I just downloaded it from https://github.com/alexpolichroniadis/yolov3

@door5719 commented Mar 8, 2019

Maybe using https://github.com/alexpolichroniadis/yolov3/tree/multigpu would be better; I will try it.

@alexpolichroniadis commented:

> Maybe using https://github.com/alexpolichroniadis/yolov3/tree/multigpu would be better; I will try it.

Yes, that is the correct branch to work off. Master still has an older version.

@door5719 commented Mar 8, 2019

This code runs like this:
219 module.104.batch_norm_104.bias True 256 [256] 0 0
220 module.105.conv_105.weight True 65280 [255, 256, 1, 1] 0.000114 0.0362
221 module.105.conv_105.bias True 255 [255] -0.00154 0.036
Model Summary: 222 layers, 6.19491e+07 parameters, 6.19491e+07 gradients

Epoch Batch xy wh conf cls total nTargets time

and then it has stayed in this state until now. It looks like the program may not be running.

@alexpolichroniadis commented:

> This code runs like this:
> 219 module.104.batch_norm_104.bias True 256 [256] 0 0
> 220 module.105.conv_105.weight True 65280 [255, 256, 1, 1] 0.000114 0.0362
> 221 module.105.conv_105.bias True 255 [255] -0.00154 0.036
> Model Summary: 222 layers, 6.19491e+07 parameters, 6.19491e+07 gradients
>
> Epoch Batch xy wh conf cls total nTargets time
>
> and then it has stayed in this state until now. It looks like the program may not be running.

I noticed that you are running this code on Windows. Keep in mind that PyTorch's DataParallel might not be operational on Windows machines due to lack of NCCL support (see here). My testing was on an Ubuntu machine.

@door5719 commented Mar 8, 2019

@alexpolichroniadis, thanks for your help, the code works well now :)

@glenn-jocher (Member) commented Mar 8, 2019

@alexpolichroniadis I get an error when I run the following on a GCP PyTorch instance with 2 GPUs. I noticed you changed coco.data from the darknet default, so I updated this to point back to the default, and this fixed the error.

sudo rm -rf yolov3 && git clone -b multigpu --depth 1 https://github.com/alexpolichroniadis/yolov3
cd yolov3 && python3 train.py

Namespace(accumulated_batches=1, batch_size=16, cfg='cfg/yolov3.cfg', data_cfg='cfg/coco.data', epochs=100, img_size=416, multi_scale=False, num_workers=0, resume=False, var=0)
Using CUDA. Available devices: 
0 - Tesla P100-PCIE-16GB - 16280MB
1 - Tesla P100-PCIE-16GB - 16280MB
Traceback (most recent call last):
  File "train.py", line 234, in <module>
    var=opt.var,
  File "train.py", line 46, in train
    train_loader = ImageLabelDataset(train_path, batch_size, img_size, multi_scale=multi_scale, augment=True)
  File "/home/ultralytics/yolov3/utils/datasets.py", line 105, in __init__
    for x in self.img_files]
AttributeError: 'ImageLabelDataset' object has no attribute 'img_files'

Now I see a separate problem though: there doesn't appear to be any speedup. A single P100 takes about 0.6 s per batch, the same as 2 P100s here:

   Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
/opt/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
    0/99      0/7327      0.51      2.76       277      6.66       287       121         7
    0/99      1/7327      0.52       2.7       278      6.65       287        99     0.712
    0/99      2/7327     0.539      2.83       278      6.65       288       143     0.631
    0/99      3/7327     0.543      2.83       278      6.64       288       123     0.608
...

@alexpolichroniadis commented Mar 8, 2019

What's your batch size? If your batch fits on one GPU, then (in most cases) you are better off using a single GPU. The benefit of a multi-GPU setup is cranking up the batch size and having more images processed in the same amount of time. Try setting your --batch-size to 128 (or something beyond what a single GPU can handle), for example, and re-testing.

Since batching happens on the CPU, there are also cases where the CPU then becomes the bottleneck (the GPUs wait for the batch to be created). This becomes more apparent with large batch sizes. Overall, there is a balance to be found, and it is not immediately apparent.

One other thing: with nn.DataParallel, each GPU is first loaded with its own copy of the model. This happens on the first batch and is reflected in the higher time reported for that batch.
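
For readers following along, a minimal sketch of the nn.DataParallel wrapping being discussed (the tiny model and shapes are placeholders, not the PR's actual code):

import torch
import torch.nn as nn

# Tiny placeholder network standing in for Darknet(cfg); the wrapper is the point.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 8, 3, padding=1))

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
if torch.cuda.device_count() > 1:
    # DataParallel copies the model to every visible GPU on the first forward
    # pass (hence the slower first batch) and splits the input along dim 0,
    # so the speedup only appears with batches larger than one GPU can hold.
    model = nn.DataParallel(model)
model = model.to(device)

imgs = torch.randn(64, 3, 64, 64)   # the batch is still assembled on the CPU
out = model(imgs.to(device))        # scatter -> per-GPU forward -> gather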

@alexpolichroniadis commented Mar 8, 2019

An example: on my setup, for a batch size of 128, processing time per batch is 1 s; for a batch size of 256, it is 1.6 s (all cases with DataParallel on). On a single 1080 Ti, a batch size of 128 is not doable.

@glenn-jocher (Member) commented Mar 16, 2019

@door5719 @alexpolichroniadis thanks for the info. We started on our own multi_gpu branch (https://github.com/ultralytics/yolov3/tree/multi_gpu), with a secondary goal of trying out a different loss approach, selecting a single anchor from the 9 available for each target. The new loss produced significantly worse results, so it appears the current method of selecting one anchor from each yolo layer is correct. In the process we did get multi_gpu operational, though not with the speedups expected. We did not attempt to use a multithreaded PyTorch dataloader, nor PIL in place of OpenCV, as we found both of these slower in our single-GPU profiling last year.

We don't have multi-GPU machines on premises, so we tested this with GCP Deep Learning VMs. We used batch_size=26 (the max that 1 P100 can handle) times the number of GPUs. All other training settings were defaults. We selected the fastest batch out of the first 30 for timing purposes. Results are below for our branch and the #121 PR. In both cases the speedups were very poor. It's possible the IO ops were constrained by GCP due to the limited SSD size; we will try again with a larger SSD, but we wanted to get these results out here for feedback. If anyone has another repo or PR we can compare against, please let us know!

https://cloud.google.com/deep-learning-vm/
Machine type: n1-highmem-4 (4 vCPUs, 26 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 500 GB SSD

GPUs (P100)   batch_size (images)   yolov3/tree/multi_gpu (s/batch)   yolov3/pull/121 (s/batch)
1             26                    0.91s                             1.05s
2             52                    1.60s                             1.76s
4             104                   2.26s                             2.81s

@LightToYang commented:

I think torch.nn.parallel.DistributedDataParallel is better than nn.DataParallel. The use of DataParallel is probably the bottleneck.
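
For reference, a minimal single-node DistributedDataParallel sketch (the model and training step are placeholders, not this repo's code; assumes launching with torchrun or the older torch.distributed.launch, one process per GPU):

# Minimal DDP sketch; launch with: torchrun --nproc_per_node=4 ddp_sketch.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ.get('LOCAL_RANK', 0))   # set by the launcher
    dist.init_process_group(backend='nccl')             # one process per GPU
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(10, 2).cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # In real training a DistributedSampler would give each rank its own data shard.
    x = torch.randn(32, 10).cuda(local_rank)
    loss = model(x).sum()
    loss.backward()                                      # gradients are all-reduced here
    opt.step()
    dist.destroy_process_group()

if __name__ == '__main__':
    main()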

glenn-jocher mentioned this issue Mar 17, 2019
@longxianlei commented:

Because box2 is a torch.FloatTensor (anchor_vec is on the CPU) while box1 is on the GPU, you can just use .cuda() to turn it into a torch.cuda.FloatTensor:

box2 = anchor_vec.cuda().unsqueeze(1)
inter_area = torch.min(box1, box2).prod(2)

But once you fix this, the lines below will also throw errors:

txy[b, a, gj, gi] = gxy - gxy.floor()
# Width and height
twh[b, a, gj, gi] = torch.log(gwh / anchor_vec[a])

You need to move the data to the GPU/CUDA device according to the error info.

However, the main problem for multi-GPU training lies in:

for i, (imgs, targets, _, _) in enumerate(dataloader):

Here imgs is a tensor but targets is a list. When running in parallel, imgs.to(device) is split into batch_size / GPU_num chunks, but targets cannot be moved with targets.to(device) (since it is a list), and it still has batch_size entries, so it cannot be distributed across the GPUs. Then in:

if nM > 0:
    lxy = k * MSELoss(xy[mask], txy[mask])
    lwh = k * MSELoss(wh[mask], twh[mask])

xy and wh have the per-GPU size (batch_size / GPU_num), but txy and twh are built from the full set of targets (batch_size), so the dimensions no longer match and an error occurs.
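
A side note on the .cuda() fix above: with nn.DataParallel each replica runs on its own device, so hard-coding .cuda() pins the anchors to GPU 0. A sketch of a more device-agnostic pattern (illustrative class and shapes, not the repo's actual YOLO layer) is to register the anchors as a module buffer, or move them to the incoming tensor's device:

import torch
import torch.nn as nn

class YOLOLayerSketch(nn.Module):
    """Illustrative stand-in for a YOLO layer that keeps its anchors as a buffer."""
    def __init__(self, anchors):
        super().__init__()
        # Buffers are placed on each replica's device automatically by DataParallel.
        self.register_buffer('anchor_vec', torch.tensor(anchors, dtype=torch.float32))

    def forward(self, box1):
        # Alternatively: box2 = self.anchor_vec.to(box1.device).unsqueeze(1)
        box2 = self.anchor_vec.unsqueeze(1)
        inter_area = torch.min(box1, box2).prod(2)
        return inter_area

layer = YOLOLayerSketch([[10., 13.], [16., 30.], [33., 23.]])
box1 = torch.rand(1, 4, 2)          # illustrative shapes only
print(layer(box1).shape)            # broadcasted min over anchors and boxes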

@glenn-jocher (Member) commented:

@longxianlei we just PR'd our under-development multi_gpu branch into the master branch, so multi-GPU functionality now works. Many of the items you raised above should be resolved. Can you try the latest commit and see if it works for you? See #135 for more info.

@alexpolichroniadis commented:

@glenn-jocher keep in mind that batch sizes should be integer multiples of the number of available GPUs. For a batch size of 26 on 4 GPUs, you are essentially pushing 26//4 = 6 images on all GPUs and the two remaining ones are pushed on the last GPU. This is unbalanced as each GPU processes batch sizes of 6/6/6/8.

The ideal batch size to test here would be 4*6 = 24, and multiples of 24 thereafter. It is also true that the actual bottleneck might be IO at this point.
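
A quick standalone way to see how a given batch size divides across GPUs (torch.chunk splits along dim 0 into approximately equal pieces, which is broadly how DataParallel's scatter divides the input batch):

import torch

n_gpus = 4
for batch_size in (24, 26):
    # Per-GPU chunk sizes for this batch size.
    sizes = [c.shape[0] for c in torch.arange(batch_size).chunk(n_gpus)]
    print(batch_size, '->', sizes)   # 24 -> [6, 6, 6, 6]; 26 splits unevenly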

@glenn-jocher (Member) commented:

Updated times with batch_size=24, and comparison to existing study.

https://cloud.google.com/deep-learning-vm/
Machine type: n1-highmem-4 (4 vCPUs, 26 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 100 GB SSD

GPUs (P100)   batch_size (images)   613ce1b (s/batch)   COCO epoch (min/epoch)
1             24                    0.84s               70min
2             48                    1.27s               53min
4             96                    2.11s               44min

Comparison results from https://github.com/ilkarman/DeepLearningFrameworks
[Screenshot of comparison results, 2019-03-19]

@glenn-jocher (Member) commented:

@alexpolichroniadis, @longxianlei, @LightToYang Great news! Lack of multithreading in the dataloader was slowing down multi-GPU significantly (#141). I reimplemented support for DataLoader multithreading, and speeds have improved greatly (more than double in some cases). The new test results are below for the latest commit.

https://cloud.google.com/deep-learning-vm/
Machine type: n1-standard-8 (8 vCPUs, 30 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 100 GB SSD

GPUs (P100)   batch_size (images)   speed (s/batch)   COCO epoch (min/epoch)
1             16                    0.39s             48min
2             32                    0.48s             29min
4             64                    0.65s             20min
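
For context, the change behind these numbers is the dataloader's worker count; a generic sketch of a multi-worker DataLoader (dummy dataset, not this repo's dataset class):

import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Dummy dataset standing in for the repo's image/label dataset.
    dataset = TensorDataset(torch.randn(512, 3, 64, 64), torch.randint(0, 80, (512,)))

    # num_workers > 0 spawns background worker processes, so CPU-side batching
    # (loading, augmentation, collation) overlaps with GPU compute; pin_memory
    # speeds up the host-to-device copy of each batch.
    loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4, pin_memory=True)

    for imgs, labels in loader:
        pass  # the training step would go here

if __name__ == '__main__':   # guard required on platforms that spawn workers
    main()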

@alexpolichroniadis commented Mar 21, 2019

@glenn-jocher I never noticed that the dataloader's num_workers defaults to 0 because I set it manually all the time, whoops. 😅

Good results indeed. In line with what I was getting.

@jarunm commented Aug 13, 2020

Had the same issue, but I've used this repo on multi-GPU before and it worked well. Somebody had posted that the batch size in the last iteration might be smaller than the batch size given during training, so I removed a few images to make the number of validation images a multiple of 8 (I'd given 8 as my batch size during training), and that solved the issue.
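
An alternative to trimming images (a generic PyTorch sketch, not necessarily what this repo does) is to drop the ragged final batch at the loader level with drop_last:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3, 32, 32))   # 100 is not a multiple of 8

# drop_last=True discards the ragged final batch instead of requiring the
# validation set size to be an exact multiple of the batch size.
loader = DataLoader(dataset, batch_size=8, drop_last=True)
print(sum(1 for _ in loader))   # 12 full batches; the trailing 4 images are skipped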

@Venky0892 commented:

For multi-GPU runs:
Train it first on 1 GPU for about 1000 iterations: darknet.exe detector train cfg/coco.data cfg/yolov4.cfg yolov4.conv.137

Then stop, and using the partially trained model /backup/yolov4_1000.weights, run training with multi-GPU (up to 4 GPUs): darknet.exe detector train cfg/coco.data cfg/yolov4.cfg /backup/yolov4_1000.weights -gpus 0,1,2,3

If you get a NaN, then for some datasets it is better to decrease the learning rate; for 4 GPUs set learning_rate = 0.00065 (i.e. learning_rate = 0.00261 / GPUs). In this case also increase burn_in 4x in your cfg file, i.e. use burn_in = 4000 instead of 1000.
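
As a quick sanity check of the scaling rule above (plain Python arithmetic using the values quoted in the comment):

# Scaling rule from the comment above, as plain arithmetic.
base_lr, base_burn_in, gpus = 0.00261, 1000, 4

learning_rate = base_lr / gpus       # ~0.00065, as quoted above
burn_in = base_burn_in * gpus        # 4000

print(f'learning_rate={learning_rate:.5f}  burn_in={burn_in}')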

@glenn-jocher (Member) commented:

@Venky0892 thanks for sharing your approach! It's always helpful to hear about different strategies for addressing issues with multi-GPU training. The community's diverse experiences and insights contribute greatly to refining best practices. Your suggestions will certainly benefit others who may encounter similar challenges during their training process.
