Multi-GPU Training #21

Closed · xiao1228 opened this issue Oct 2, 2018 · 35 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

@xiao1228 commented Oct 2, 2018

Hi,
Have you tried running training on multiple GPUs? I am getting the error below when I try to do that. Thank you.

Traceback (most recent call last):
  File "train.py", line 194, in <module>
    main(opt)
  File "train.py", line 128, in main
    loss = model(imgs, targets, requestPrecision=True)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
    raise output
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
    output = module(*input, **kwargs)
  File "/opt/anaconda/envs/pytorch_p35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 1 required positional argument: 'x'
@glenn-jocher (Member) commented:

@xiao1228 The requirements clearly state Python 3.6. I'd advise you to follow them.

Multi-GPU training is still a work in progress. If you could help debug this after upgrading your Python that would be great!

glenn-jocher changed the title from "training on multiple gpu" to "Multi-GPU Training" on Oct 2, 2018
@xiao1228 (Author) commented Oct 3, 2018

Hi, I am getting the same error after upgrading to Python 3.6. I will work on it; if I can fix it, I will update you.

@xiao1228 (Author) commented:

I get the error below when trying to move all the variables to CUDA:

utils/utils.py", line 293, in build_targets
TP[b, i] = (pconf > 0.5) & (iou_pred > 0.5) & (pcls == tc)
RuntimeError: Assertion `THCTensor_(checkGPU)(state, 3, self_, src1, src2)' failed. at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:688
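
For reference, this assertion usually means the tensors being combined are not all on the same device. A minimal sketch of that kind of fix, reusing the tensor names from the line above (everything here is illustrative, not the repo's actual build_targets code):

import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Illustrative placeholders standing in for the tensors used in build_targets().
pconf = torch.rand(8, device=device)
iou_pred = torch.rand(8, device=device)
pcls = torch.randint(0, 80, (8,), device=device)
tc = torch.randint(0, 80, (8,))            # accidentally still on the CPU

# Elementwise ops need all operands on the same device, so move stragglers over
# before combining the masks.
tc = tc.to(device)
mask = (pconf > 0.5) & (iou_pred > 0.5) & (pcls == tc)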

@zhaoyang10 commented:

@xiao1228 Have you solved your first problem in this issue? I want to make the code multi-GPU and ran into it too. I'm still confused. Thank you.

@glenn-jocher (Member) commented:

@xiao1228 @zhaoyang10 the code does not support multi-GPU yet unfortunately. I only have a single-GPU machine so I have not been able to debug this issue. If you come up with a solution please advise me, or submit a pull request. Many thanks!!

@glenn-jocher (Member) commented:

I've changed the code to raise an error when multi-GPU operation is attempted, until this is resolved.

yolov3/train.py, lines 60 to 63 in af0033c:

if torch.cuda.device_count() > 1:
    raise Exception('Multi-GPU not currently supported: https://github.com/ultralytics/yolov3/issues/21')
    # print('Using ', torch.cuda.device_count(), ' GPUs')
    # model = nn.DataParallel(model)

@alexpolichroniadis commented Mar 7, 2019

I've added multi-GPU training support to the pipeline.

See here: #121

@door5719 commented Mar 8, 2019

@alexpolichroniadis, I ran your code and it may not work on 4 x 1080 Ti...

@alexpolichroniadis commented:

Tested on 8x1080tis.

What's the trace?

@door5719 commented Mar 8, 2019

Epoch Batch xy wh conf cls total nTargets time
C:\Users\NJ\Anaconda3\lib\site-packages\torch\nn\parallel\_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
C:\Users\NJ\Anaconda3\lib\site-packages\torch\cuda\nccl.py:24: UserWarning: PyTorch is not compiled with NCCL support
warnings.warn('PyTorch is not compiled with NCCL support')

    0/99      0/3643      1.13      5.46       555      13.3       575   1.6e+03      26.3
    0/99      1/3643      1.65       8.2       833        20       863   3.2e+03      3.28
    0/99      2/3643      2.19      11.2  1.11e+03      26.7  1.15e+03   4.8e+03      1.77
    0/99      3/3643      2.72      14.1  1.39e+03      33.3  1.44e+03   6.4e+03      1.77
    0/99      4/3643      3.25      17.1  1.67e+03        40  1.73e+03     8e+03      1.82
    0/99      5/3643      3.78      20.1  1.94e+03      46.6  2.01e+03   9.6e+03      1.64
    0/99      6/3643      4.31      23.1  2.22e+03      53.3   2.3e+03  1.12e+04      1.66
    0/99      7/3643      4.83      26.1   2.5e+03      59.9  2.59e+03  1.28e+04      1.73
    0/99      8/3643      5.35      29.1  2.78e+03      66.6  2.88e+03  1.44e+04      1.81
    0/99      9/3643      5.87      32.1  3.05e+03      73.2  3.16e+03   1.6e+04      1.62
    0/99     10/3643       6.4      35.1  3.33e+03      79.9  3.45e+03  1.76e+04      1.74
    0/99     11/3643      6.92      38.1  3.61e+03      86.6  3.74e+03  1.92e+04      1.65
    0/99     12/3643      7.45      41.1  3.89e+03      93.2  4.03e+03  2.08e+04      1.73
....
....
    0/99    333/3643       176       830  5.25e+04   2.2e+03  5.57e+04  5.34e+05      1.88
    0/99    334/3643       177       831  5.25e+04  2.21e+03  5.57e+04  5.36e+05       1.7
    0/99    335/3643       177       833  5.25e+04  2.21e+03  5.58e+04  5.38e+05      1.79
    0/99    336/3643       178       835  5.26e+04  2.22e+03  5.58e+04  5.39e+05       1.8
    0/99    337/3643       178       836  5.26e+04  2.23e+03  5.59e+04  5.41e+05      1.68
    0/99    338/3643       179       838  5.27e+04  2.23e+03  5.59e+04  5.42e+05      1.85
    0/99    339/3643       179       840  5.27e+04  2.24e+03   5.6e+04  5.44e+05      1.91
    0/99    340/3643       180       841  5.27e+04  2.24e+03   5.6e+04  5.46e+05      1.84
    0/99    341/3643       180       843  5.28e+04  2.25e+03   5.6e+04  5.47e+05      1.83
    0/99    342/3643       181       845  5.28e+04  2.26e+03  5.61e+04  5.49e+05       1.7
    0/99    343/3643       181       846  5.29e+04  2.26e+03  5.61e+04   5.5e+05      1.95
    0/99    344/3643       182       848  5.29e+04  2.27e+03  5.62e+04  5.52e+05      1.89

loss is growing...

@alexpolichroniadis commented:

Looks like it's working. You are training and batches are being pushed through your model. The lack of NCCL support is a problem with your installation of PyTorch, not the code of this repo. The UserWarning can be ignored.

@alexpolichroniadis commented Mar 8, 2019

Also, what batch size did you specify when running train.py (--batch-size)?

@door5719 commented Mar 8, 2019

parser.add_argument('--epochs', type=int, default=100, help='number of epochs')
parser.add_argument('--batch-size', type=int, default=32, help='size of each image batch')
parser.add_argument('--accumulated-batches', type=int, default=1, help='number of batches before optimizer step')
parser.add_argument('--cfg', type=str, default='cfg/yolov3.cfg', help='cfg file path')
parser.add_argument('--data-cfg', type=str, default='cfg/coco.data', help='coco.data file path')
parser.add_argument('--multi-scale', action='store_true', help='random image sizes per batch 320 - 608')
parser.add_argument('--img-size', type=int, default=32 * 13, help='pixels')
parser.add_argument('--resume', action='store_true', help='resume training flag')
parser.add_argument('--num-workers', type=int, default=0, help='number of workers for dataloader')
parser.add_argument('--var', type=float, default=0, help='test variable')

@door5719 commented Mar 8, 2019

@alexpolichroniadis thank you for your reply :)
I use the default params but modified the batch size.

@alexpolichroniadis commented Mar 8, 2019

> @alexpolichroniadis thank you for your reply :)
> I use the default params but modified the batch size.

Looking at your output, it looks like you pulled an earlier commit from that PR, based on the exploding loss report I'm seeing. I fixed that today in the latest commit. Are you sure you are working off the latest commit of that PR?

@door5719 commented Mar 8, 2019

Today I just downloaded it from https://github.com/alexpolichroniadis/yolov3

@door5719 commented Mar 8, 2019

Maybe using https://github.com/alexpolichroniadis/yolov3/tree/multigpu would be better; I will try it.

@alexpolichroniadis commented:

> Maybe using https://github.com/alexpolichroniadis/yolov3/tree/multigpu would be better; I will try it.

Yes, that is the correct branch to work off. Master still has an older version.

@door5719 commented Mar 8, 2019

This code runs like this:
219 module.104.batch_norm_104.bias True 256 [256] 0 0
220 module.105.conv_105.weight True 65280 [255, 256, 1, 1] 0.000114 0.0362
221 module.105.conv_105.bias True 255 [255] -0.00154 0.036
Model Summary: 222 layers, 6.19491e+07 parameters, 6.19491e+07 gradients

Epoch Batch xy wh conf cls total nTargets time

and then it has stayed in this state until now. It looks like the program may not be running.

@alexpolichroniadis commented:

> This code runs like this:
> 219 module.104.batch_norm_104.bias True 256 [256] 0 0
> 220 module.105.conv_105.weight True 65280 [255, 256, 1, 1] 0.000114 0.0362
> 221 module.105.conv_105.bias True 255 [255] -0.00154 0.036
> Model Summary: 222 layers, 6.19491e+07 parameters, 6.19491e+07 gradients
>
> Epoch Batch xy wh conf cls total nTargets time
>
> and then it has stayed in this state until now. It looks like the program may not be running.

I noticed that you are running this code on Windows. Keep in mind that PyTorch's DataParallel might not be operational on Windows machines due to lack of NCCL support (see here). My testing was on an Ubuntu machine.

@door5719 commented Mar 8, 2019

@alexpolichroniadis, thanks for your help, the code works well now :)

@glenn-jocher (Member) commented Mar 8, 2019

@alexpolichroniadis I get an error when I run the following on a GCP PyTorch instance with 2 GPUs. I noticed you changed coco.data from the darknet default, so I updated this to point back to the default, and this fixed the error.

sudo rm -rf yolov3 && git clone -b multigpu --depth 1 https://github.com/alexpolichroniadis/yolov3
cd yolov3 && python3 train.py

Namespace(accumulated_batches=1, batch_size=16, cfg='cfg/yolov3.cfg', data_cfg='cfg/coco.data', epochs=100, img_size=416, multi_scale=False, num_workers=0, resume=False, var=0)
Using CUDA. Available devices: 
0 - Tesla P100-PCIE-16GB - 16280MB
1 - Tesla P100-PCIE-16GB - 16280MB
Traceback (most recent call last):
  File "train.py", line 234, in <module>
    var=opt.var,
  File "train.py", line 46, in train
    train_loader = ImageLabelDataset(train_path, batch_size, img_size, multi_scale=multi_scale, augment=True)
  File "/home/ultralytics/yolov3/utils/datasets.py", line 105, in __init__
    for x in self.img_files]
AttributeError: 'ImageLabelDataset' object has no attribute 'img_files'

Now I see a separate problem though: there doesn't appear to be any speedup. A single P100 takes about 0.6 s per batch, the same as 2 P100s here:

   Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
/opt/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
    0/99      0/7327      0.51      2.76       277      6.66       287       121         7
    0/99      1/7327      0.52       2.7       278      6.65       287        99     0.712
    0/99      2/7327     0.539      2.83       278      6.65       288       143     0.631
    0/99      3/7327     0.543      2.83       278      6.64       288       123     0.608
...

@alexpolichroniadis commented Mar 8, 2019

What's your batch size? If your batch fits on one GPU, then (in most cases) you are better off using a single GPU. The benefit of a multi-GPU setup is cranking up the batch size and having more images processed in the same amount of time. Try setting your --batch-size to 128 (or something beyond what a single GPU can handle), for example, and re-testing.

Since batching happens on the CPU, there are also cases where the CPU then becomes the bottleneck (the GPUs wait for the batch to be created). This becomes more apparent with large batch sizes. Overall, there is a balance to be found, and it is not immediately apparent.

One other thing: with nn.DataParallel, each GPU is first loaded with its own copy of the model. This happens on the first batch and is reflected in the higher time reported for that batch.
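
For readers following along, a minimal sketch of the nn.DataParallel wrapping being discussed (the tiny model and shapes are placeholders, not the PR's actual code):

import torch
import torch.nn as nn

# Tiny placeholder network standing in for Darknet(cfg); the wrapper is the point.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 8, 3, padding=1))

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
if torch.cuda.device_count() > 1:
    # DataParallel copies the model to every visible GPU on the first forward
    # pass (hence the slower first batch) and splits the input along dim 0,
    # so the speedup only appears with batches larger than one GPU can hold.
    model = nn.DataParallel(model)
model = model.to(device)

imgs = torch.randn(64, 3, 64, 64)   # the batch is still assembled on the CPU
out = model(imgs.to(device))        # scatter -> per-GPU forward -> gather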

@alexpolichroniadis commented Mar 8, 2019

An example: on my setup, for a batch size of 128, processing time per batch is 1 s; for a batch size of 256, it is 1.6 s (all cases with DataParallel on). On a single 1080 Ti, a batch size of 128 is not doable.

@glenn-jocher (Member) commented Mar 16, 2019

@door5719 @alexpolichroniadis thanks for the info. We started on our own multi_gpu branch (https://github.com/ultralytics/yolov3/tree/multi_gpu), with a secondary goal of trying out a different loss approach, selecting a single anchor from the 9 available for each target. The new loss produced significantly worse results, so it appears the current method of selecting one anchor from each yolo layer is correct. In the process we did get multi_gpu operational, though not with the speedups expected. We did not attempt to use a multithreaded PyTorch dataloader, nor PIL in place of OpenCV, as we found both of these slower in our single-GPU profiling last year.

We don't have multi-GPU machines on premises, so we tested this with GCP Deep Learning VMs. We used batch_size=26 (the max that 1 P100 can handle) times the number of GPUs. All other training settings were defaults. We selected the fastest batch out of the first 30 for timing purposes. Results are below for our branch and the #121 PR. In both cases the speedups were very poor. It's possible the IO ops were constrained by GCP due to the limited SSD size; we will try again with a larger SSD, but we wanted to get these results out here for feedback. If anyone has another repo or PR we can compare against, please let us know!

https://cloud.google.com/deep-learning-vm/
Machine type: n1-highmem-4 (4 vCPUs, 26 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 500 GB SSD

GPUs (P100)   batch_size (images)   yolov3/tree/multi_gpu (s/batch)   yolov3/pull/121 (s/batch)
1             26                    0.91s                             1.05s
2             52                    1.60s                             1.76s
4             104                   2.26s                             2.81s

@LightToYang commented:

I think torch.nn.parallel.DistributedDataParallel is better than nn.DataParallel. The use of DataParallel is probably the bottleneck.
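
For reference, a minimal single-node DistributedDataParallel sketch (the model and training step are placeholders, not this repo's code; assumes launching with torchrun or the older torch.distributed.launch, one process per GPU):

# Minimal DDP sketch; launch with: torchrun --nproc_per_node=4 ddp_sketch.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ.get('LOCAL_RANK', 0))   # set by the launcher
    dist.init_process_group(backend='nccl')             # one process per GPU
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(10, 2).cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # In real training a DistributedSampler would give each rank its own data shard.
    x = torch.randn(32, 10).cuda(local_rank)
    loss = model(x).sum()
    loss.backward()                                      # gradients are all-reduced here
    opt.step()
    dist.destroy_process_group()

if __name__ == '__main__':
    main()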

glenn-jocher mentioned this issue Mar 17, 2019
@longxianlei commented:

Because box2 is a torch.FloatTensor (anchor_vec is on the CPU) while box1 is on the GPU, you can just use .cuda() to turn it into a torch.cuda.FloatTensor:

box2 = anchor_vec.cuda().unsqueeze(1)
inter_area = torch.min(box1, box2).prod(2)

But once you fix this, the lines below will also throw errors:

txy[b, a, gj, gi] = gxy - gxy.floor()
# Width and height
twh[b, a, gj, gi] = torch.log(gwh / anchor_vec[a])

You need to move the data to the GPU/CUDA device according to the error info.

However, the main problem for multi-GPU training lies in:

for i, (imgs, targets, _, _) in enumerate(dataloader):

Here imgs is a tensor but targets is a list. When running in parallel, imgs.to(device) is split into batch_size / GPU_num chunks, but targets cannot be moved with targets.to(device) (since it is a list), and it still has batch_size entries, so it cannot be distributed across the GPUs. Then in:

if nM > 0:
    lxy = k * MSELoss(xy[mask], txy[mask])
    lwh = k * MSELoss(wh[mask], twh[mask])

xy and wh have the per-GPU size (batch_size / GPU_num), but txy and twh are built from the full set of targets (batch_size), so the dimensions no longer match and an error occurs.
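
A side note on the .cuda() fix above: with nn.DataParallel each replica runs on its own device, so hard-coding .cuda() pins the anchors to GPU 0. A sketch of a more device-agnostic pattern (illustrative class and shapes, not the repo's actual YOLO layer) is to register the anchors as a module buffer, or move them to the incoming tensor's device:

import torch
import torch.nn as nn

class YOLOLayerSketch(nn.Module):
    """Illustrative stand-in for a YOLO layer that keeps its anchors as a buffer."""
    def __init__(self, anchors):
        super().__init__()
        # Buffers are placed on each replica's device automatically by DataParallel.
        self.register_buffer('anchor_vec', torch.tensor(anchors, dtype=torch.float32))

    def forward(self, box1):
        # Alternatively: box2 = self.anchor_vec.to(box1.device).unsqueeze(1)
        box2 = self.anchor_vec.unsqueeze(1)
        inter_area = torch.min(box1, box2).prod(2)
        return inter_area

layer = YOLOLayerSketch([[10., 13.], [16., 30.], [33., 23.]])
box1 = torch.rand(1, 4, 2)          # illustrative shapes only
print(layer(box1).shape)            # broadcasted min over anchors and boxes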

@glenn-jocher (Member) commented:

@longxianlei we just PR'd our under-development multi_gpu branch into the master branch, so multi-GPU functionality now works. Many of the items you raised above should be resolved. Can you try the latest commit and see if it works for you? See #135 for more info.

@alexpolichroniadis commented:

@glenn-jocher keep in mind that batch sizes should be integer multiples of the number of available GPUs. For a batch size of 26 on 4 GPUs, you are essentially pushing 26//4 = 6 images on all GPUs and the two remaining ones are pushed on the last GPU. This is unbalanced as each GPU processes batch sizes of 6/6/6/8.

The ideal batch size to test here would be 4*6 = 24, and multiples of 24 thereafter. It is also true that the actual bottleneck might be IO at this point.
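
A quick standalone way to see how a given batch size divides across GPUs (torch.chunk splits along dim 0 into approximately equal pieces, which is broadly how DataParallel's scatter divides the input batch):

import torch

n_gpus = 4
for batch_size in (24, 26):
    # Per-GPU chunk sizes for this batch size.
    sizes = [c.shape[0] for c in torch.arange(batch_size).chunk(n_gpus)]
    print(batch_size, '->', sizes)   # 24 -> [6, 6, 6, 6]; 26 splits unevenly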

@glenn-jocher (Member) commented:

Updated times with batch_size=24, and comparison to existing study.

https://cloud.google.com/deep-learning-vm/
Machine type: n1-highmem-4 (4 vCPUs, 26 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 100 GB SSD

GPUs (P100)   batch_size (images)   613ce1b (s/batch)   COCO epoch (min/epoch)
1             24                    0.84s               70min
2             48                    1.27s               53min
4             96                    2.11s               44min

Comparison results from https://github.com/ilkarman/DeepLearningFrameworks
[Screenshot of comparison results, 2019-03-19]

@glenn-jocher (Member) commented:

@alexpolichroniadis, @longxianlei, @LightToYang Great news! Lack of multithreading in the dataloader was slowing down multi-GPU significantly (#141). I reimplemented support for DataLoader multithreading, and speeds have improved greatly (more than double in some cases). The new test results are below for the latest commit.

https://cloud.google.com/deep-learning-vm/
Machine type: n1-standard-8 (8 vCPUs, 30 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 100 GB SSD

GPUs (P100)   batch_size (images)   speed (s/batch)   COCO epoch (min/epoch)
1             16                    0.39s             48min
2             32                    0.48s             29min
4             64                    0.65s             20min
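
For context, the change behind these numbers is the dataloader's worker count; a generic sketch of a multi-worker DataLoader (dummy dataset, not this repo's dataset class):

import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Dummy dataset standing in for the repo's image/label dataset.
    dataset = TensorDataset(torch.randn(512, 3, 64, 64), torch.randint(0, 80, (512,)))

    # num_workers > 0 spawns background worker processes, so CPU-side batching
    # (loading, augmentation, collation) overlaps with GPU compute; pin_memory
    # speeds up the host-to-device copy of each batch.
    loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4, pin_memory=True)

    for imgs, labels in loader:
        pass  # the training step would go here

if __name__ == '__main__':   # guard required on platforms that spawn workers
    main()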

@alexpolichroniadis commented Mar 21, 2019

@glenn-jocher I never noticed that the dataloader's num_workers defaults to 0 because I set it manually all the time, whoops. 😅

Good results indeed. In line with what I was getting.

@jarunm commented Aug 13, 2020

Had the same issue, but I've used this repo on multi-GPU before and it worked well. Somebody had posted that the batch size in the last iteration might be smaller than the batch size given during training, so I removed a few images to make the number of validation images a multiple of 8 (I'd given 8 as my batch size during training), and that solved the issue.
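
An alternative to trimming images (a generic PyTorch sketch, not necessarily what this repo does) is to drop the ragged final batch at the loader level with drop_last:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3, 32, 32))   # 100 is not a multiple of 8

# drop_last=True discards the ragged final batch instead of requiring the
# validation set size to be an exact multiple of the batch size.
loader = DataLoader(dataset, batch_size=8, drop_last=True)
print(sum(1 for _ in loader))   # 12 full batches; the trailing 4 images are skipped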

@Venky0892 commented:

For multi-GPU runs:
Train it first on 1 GPU for about 1000 iterations: darknet.exe detector train cfg/coco.data cfg/yolov4.cfg yolov4.conv.137

Then stop, and using the partially trained model /backup/yolov4_1000.weights, run training with multi-GPU (up to 4 GPUs): darknet.exe detector train cfg/coco.data cfg/yolov4.cfg /backup/yolov4_1000.weights -gpus 0,1,2,3

If you get a NaN, then for some datasets it is better to decrease the learning rate; for 4 GPUs set learning_rate = 0.00065 (i.e. learning_rate = 0.00261 / GPUs). In this case also increase burn_in 4x in your cfg file, i.e. use burn_in = 4000 instead of 1000.
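
As a quick sanity check of the scaling rule above (plain Python arithmetic using the values quoted in the comment):

# Scaling rule from the comment above, as plain arithmetic.
base_lr, base_burn_in, gpus = 0.00261, 1000, 4

learning_rate = base_lr / gpus       # ~0.00065, as quoted above
burn_in = base_burn_in * gpus        # 4000

print(f'learning_rate={learning_rate:.5f}  burn_in={burn_in}')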

@glenn-jocher (Member) commented:

@Venky0892 thanks for sharing your approach! It's always helpful to hear about different strategies for addressing issues with multi-GPU training. The community's diverse experiences and insights contribute greatly to refining best practices. Your suggestions will certainly benefit others who may encounter similar challenges during their training process.
