Multi-GPU Training #21
@xiao1228 The requirements clearly state Python 3.6, so I'd advise you to follow them. Multi-GPU training is still a work in progress. If you could help debug this after upgrading your Python, that would be great!
Hi, I am getting the same error after upgrading to Python 3.6. I will work on it, and if I can fix it I will update you.
I'm getting the error below when trying to move all the variables to CUDA.
@xiao1228 Have you solved the first problem in this issue? I want to make the code multi-GPU and ran into it too. I'm still confused. Thank you.
@xiao1228 @zhaoyang10 Unfortunately the code does not support multi-GPU yet. I only have a single-GPU machine, so I have not been able to debug this issue. If you come up with a solution please advise me, or submit a pull request. Many thanks!!
I've changed the code to raise an error when multi-GPU operation is attempted, until this is resolved. Lines 60 to 63 in af0033c
I've added multi-GPU training support to the pipeline. See here: #121 |
@alexpolichroniadis I ran your code; it may not work on 4 x 1080 Ti.
Tested on 8 x 1080 Tis. What's the trace?
Epoch Batch xy wh conf cls total nTargets time
The loss is growing...
Looks like it's working: you are training and batches are being pushed through your model. The lack of NCCL support is a problem with your installation of PyTorch, not with the code of this repo. The UserWarning can be ignored.
Also, what is your batch size, as specified when running train.py (--batch-size)?
@alexpolichroniadis thank you for your reply :)
Looking at your output, it looks like you pulled an earlier commit from that PR, based on the exploding loss report I'm seeing. I fixed that today in the latest commit. Are you sure you are working off the latest commit of that PR? |
Today I just downloaded it from https://github.com/alexpolichroniadis/yolov3
Maybe using https://github.com/alexpolichroniadis/yolov3/tree/multigpu will be better; I will try that.
Yes, that is the correct branch to work off. Master still has an older version. |
The code runs like this: Epoch Batch xy wh conf cls total nTargets time and then it stays in this state; it looks like the program may not be running.
I noticed that you are running this code on Windows. Keep in mind that PyTorch's DataParallel might not be operational on Windows machines due to the lack of NCCL support, see here. My testing was on an Ubuntu machine.
@alexpolichroniadis Thanks for your help, the code works well now :)
@alexpolichroniadis I get an error when I run the following on a GCP PyTorch instance with 2 GPUs. I noticed you changed coco.data from the darknet default, so I updated this to point back to the default, and this fixed the error. sudo rm -rf yolov3 && git clone -b multigpu --depth 1 https://github.com/alexpolichroniadis/yolov3
cd yolov3 && python3 train.py
Namespace(accumulated_batches=1, batch_size=16, cfg='cfg/yolov3.cfg', data_cfg='cfg/coco.data', epochs=100, img_size=416, multi_scale=False, num_workers=0, resume=False, var=0)
Using CUDA. Available devices:
0 - Tesla P100-PCIE-16GB - 16280MB
1 - Tesla P100-PCIE-16GB - 16280MB
Traceback (most recent call last):
File "train.py", line 234, in <module>
var=opt.var,
File "train.py", line 46, in train
train_loader = ImageLabelDataset(train_path, batch_size, img_size, multi_scale=multi_scale, augment=True)
File "/home/ultralytics/yolov3/utils/datasets.py", line 105, in __init__
for x in self.img_files]
AttributeError: 'ImageLabelDataset' object has no attribute 'img_files'
Now I see a separate problem though: there doesn't appear to be any speedup. A single P100 takes about 0.6 s, the same as 2 P100s here:
What's your batch size? If your batch can fit comfortably on one GPU, then (in most cases) you are better off using a single GPU. The benefit of a multi-GPU setup is cranking up the batch size and having more images processed in the same amount of time. Try setting your --batch-size to 128 (or something beyond what a single GPU can handle), for example, and re-testing.

Since batching happens on the CPU, there are also cases where the CPU then becomes the bottleneck (the GPU waits for the batch to be created). This becomes more apparent with big batch sizes. In all, there is a balance to be found, and it is not directly apparent.

One other thing: with nn.DataParallel there is a preliminary loading of the GPUs with a copy of the model each. This happens on the first batch and is reflected in the higher time reported when processing the first batch.
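A minimal sketch of the behaviour described above, assuming a generic PyTorch model (the model and shapes here are illustrative stand-ins, not this repo's actual code): nn.DataParallel keeps a replica of the model on each visible GPU and splits every input batch along dimension 0.

```python
import torch
import torch.nn as nn

# Illustrative model; stands in for the detection model used in the repo.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())

if torch.cuda.device_count() > 1:
    # DataParallel replicates the model on every GPU and splits each input
    # batch along dim 0, so a batch of 128 on 4 GPUs runs as four
    # sub-batches of 32. The replicas are set up on the first forward pass,
    # which is why the first batch is slower.
    model = nn.DataParallel(model)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

imgs = torch.randn(128, 3, 416, 416, device=device)  # large batch so all GPUs get work
out = model(imgs)
```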
@door5719 @alexpolichroniadis thanks for the info. We started on our own multi_gpu branch (https://github.com/ultralytics/yolov3/tree/multi_gpu), with a secondary goal of trying out a different loss approach: selecting a single anchor from the 9 available for each target. The new loss produced significantly worse results, so it appears the current method of selecting one anchor from each yolo layer is correct. In the process we did get multi_gpu operational, though not with the speedups expected. We did not attempt to use a multithreaded PyTorch dataloader, nor PIL in place of OpenCV, as we found both of these slower in our single-GPU profiling last year. We don't have multi-GPU machines on premises, so we tested this with GCP Deep Learning VMs: https://cloud.google.com/deep-learning-vm/
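For reference, a hedged sketch of the "single best anchor per target" idea mentioned above, assuming anchors and targets are given as (width, height) pairs; this illustrates the general approach, not the branch's actual implementation.

```python
import torch

def wh_iou(targets_wh, anchors_wh):
    # IoU of boxes that share a common centre, using widths/heights only.
    # targets_wh: (N, 2), anchors_wh: (A, 2) -> result: (N, A)
    inter = (torch.min(targets_wh[:, None, 0], anchors_wh[None, :, 0]) *
             torch.min(targets_wh[:, None, 1], anchors_wh[None, :, 1]))
    area_t = (targets_wh[:, 0] * targets_wh[:, 1])[:, None]
    area_a = (anchors_wh[:, 0] * anchors_wh[:, 1])[None, :]
    return inter / (area_t + area_a - inter)

# The 9 COCO anchors from yolov3.cfg, in pixels.
anchors = torch.tensor([[10., 13], [16, 30], [33, 23], [30, 61], [62, 45],
                        [59, 119], [116, 90], [156, 198], [373, 326]])
targets = torch.tensor([[50., 60], [200, 180]])  # example target widths/heights

best_anchor = wh_iou(targets, anchors).argmax(dim=1)  # one anchor index per target
```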
I think it is because box2 is a torch.FloatTensor and anchor_vec is on the CPU, while box1 is on the GPU.
But when you fix this, another bug will come up further down: you need to move the tensors to the GPU (CUDA) according to the error info.
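A minimal sketch of the kind of fix being described, assuming box1, box2 and anchor_vec are the tensors mentioned above (the names come from this discussion; the exact lines differ in the repo).

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hypothetical tensors standing in for the ones mentioned above.
box1 = torch.randn(8, 4, device=device)   # already on the GPU
box2 = torch.randn(8, 4)                  # plain torch.FloatTensor on the CPU
anchor_vec = torch.randn(3, 2)            # also created on the CPU

# Move everything to the same device before combining the tensors;
# otherwise PyTorch raises a device-mismatch error.
box2 = box2.to(box1.device)
anchor_vec = anchor_vec.to(box1.device)

boxes = torch.cat((box1, box2), dim=0)    # now all operands share one device
```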
@longxianlei we just merged our in-development multi_gpu branch into the master branch via PR, so multi-GPU functionality now works. Many of the items you raised above should be resolved. Can you try the latest commit and see if it works for you? See #135 for more info.
@glenn-jocher keep in mind that batch sizes should be integer multiples of the number of available GPUs. For a batch size of 26 on 4 GPUs, you are essentially pushing 26 // 4 = 6 images to each GPU, and the two remaining ones are pushed to the last GPU. This is unbalanced, as the GPUs process batch sizes of 6/6/6/8. The ideal batch size to test here would be 4 * 6 = 24, and multiples of 24 thereafter. Also, it is true that the actual bottleneck might be I/O at this point.
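A small sketch of that arithmetic, assuming the batch size is a plain integer argument (illustrative only, not code from the repo).

```python
import torch

n_gpu = max(1, torch.cuda.device_count())
batch_size = 26

if batch_size % n_gpu != 0:
    # Round down to the nearest multiple so every GPU gets an equal share,
    # e.g. 26 on 4 GPUs becomes 24 (6 images per GPU).
    batch_size = (batch_size // n_gpu) * n_gpu
    print(f'Adjusted batch size to {batch_size} for {n_gpu} GPUs')
```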
Updated times with https://cloud.google.com/deep-learning-vm/
Comparison results from https://github.com/ilkarman/DeepLearningFrameworks |
@alexpolichroniadis, @longxianlei, @LightToYang Great news! The lack of multithreading in the dataloader was slowing down multi-GPU significantly (#141). I reimplemented support for DataLoader multithreading, and speeds have improved greatly (more than doubling in some cases). The new test results for the latest commit are below. https://cloud.google.com/deep-learning-vm/
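A hedged sketch of the kind of change being described, assuming a generic dataset (the dataset and arguments here are placeholders, not the repo's exact dataloader signature).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; the repo uses its own image/label dataset class.
dataset = TensorDataset(torch.randn(256, 3, 64, 64), torch.zeros(256))

# num_workers > 0 spawns background worker processes that load and collate
# upcoming batches while the GPUs are busy, so the GPUs are not left waiting.
# (On Windows, this should be wrapped in an `if __name__ == '__main__':` guard.)
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)

for imgs, labels in loader:
    if torch.cuda.is_available():
        imgs = imgs.cuda(non_blocking=True)  # overlap host-to-device copies
    # ... forward / backward pass would go here ...
    break
```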
@glenn-jocher I never noticed that the dataloader's num_workers defaults to 0, because I always set it manually, whoops. 😅 Good results indeed, in line with what I was getting.
I had the same issue, but I've used this repo on multi-GPU before and it worked well. Somebody had posted that the batch size in the last iteration might be smaller than the batch size given during training, so I removed a few images to make the number of validation-set images a multiple of 8 (the batch size I'd given during training), and that solved the issue.
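For reference, a hedged sketch of an alternative to trimming images: PyTorch's DataLoader can discard the final short batch with drop_last=True. This is a generic option, not what the poster above did.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

val_set = TensorDataset(torch.randn(1003, 3, 64, 64))  # 1003 is not a multiple of 8

# drop_last=True discards the final incomplete batch (here 1003 % 8 = 3 images),
# which has the same effect as removing images so the count divides evenly.
val_loader = DataLoader(val_set, batch_size=8, shuffle=False, drop_last=True)
```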
For multi-GPU runs: first train on a single GPU, then stop and, using the partially-trained model /backup/yolov4_1000.weights, run training with multi-GPU (up to 4 GPUs):
darknet.exe detector train cfg/coco.data cfg/yolov4.cfg /backup/yolov4_1000.weights -gpus 0,1,2,3
If you get a NaN, then for some datasets it is better to decrease the learning rate; for 4 GPUs set learning_rate = 0.00065 (i.e. learning_rate = 0.00261 / number of GPUs). In this case also increase burn_in 4x in your cfg file, i.e. use burn_in = 4000 instead of 1000.
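A quick sketch of that scaling rule in Python; the base values are the ones quoted above, and the helper function is purely illustrative.

```python
def multi_gpu_cfg(base_lr=0.00261, base_burn_in=1000, n_gpus=4):
    """Scale darknet cfg values for multi-GPU training as described above:
    divide the learning rate by the GPU count and multiply burn_in by it."""
    return {
        'learning_rate': base_lr / n_gpus,   # 0.00261 / 4 ≈ 0.00065
        'burn_in': base_burn_in * n_gpus,    # 1000 * 4 = 4000
    }

print(multi_gpu_cfg())  # {'learning_rate': 0.0006525, 'burn_in': 4000}
```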
@Venky0892 thanks for sharing your approach! It's always helpful to hear about different strategies for addressing issues with multi-GPU training. The community's diverse experiences and insights contribute greatly to refining best practices. Your suggestions will certainly benefit others who may encounter similar challenges during their training process. |
Hi,
Have you tried to run training on multiple GPUs?
I am getting the error below when I try to do that. Thank you.