
multi_gpu #135

Merged
merged 62 commits on Mar 17, 2019

Conversation

glenn-jocher (Member) commented Mar 17, 2019

For more information see issue #21.

We started a multi_gpu branch (https://github.com/ultralytics/yolov3/tree/multi_gpu), with a secondary goal of trying out a different loss approach: selecting a single anchor from the 9 available for each target. The new loss produced significantly worse results, so the current method of selecting one anchor from each yolo layer appears correct. In the process we did get multi_gpu operational, though not with the expected speedups. We did not attempt a multithreaded PyTorch dataloader, nor PIL in place of OpenCV, as we found both of these slower in our single-GPU profiling last year.
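
For context, a minimal sketch of the nn.DataParallel wrapping that makes a PyTorch model run on multiple GPUs; the stand-in model below is a placeholder, not the repo's Darknet class:

```python
import torch
import torch.nn as nn

# Stand-in model; the repo's Darknet(cfg) is an nn.Module and wraps the same way.
model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Replicate the model across all visible GPUs. The input batch is split along dim 0,
# so batch_size should be a multiple of torch.cuda.device_count().
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

imgs = torch.randn(8, 3, 416, 416, device=device)  # dummy batch
preds = model(imgs)  # forward pass is scattered across GPUs and gathered on cuda:0
```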

We don't have multi-GPU machines on premises, so we tested this with GCP Deep Learning VMs. We used batch_size=26 (the max that 1 P100 can handle) times the number of GPUs. All other training settings were defaults. We timed the fastest batch out of the first 30. Results are below for our branch and PR #121. In both cases the speedups were very poor. It's possible the I/O ops were constrained by GCP due to the limited SSD size; we will try again with a larger SSD, but we wanted to get these results out here for feedback. If anyone has another repo or PR we can compare against, please let us know!

https://cloud.google.com/deep-learning-vm/
Machine type: n1-highmem-4 (4 vCPUs, 26 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 500 GB SSD

| GPUs (P100) | batch_size (images) | yolov3/tree/multi_gpu (s/batch) | yolov3/pull/121 (s/batch) |
|---|---|---|---|
| 1 | 26 | 0.91 | 1.05 |
| 2 | 52 | 1.60 | 1.76 |
| 4 | 104 | 2.26 | 2.81 |
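
For reference, a rough sketch of the timing method described above (fastest batch out of the first 30), using a stand-in model, data, and loss rather than the repo's actual training loop:

```python
import time
import torch
import torch.nn as nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = nn.Conv2d(3, 8, 3, padding=1).to(device)        # stand-in for Darknet(cfg)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

best = float('inf')
for i in range(30):                                      # time the first 30 batches only
    imgs = torch.randn(4, 3, 416, 416, device=device)    # stand-in batch
    if device.type == 'cuda':
        torch.cuda.synchronize()                         # finish prior GPU work before timing
    t0 = time.time()

    loss = model(imgs).mean()                            # stand-in for the YOLO loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if device.type == 'cuda':
        torch.cuda.synchronize()                         # wait for this batch to finish on GPU
    best = min(best, time.time() - t0)

print(f'fastest of first 30 batches: {best:.2f}s')
```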

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced YOLOv3 model code and testing suite.

📊 Key Changes

  • Removed unused dependencies and streamlined YOLOLayer and forward method for efficiency.
  • Simplified loss computation during training.
  • Enhanced forward method to support ONNX export.
  • Improved performance by ensuring device compatibility during grid creation (see the sketch after this list).
  • Refactored the testing script test.py for clarity.
  • Cleaned up the train.py script, including the removal of unused variables.
  • Updated dataset handling for better indexing and performance.
  • Improved util scripts and functions for maintainability.
  • General code cleanup and refactoring for readability and performance.
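
As an illustration of the grid/device bullet above, here is a minimal sketch of building YOLO grid offsets on the same device as the incoming prediction tensor. The function name `create_grids` and the shapes are illustrative assumptions, not the exact code in models.py:

```python
import torch

def create_grids(ng, device):
    # Build a (1, 1, ng, ng, 2) tensor of x/y cell offsets on the requested device.
    yv, xv = torch.meshgrid(torch.arange(ng), torch.arange(ng))
    return torch.stack((xv, yv), dim=2).view(1, 1, ng, ng, 2).float().to(device)

# Usage: the offsets live on the same device as the YOLO layer output,
# so decoding box centers involves no cross-device tensor ops.
p = torch.zeros(1, 3, 13, 13, 85)             # dummy 13x13 YOLO layer output
grid_xy = create_grids(13, p.device)
xy = torch.sigmoid(p[..., :2]) + grid_xy      # box centers in grid units
```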

🎯 Purpose & Impact

  • 🏎 Speeds up the model’s performance and reduces memory footprint.
  • 🧹 Code cleanup improves maintainability and sets the stage for future features.
  • 🤖 Better ONNX support prepares the model for broader deployment possibilities.
  • 🌍 Device compatibility adjustments ensure more consistent behavior across different computing environments.
  • ✅ Simplified testing and training scripts contribute to a smoother workflow for users setting up and evaluating models.
  • Overall, these changes lead to an improved user and developer experience, making it easier to use and contribute to the project.

glenn-jocher (Member Author) commented Mar 19, 2019

Updated times with batch_size=24, and a comparison to an existing study.

https://cloud.google.com/deep-learning-vm/
Machine type: n1-highmem-4 (4 vCPUs, 26 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 100 GB SSD

| GPUs (P100) | batch_size (images) | 613ce1b (s/batch) | COCO epoch (min/epoch) |
|---|---|---|---|
| 1 | 24 | 0.84 | 70 |
| 2 | 48 | 1.27 | 53 |
| 4 | 96 | 2.11 | 44 |
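
As a rough consistency check on these epoch times, epoch duration should be about (images / batch_size) x s/batch. Assuming the roughly 117k-image COCO trainvalno5k split this repo trains on (an assumption here, not stated in the comment):

```python
n_images = 117264  # approximate COCO trainvalno5k size (assumption)
for bs, s_per_batch in [(24, 0.84), (48, 1.27), (96, 2.11)]:
    minutes = n_images / bs * s_per_batch / 60
    print(f'batch_size={bs}: ~{minutes:.0f} min/epoch')
# prints roughly 68, 52, and 43 min, consistent with the 70/53/44 min measured above
```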

Comparison results from https://github.com/ilkarman/DeepLearningFrameworks:

[Screenshot: multi-GPU scaling comparison, 2019-03-19]

glenn-jocher (Member Author) commented

@alexpolichroniadis, @longxianlei, @LightToYang Great news! Lack of multithreading in the dataloader was slowing down multi-GPU training significantly (#141). I reimplemented support for DataLoader multithreading, and speeds have improved greatly (more than double in some cases). The new test results are below for the latest commit.

https://cloud.google.com/deep-learning-vm/
Machine type: n1-standard-8 (8 vCPUs, 30 GB memory)
CPU platform: Intel Skylake
GPUs: 1-4 x NVIDIA Tesla P100
HDD: 100 GB SSD

| GPUs (P100) | batch_size (images) | speed (s/batch) | COCO epoch (min/epoch) |
|---|---|---|---|
| 1 | 16 | 0.39 | 48 |
| 2 | 32 | 0.48 | 29 |
| 4 | 64 | 0.65 | 20 |
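
For reference, a minimal sketch of the multithreaded loading described above using torch.utils.data.DataLoader; the `DummyImages` dataset, batch size, and `num_workers=4` are placeholders rather than the exact values used in datasets.py:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DummyImages(Dataset):
    """Stand-in for the repo's image/label dataset."""
    def __len__(self):
        return 1000
    def __getitem__(self, i):
        return torch.randn(3, 416, 416), torch.zeros(1, 5)  # image, targets

loader = DataLoader(
    DummyImages(),
    batch_size=16,
    shuffle=True,
    num_workers=4,    # worker processes load/augment batches in parallel with the GPU
    pin_memory=True,  # speeds up host-to-GPU copies
)

for imgs, targets in loader:
    pass  # training step goes here
```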

chrizandr pushed a commit to chrizandr/yolov3 that referenced this pull request Aug 19, 2019