Initial commit
Floris Chabert committed Mar 18, 2019 · 0 parents · commit d8706b0
Showing 38 changed files with 3,643 additions and 0 deletions.
16 changes: 16 additions & 0 deletions .dockerignore
@@ -0,0 +1,16 @@
.git
.DS_Store
__pycache__
*.pyc
*.o
*.so
*.egg-info
build
dist
.vscode
*.jpg
!tests/*.jpg
*.pkl
*.torch
*.plan

16 changes: 16 additions & 0 deletions .gitignore
@@ -0,0 +1,16 @@
.DS_Store
__pycache__
*.pyc
*.o
*.so
odtk/tensorrt/src/*.py
odtk/tensorrt/src/*.cxx
*.egg-info
build
dist
.vscode
*.jpg
!tests/*.jpg
*.pkl
*.torch
*.plan
4 changes: 4 additions & 0 deletions Dockerfile
@@ -0,0 +1,4 @@
FROM nvcr.io/nvidia/pytorch:19.02-py3

COPY . retinanet/
RUN pip install --no-cache-dir -e retinanet/
85 changes: 85 additions & 0 deletions INFERENCE.md
@@ -0,0 +1,85 @@
# Inference

We provide two ways to do inference with `retinanet-examples`:
* PyTorch inference using a trained model (FP32 or FP16 precision)
* Exporting the trained PyTorch model to TensorRT for optimized inference (FP32, FP16, or INT8 precision)

`retinanet infer` will run distributed inference across all available GPUs. When using PyTorch, the default behavior is to run inference with mixed precision. The precision used when running inference with a TensorRT engine corresponds to the precision chosen when the model was exported to TensorRT (see the [TensorRT section](#exporting-trained-pytorch-model-to-tensorrt) below).

**NOTE**: Hardware support for fast FP16 and INT8 math, such as [NVIDIA Tensor Cores](https://www.nvidia.com/en-us/data-center/tensorcore/), depends on your GPU architecture: Volta and newer GPUs support both FP16 and INT8, while Pascal GPUs support either FP16 or INT8 depending on the specific model.

## PyTorch Inference

Evaluate trained PyTorch detection model on COCO 2017 (mixed precision):

```bash
retinanet infer model.pth --images=/data/coco/val2017 --annotations=instances_val2017.json --batch 8
```
**NOTE**: `--batch N` specifies *global* batch size to be used for inference. The batch size per GPU will be `N // num_gpus`.
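
As a quick illustration of that split (illustrative only, not the repo's code; `torch.cuda.device_count()` simply stands in for the number of visible GPUs):

```python
# Illustrative only: mapping a global batch size to a per-GPU batch size.
import torch

global_batch = 8
num_gpus = torch.cuda.device_count() or 1  # fall back to 1 on a CPU-only machine
per_gpu_batch = global_batch // num_gpus
print(f"{per_gpu_batch} image(s) per GPU across {num_gpus} GPU(s)")
```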

Use full precision (FP32) during evaluation:

```bash
retinanet infer model.pth --images=/data/coco/val2017 --annotations=instances_val2017.json --full-precision
```

Evaluate PyTorch detection model with a small input image size:

```bash
retinanet infer model.pth --images=/data/coco/val2017 --annotations=instances_val2017.json --resize 400 --max-size 640
```
Here, the shorter side of the input images will be resized to `resize`, as long as the longer side does not exceed `max-size`; otherwise, the longer side of the input image will be resized to `max-size`.
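
The rule can be sketched in a few lines of Python (an illustration of the behavior described above, not the repository's actual implementation):

```python
# Sketch of the resize rule: scale the shorter side to `resize`, unless that
# would push the longer side past `max_size`; then cap at `max_size` instead.
def compute_scale(width, height, resize=400, max_size=640):
    scale = resize / min(width, height)
    if max(width, height) * scale > max_size:
        scale = max_size / max(width, height)
    return scale

print(compute_scale(1920, 1080))  # longer side caps the scale: 640 / 1920 ≈ 0.33
```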

**NOTE**: For best accuracy, training the model at the intended export size is encouraged.

Run inference using your own dataset:

```bash
retinanet infer model.pth --images=/data/your_images --output=detections.json
```

## Exporting trained PyTorch model to TensorRT

`retinanet-examples` provides a simple workflow for optimizing a trained PyTorch model for inference deployment using TensorRT. The PyTorch model is exported to [ONNX](https://github.com/onnx/onnx), and the ONNX model is then consumed and optimized by TensorRT.
To learn more about TensorRT optimization, see https://developer.nvidia.com/tensorrt.
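
Conceptually, the first step is a standard `torch.onnx.export` call; the sketch below uses a stand-in torchvision classifier rather than this repo's detection model, just to show the general shape of the export (the `retinanet export` command handles the real model and the TensorRT build for you):

```python
# Hedged sketch of the PyTorch -> ONNX step, with a placeholder model.
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()  # stand-in model
dummy_input = torch.randn(1, 3, 1280, 1280)  # the fixed input size baked into the engine
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=9)
```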

**NOTE**: When a model is optimized with TensorRT, the output is a TensorRT engine (.plan file) that can be used for deployment. This TensorRT engine has several fixed properties that are specified during the export process.
* Input image size: TensorRT engines only support a fixed input size.
* Precision: TensorRT supports FP32, FP16, or INT8 precision.
* Target GPU: TensorRT optimizations are tied to the type of GPU on the system where optimization is performed. They are not transferable across different types of GPUs. Put another way, if you aim to deploy your TensorRT engine on a Tesla T4 GPU, you must run the optimization on a system with a T4 GPU.

The workflow for exporting a trained PyTorch detection model to TensorRT is as simple as:

```bash
retinanet export model.pth model_fp16.plan --batch 1 --size 1280
```
This will create a TensorRT engine optimized for batch size 1, using an input size of 1280x1280. By default, the engine will be created to run in FP16 precision.

Export your model to full precision with a non-square input size:
```bash
retinanet export model.pth model_fp32.plan --full-precision --batch 1 --size 800 1280
```

To use INT8 precision with TensorRT, you need to provide calibration images (images that are representative of what will be seen at runtime), which TensorRT uses to determine the dynamic range of activations when quantizing the network.
```bash
retinanet export model.pth model_int8.plan --batch 2 --int8 --calibration-images /data/val/ --calibration-batches 10 --calibration-table model_calibration_table
```

This will randomly select 20 images (10 calibration batches of batch size 2) from `/data/val/` to calibrate the network for INT8 precision. The calibration results will be saved to `model_calibration_table`, which can be used to create subsequent INT8 engines for this model without needing to recalibrate.

Build an INT8 engine for a previously calibrated model:
```bash
retinanet export model.pth model_int8.plan --batch 2 --int8 --calibration-table model_calibration_table
```


## Deployment with TensorRT on NVIDIA Jetson AGX Xavier

We provide a path for deploying trained models with TensorRT onto embedded platforms like [NVIDIA Jetson AGX Xavier](https://developer.nvidia.com/embedded/buy/jetson-agx-xavier-devkit), where PyTorch is not readily available.

You will need to export your trained PyTorch model to ONNX representation on your host system, and copy the resulting ONNX model to your Jetson AGX Xavier:
```bash
retinanet export model.pth model.onnx --size 800 1280
```

For additional documentation on using the example cppapi code to build the TensorRT engine and run inference, see the [cppapi example code](extras/cppapi/README.md).
25 changes: 25 additions & 0 deletions LICENSE
@@ -0,0 +1,25 @@
Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
152 changes: 152 additions & 0 deletions README.md
@@ -0,0 +1,152 @@
# RetinaNet Examples

**Fast** and **accurate** single stage object detection with end-to-end GPU optimization.

## Description

[RetinaNet](#references) is a single shot object detector with multiple backbones offering various performance/accuracy trade-offs.

It is optimized for end-to-end GPU processing using:
* The [PyTorch](https://pytorch.org) deep learning framework
* NVIDIA [Apex](https://github.com/NVIDIA/apex) for mixed precision and distributed training
* NVIDIA [DALI](https://github.com/NVIDIA/DALI) for optimized data pre-processing
* NVIDIA [TensorRT](https://developer.nvidia.com/tensorrt) for high-performance inference

## Disclaimer

This is a research project, not an official NVIDIA product.

## Installation

For best performance, we encourage using the latest [PyTorch NGC docker container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch):
```bash
nvidia-docker run --rm --ipc=host -it nvcr.io/nvidia/pytorch:19.02-py3
```

From the container, simply install retinanet using `pip`:
```bash
pip install --no-cache-dir git+https://github.com/nvidia/retinanet-examples
```

Or you can clone this repository, build and run your own image:
```bash
git clone https://github.com/nvidia/retinanet-examples
docker build -t retinanet:latest retinanet/
nvidia-docker run --rm --ipc=host -it retinanet:latest
```

## Usage

Training, inference, evaluation and model export can be done through the `retinanet` utility.

For more details refer to the [INFERENCE](INFERENCE.md) and [TRAINING](TRAINING.md) documentation.

### Training

Train a detection model on [COCO 2017](http://cocodataset.org/#download) from pre-trained backbone:
```bash
retinanet train retinanet_rn50fpn.pth --backbone ResNet50FPN \
--images /coco/images/train2017/ --annotations /coco/annotations/instances_train2017.json \
--val-images /coco/images/val2017/ --val-annotations /coco/annotations/instances_val2017.json
```

### Fine Tuning

Fine-tune a pre-trained model on [your dataset](#datasets); here we'll use [Pascal VOC](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html) with [JSON annotations](https://storage.googleapis.com/coco-dataset/external/PASCAL_VOC.zip):
```bash
retinanet train model_mydataset.pth \
--fine-tune retinanet_rn50fpn.pth \
--classes 20 --iters 10000 --val-iters 1000 --lr 0.0005 \
--resize 512 --jitter 480 640 --images /voc/JPEGImages/ \
--annotations /voc/pascal_train2012.json --val-annotations /voc/pascal_val2012.json
```

Note: the shorter side of the input images will be resized to `resize`, as long as the longer side does not exceed `max-size`. During training, the images will be randomly resized to a new size within the `jitter` range.

### Inference

Evaluate your detection model on [COCO 2017](http://cocodataset.org/#download):
```bash
retinanet infer retinanet_rn50fpn.pth --images /coco/images/val2017/ --annotations /coco/annotations/instances_val2017.json
```

Run inference on [your dataset](#datasets):
```bash
retinanet infer retinanet_rn50fpn.pth --images /dataset/val --output detections.json
```

### Optimized Inference with TensorRT

For faster inference, export the detection model to an optimized FP16 TensorRT engine:
```bash
retinanet export model.pth engine.plan
```

Evaluate the model with the TensorRT backend on [COCO 2017](http://cocodataset.org/#download):
```bash
retinanet infer engine.plan --images /coco/images/val2017/ --annotations /coco/annotations/instances_val2017.json
```

### INT8 Inference with TensorRT

For even faster inference, do INT8 calibration to create an optimized INT8 TensorRT engine:
```bash
retinanet export model.pth engine.plan --int8 --calibration-images /coco/images/val2017/
```
This will create an INT8CalibrationTable file that can be used to create INT8 TensorRT engines for the same model later on without needing to do calibration.

Or create an optimized INT8 TensorRT engine using a cached calibration table:
```bash
retinanet export model.pth engine.plan --int8 --calibration-table /path/to/INT8CalibrationTable
```

## Backbones

Training results for [COCO 2017](http://cocodataset.org/#detection-2017) (train/val) after a full training schedule with default parameters.

Inference latencies include bounding-box post-processing and are measured at batch size 1.

Backbone | Resize | mAP @[IoU=0.50:0.95] | Training Time [[DGX1v](https://www.nvidia.com/en-us/data-center/dgx-1/)] | Inference Latency FP16 [[V100](https://www.nvidia.com/en-us/data-center/tesla-v100/)] | Inference Latency FP16 [[T4](https://www.nvidia.com/en-us/data-center/tesla-t4/)] | Inference Latency INT8 [[T4](https://www.nvidia.com/en-us/data-center/tesla-t4/)]
--- | :---: | :---: | :---: | :---: | :---: | :---:
ResNet18FPN | 800 | 0.318 | 5 hrs | 12 ms/im | 17 ms/im | 12 ms/im
ResNet34FPN | 800 | 0.343 | 6 hrs | 14 ms/im | 20 ms/im | 14 ms/im
ResNet50FPN | 800 | 0.358 | 7 hrs | 16 ms/im | 26 ms/im | 16 ms/im
ResNet101FPN | 800 | 0.376 | 10 hrs | 20 ms/im | 34 ms/im | 20 ms/im
ResNet152FPN | 800 | 0.393 | 12 hrs | 25 ms/im | 42 ms/im | 24 ms/im

## Datasets

RetinaNet supports annotations in the [COCO JSON format](http://cocodataset.org/#format-data).
When converting the annotations from your own dataset into JSON, the following entries are required:
```
{
    "images": [{
        "id" : int,
        "file_name" : str
    }],
    "annotations": [{
        "id" : int,
        "image_id" : int,
        "category_id" : int,
        "bbox" : [x, y, w, h]
    }],
    "categories": [{
        "id" : int
    }]
}
```
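
As a minimal illustration, a valid annotations file with those required entries could be produced like this (placeholder IDs, file name, and box values):

```python
# Write a minimal COCO-style annotations file with only the required fields.
import json

dataset = {
    "images": [{"id": 1, "file_name": "img_0001.jpg"}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                     "bbox": [10.0, 20.0, 100.0, 50.0]}],  # [x, y, w, h] in pixels
    "categories": [{"id": 1}],
}

with open("annotations.json", "w") as f:
    json.dump(dataset, f)
```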

## References

- [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002).
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár.
ICCV, 2017.
- [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677).
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He.
June 2017.
- [Feature Pyramid Networks for Object Detection](https://arxiv.org/abs/1612.03144).
Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie.
CVPR, 2017.
- [Deep Residual Learning for Image Recognition](http://arxiv.org/abs/1512.03385).
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
CVPR, 2016.
51 changes: 51 additions & 0 deletions TRAINING.md
@@ -0,0 +1,51 @@
# Training

There are two main ways to train a model with `retinanet-examples`:
* Fine-tuning the detection model using a model already trained on a large dataset (like MS-COCO)
* Fully training the detection model from random initialization using a pre-trained backbone (usually on ImageNet)

## Fine-tuning

Fine-tuning an existing model trained on COCO allows you to use transfer learning to get an accurate model for your own dataset with minimal training.
When fine-tuning, we re-initialize the last layer of the classification head so the network will re-learn how to map features to class scores, regardless of the number of classes in your own dataset.
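
Conceptually, that re-initialization looks like the sketch below; `cls_head` and `anchors_per_cell` are hypothetical names used for illustration, not this repository's actual attributes:

```python
# Hedged sketch: swap the final conv of a classification head for a new class count.
import torch.nn as nn

def reset_classification_output(model, num_classes, anchors_per_cell=9):
    last = model.cls_head[-1]  # hypothetical: last conv layer of the class head
    model.cls_head[-1] = nn.Conv2d(last.in_channels, num_classes * anchors_per_cell,
                                   kernel_size=3, padding=1)
    return model
```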

You can fine-tune a pre-trained model on [your dataset](#datasets); here we'll use [Pascal VOC](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html) with [JSON annotations](https://storage.googleapis.com/coco-dataset/external/PASCAL_VOC.zip):
```bash
retinanet train model_mydataset.pth \
--fine-tune retinanet_rn50fpn.pth \
--classes 20 --iters 10000 --val-iters 1000 --lr 0.0005 \
--resize 512 --jitter 480 640 --images /voc/JPEGImages/ \
--annotations /voc/pascal_train2012.json --val-annotations /voc/pascal_val2012.json
```

Even though the COCO model was trained on 80 classes, we can easily use transfer learning to fine-tune it on the Pascal VOC dataset, which has only 20 classes.

The shorter side of the input images will be resized to `resize`, as long as the longer side does not exceed `max-size`.
During training, the images will be randomly resized to a new size within the `jitter` range.
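
A rough sketch of that jittered resize (illustrative only; the `max_size` default here is a placeholder, not the repository's value):

```python
# Sketch: each training iteration draws a target size from the jitter range,
# then applies the same shorter-side / max-size rule as at inference time.
import random

def sample_training_scale(width, height, jitter=(480, 640), max_size=1024):
    resize = random.randint(jitter[0], jitter[1])
    scale = resize / min(width, height)
    if max(width, height) * scale > max_size:
        scale = max_size / max(width, height)
    return scale
```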

We usually want to fine-tune the model with a lower learning rate `lr` than during full training, and for fewer iterations `iters`.

## Full Training

If you do not have a pre-trained model, if your dataset is substantially large, or if you have written your own backbone, you should fully train the detection model.

Full training usually starts from a backbone (automatically downloaded for the backbones we currently offer) that has been pre-trained on a classification task with a large dataset like [ImageNet](http://www.image-net.org).
This is especially important for backbones that use batch normalization: batch normalization requires large batch sizes during training, which detection training cannot provide because the input images have to be relatively large.

Train a detection model on [COCO 2017](http://cocodataset.org/#download) from pre-trained backbone:
```bash
retinanet train retinanet_rn50fpn.pth --backbone ResNet50FPN \
--images /coco/images/train2017/ --annotations /coco/annotations/instances_train2017.json \
--val-images /coco/images/val2017/ --val-annotations /coco/annotations/instances_val2017.json
```

We use mixed precision training by default. Full precision training can be enabled with the `full-precision` option, although in our experience it does not improve accuracy.

If you want to set up your own training schedule, the following options are useful (see the sketch after this list):
* `iters` is the total number of iterations to train the model for (one iteration with a `batch` size of 16 corresponds to going through 16 images of your dataset)
* `milestones` is a list of iteration counts at which the learning rate is decayed
* `lr` is the initial learning rate, and `gamma` is the factor by which the learning rate is multiplied at each decay milestone
* `schedule` is a float value by which `iters` and `milestones` are multiplied, to easily scale the whole learning schedule
* `warmup` is the number of initial iterations during which the learning rate is linearly ramped up, to avoid early divergence of the loss
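
Putting these options together, the resulting learning rate behaves roughly like the sketch below (a minimal illustration of the semantics described above, with placeholder default values, not the repository's implementation):

```python
# Illustrative learning-rate schedule with warmup, milestones, gamma, and schedule scaling.
def learning_rate(iteration, lr=0.01, gamma=0.1, warmup=1000,
                  milestones=(60000, 80000), schedule=1.0):
    scaled_milestones = [m * schedule for m in milestones]
    if iteration < warmup:
        return lr * iteration / warmup          # linear ramp-up
    decays = sum(iteration >= m for m in scaled_milestones)
    return lr * gamma ** decays                 # decay by gamma at each milestone

print(learning_rate(500))    # mid-warmup: 0.005
print(learning_rate(70000))  # past the first milestone: lr * gamma = 0.001
```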

You can also monitor the loss and learning rate schedule of the training with TensorBoard by specifying a `logdir` path.