This project showcases inference with PyTorch CNN models such as ResNet50, EfficientNet, and MobileNet, and their optimization using ONNX, OpenVINO, and NVIDIA TensorRT. The script runs inference on a user-specified image and displays the top-K predictions. Benchmarking covers configurations such as PyTorch CPU, ONNX CPU, OpenVINO CPU, PyTorch CUDA, TensorRT-FP32, and TensorRT-FP16.
The project is Dockerized for easy deployment:
- CPU-only Deployment - Suitable for non-GPU systems (supports `PyTorch CPU`, `ONNX CPU`, and `OpenVINO CPU` models only).
- GPU Deployment - Optimized for NVIDIA GPUs (supports all models: `PyTorch CPU`, `ONNX CPU`, `OpenVINO CPU`, `PyTorch CUDA`, `TensorRT-FP32`, and `TensorRT-FP16`).
Please look at the Steps to Run section for Docker instructions.
- Clone this repo: `git clone https://github.com/DimaBir/ResNetTensorRT.git`
- Python 3.x
- Docker installed on your system
- NVIDIA GPU (for CUDA and TensorRT benchmarks and optimizations)
- NVIDIA drivers installed on the host machine.
- CPU-only Deployment:
  - Build: `docker build -t cpu_img .`
  - Run: `docker run -it --rm cpu_img /bin/bash`
- GPU (CUDA) Deployment:
  - Build: `docker build --build-arg ENVIRONMENT=gpu --build-arg BASE_IMAGE=nvcr.io/nvidia/tensorrt:23.08-py3 -t gpu_img .`
  - Run: `docker run --gpus all -it --rm gpu_img`
`python main.py [--mode all]`
- `--image_path`: (Optional) Path to the image you want to predict.
- `--topk`: (Optional) Number of top predictions to show. Defaults to 5 if not provided.
- `--mode`: (Optional) The model's mode for exporting and running. Choices are: `onnx`, `ov`, `cpu`, `cuda`, `tensorrt`, and `all`. If not provided, it defaults to `all`.
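For illustration, here is a minimal `argparse` sketch matching the flags described above; it is an assumption of how `main.py` could parse its arguments, not the repo's actual code.

```python
# Hypothetical argument parsing mirroring the documented flags; main.py may differ.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="CNN inference and benchmarking")
    parser.add_argument("--image_path", type=str, default=None,
                        help="Path to the image to run predictions on")
    parser.add_argument("--topk", type=int, default=5,
                        help="Number of top predictions to show")
    parser.add_argument("--mode", type=str, default="all",
                        choices=["onnx", "ov", "cpu", "cuda", "tensorrt", "all"],
                        help="Model mode for exporting and running")
    return parser.parse_args()
```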
`python main.py --topk 3 --mode=all --image_path="./inference/cat3.jpg"`

This command runs predictions on the chosen image (`./inference/cat3.jpg`), shows the top 3 predictions, and runs all available models. Note: the plot is created only for `--mode=all`; results are plotted and saved to `./inference/plot.png`.
Here is an example of the input image to run predictions and benchmarks on:
- Average Inference Time: This plot shows the average time taken for inference across the different model types and optimization techniques. The y-axis represents the model type (e.g., PyTorch CPU, TensorRT FP16, etc.), and the x-axis represents the average inference time in milliseconds. The shorter the bar, the faster the inference.
- Throughput: This plot compares the throughput achieved by the different model types. Throughput is measured as the number of images processed per second. The y-axis represents the model type, and the x-axis represents the throughput. A higher bar indicates better throughput, meaning the model can process more images in a given time frame.
These plots offer a comprehensive view of the performance improvements achieved by various inference optimization techniques, especially when leveraging TensorRT with different precision types like FP16 and FP32.
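As a rough illustration (not the repo's plotting code), the two bar charts could be produced with matplotlib along these lines; the example latencies are taken from the benchmark numbers below, and throughput is approximated as 1000 / latency, which assumes a batch size of 1.

```python
# Illustrative sketch of the two benchmark plots; values and file path are examples.
import matplotlib.pyplot as plt

avg_ms = {                      # average inference time per model type (example values)
    "PyTorch CPU": 31.93, "ONNX CPU": 16.25, "OpenVINO CPU": 15.00,
    "PyTorch CUDA": 5.70, "TRT FP32": 1.69, "TRT FP16": 0.75,
}
throughput = {name: 1000.0 / ms for name, ms in avg_ms.items()}  # images/sec, assuming batch size 1

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.barh(list(avg_ms), list(avg_ms.values()))
ax1.set_xlabel("Average inference time (ms)")
ax2.barh(list(throughput), list(throughput.values()))
ax2.set_xlabel("Throughput (images/sec)")
fig.tight_layout()
fig.savefig("./inference/plot.png")
```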
#1: 15% Egyptian cat
#2: 14% tiger cat
#3: 9% tabby
#4: 2% doormat
#5: 2% lynx
- CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
- RAM: 16 GB
- GPU: None
- `PyTorch_cpu: 31.93 ms` indicates the average batch time when running the `PyTorch` model on the `CPU` device.
- `PyTorch_cuda: 5.70 ms` indicates the average batch time when running the `PyTorch` model on the `CUDA` device.
- `TRT_fp32: 1.69 ms` shows the average batch time when running the model with `TensorRT` using `float32` precision.
- `TRT_fp16: 0.75 ms` indicates the average batch time when running the model with `TensorRT` using `float16` precision.
- `ONNX: 16.25 ms` indicates the average batch inference time when running the `PyTorch` model converted to `ONNX` on the `CPU` device.
- `OpenVINO: 15.00 ms` indicates the average batch inference time when running the `ONNX` model converted to `OpenVINO` on the `CPU` device.
#1: 15% Egyptian cat
#2: 14% tiger cat
#3: 9% tabby
#4: 2% doormat
#5: 2% lynx
- CPU: Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz
- RAM: 32 GB
- GPU: GeForce RTX 3070 (CUDA)
#1: 15% Egyptian cat
#2: 14% tiger cat
#3: 9% tabby
#4: 2% doormat
#5: 2% lynx
- CPU: M1 Pro Chip
- RAM: 16 GB
- GPU: None
Here you can see the flow for each model and benchmark.
In the provided code, we perform inference using the native PyTorch framework on both CPU and GPU (CUDA) configurations. This serves as a baseline against which to compare the performance improvements gained from the other optimization techniques.
- The ResNet-50 model is loaded from torchvision and, if available, transferred to the GPU.
- Inference is performed on the provided image using the specified model.
- Benchmark results, including average inference time, are logged for the CPU and CUDA setups.
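The following is a minimal sketch of such a baseline benchmark, assuming a simple timing loop with warm-up iterations; function names and the number of runs are illustrative, not the repo's actual code.

```python
# Hedged sketch: time ResNet-50 inference on CPU or CUDA and report the average latency.
import time
import torch
from PIL import Image
from torchvision import models

def benchmark_pytorch(image_path: str, device: str = "cpu", runs: int = 100):
    weights = models.ResNet50_Weights.DEFAULT
    model = models.resnet50(weights=weights).eval().to(device)
    x = weights.transforms()(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)

    with torch.no_grad():
        for _ in range(10):                          # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            out = model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    avg_ms = (time.perf_counter() - start) * 1000 / runs
    print(f"PyTorch_{device}: {avg_ms:.2f} ms")
    return out, avg_ms
```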
TensorRT offers significant performance improvements by optimizing the neural network model. This code uses TensorRT's capabilities to run benchmarks in FP32 (single precision) and FP16 (half precision) modes.
- Load the ResNet-50 model.
- Convert the PyTorch model to TensorRT format with the specified precision.
- Perform inference on the provided image.
- Log the benchmark results for the specified TensorRT precision mode.
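A hedged sketch of the conversion step using Torch-TensorRT is shown below; the exact `torch_tensorrt.compile` arguments can vary between releases, so treat this as an outline rather than the repo's exact code.

```python
# Sketch: compile ResNet-50 with Torch-TensorRT at a chosen precision (requires a CUDA GPU).
import torch
import torch_tensorrt
from torchvision import models

def compile_with_tensorrt(precision: torch.dtype = torch.float32):
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval().cuda()
    trt_model = torch_tensorrt.compile(
        model,
        inputs=[torch_tensorrt.Input((1, 3, 224, 224))],  # static input shape
        enabled_precisions={precision},                   # {torch.float32} or {torch.float16}
    )
    return trt_model

# Example: run FP16 inference on a dummy input.
# out = compile_with_tensorrt(torch.float16)(torch.randn(1, 3, 224, 224).cuda())
```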
The code includes an exporter that converts the PyTorch ResNet-50 model to ONNX format, allowing it to be inferred using ONNX Runtime. This provides a flexible, cross-platform solution for deploying the model.
- The ResNet-50 model is loaded.
- Using the ONNX exporter utility, the PyTorch model is converted to ONNX format.
- ONNX Runtime session is created.
- Inference is performed on the provided image using the ONNX model.
- Benchmark results are logged for the ONNX model.
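A minimal sketch of this export-and-run flow, assuming the standard `torch.onnx` exporter and the ONNX Runtime CPU execution provider (file names are illustrative):

```python
# Sketch: export ResNet-50 to ONNX and run it with ONNX Runtime on CPU.
import numpy as np
import onnxruntime as ort
import torch
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "resnet50.onnx",
                  input_names=["input"], output_names=["output"])

session = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])
x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # stand-in for a preprocessed image
logits = session.run(["output"], {"input": x})[0]
print(logits.shape)                                      # (1, 1000)
```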
OpenVINO is a toolkit from Intel that optimizes deep learning model inference for Intel CPUs, GPUs, and other hardware. In the code, we convert the ONNX model to OpenVINO's format and then run benchmarks using the OpenVINO runtime.
- The ONNX model (created in the previous step) is loaded.
- Convert the ONNX model to OpenVINO's IR format.
- Create an inference engine using OpenVINO's runtime.
- Perform inference on the provided image using the OpenVINO model.
- Benchmark results, including average inference time, are logged for the OpenVINO model.
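A hedged sketch of this conversion and inference flow with the OpenVINO Python API (recent 2023+ releases; file names are illustrative):

```python
# Sketch: convert the ONNX model to OpenVINO IR and run it on CPU.
import numpy as np
import openvino as ov

core = ov.Core()
ov_model = ov.convert_model("resnet50.onnx")      # convert ONNX to OpenVINO's in-memory IR
ov.save_model(ov_model, "resnet50.xml")           # optionally persist the IR to disk
compiled = core.compile_model(ov_model, device_name="CPU")

x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # stand-in for a preprocessed image
logits = compiled([x])[compiled.output(0)]
print(logits.shape)                                      # (1, 1000)
```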
- PyTorch: Official Documentation
- Torch-TensorRT: a compiler for PyTorch/TorchScript targeting NVIDIA GPUs via NVIDIA’s TensorRT Deep Learning Optimizer and Runtime. Torch-TensorRT Documentation
- torch.onnx: PyTorch's built-in ONNX exporter. Documentation
- OpenVINO: Intel's toolkit for computer vision applications; it includes a model optimizer to convert trained models into a format suitable for optimal execution on end-point target devices. Official Documentation
- OpenVINO - Converting ONNX to OV: Convert Model From ONNX