- Pull the image

$ docker pull paddlepaddle/triton_paddle:21.10

Note: only the Triton Inference Server 21.10 image is supported.
The model repository is the directory where you place the models that you want Triton to serve. An example model repository is included in the examples directory. Before using the repository, you must fetch the models with the following script.
$ cd examples
$ ./fetch_models.sh
$ cd .. # back to root of paddle_backend
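Each entry in the repository follows the standard Triton layout: one directory per model containing a config.pbtxt and at least one numbered version subdirectory with the model files. The sketch below is only illustrative; the model name ResNet50 and the Paddle file names model.pdmodel / model.pdiparams are assumptions and may not match what fetch_models.sh actually downloads.

models/
└── ResNet50/
    ├── config.pbtxt
    └── 1/
        ├── model.pdmodel
        └── model.pdiparams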
- Launch a container from the image
$ docker run --gpus=all --rm -it --name triton_server --net=host -e CUDA_VISIBLE_DEVICES=0 \
-v `pwd`/examples/models:/workspace/models \
paddlepaddle/triton_paddle:21.10 /bin/bash
- Launch the Triton Inference Server
/opt/tritonserver/bin/tritonserver --model-repository=/workspace/models
Note: run /opt/tritonserver/bin/tritonserver --help to list all available parameters.
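By default Triton serves HTTP on port 8000, gRPC on port 8001, and metrics on port 8002. Since the container runs with --net=host, these can be set explicitly if the host ports are taken; the sketch below simply restates the defaults to make the mapping visible:

/opt/tritonserver/bin/tritonserver --model-repository=/workspace/models \
    --http-port=8000 --grpc-port=8001 --metrics-port=8002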
Use Triton’s ready endpoint to verify that the server and the models are ready for inference. From the host system, use curl to access the HTTP endpoint that indicates server status.
$ curl -v localhost:8000/v2/health/ready
...
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
The HTTP request returns status 200 if Triton is ready and non-200 if it is not ready.
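In scripts it is often handier to wait for readiness than to check it once. A minimal sketch that polls the same endpoint until it returns 200, assuming the default HTTP port 8000:

$ until [ "$(curl -s -o /dev/null -w '%{http_code}' localhost:8000/v2/health/ready)" = "200" ]; do sleep 1; done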
Before running the examples, please make sure the Triton server is running correctly.
Change the working directory to examples
$ cd examples
ERNIE-2.0 is a pre-training framework for language understanding.
Steps to run the benchmark on ERNIE
$ bash perf_ernie.sh
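The benchmark tables below (sequences/second and latency percentiles) are the kind of figures Triton's perf_analyzer client reports. To take a single measurement by hand instead of using the script, an invocation would look roughly like the sketch below; the model name ERNIE, the batch size, and the concurrency range are assumptions and may differ from what perf_ernie.sh actually uses:

$ perf_analyzer -m ERNIE -b 1 --concurrency-range 1:4 -u localhost:8000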
ResNet50-v1.5 is a modified version of the original ResNet50 v1 model.
Steps to run the benchmark on ResNet50-v1.5
$ bash perf_resnet50_v1.5.sh
Steps to run the inference on ResNet50-v1.5

- Prepare processed images following DeepLearningExamples and place the imagenet folder under the examples directory.
- Run the inference

$ bash infer_resnet_v1.5.sh imagenet/<id>
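If the preprocessed samples are laid out one per numbered directory, the script can be run over several of them in a loop; the ids below are purely illustrative placeholders for whatever <id> values exist in your imagenet folder:

$ for id in 0 1 2; do bash infer_resnet_v1.5.sh imagenet/${id}; done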
Precision | Backend Accelerator | Client Batch Size | Sequences/second | P90 Latency (ms) | P95 Latency (ms) | P99 Latency (ms) | Avg Latency (ms) |
---|---|---|---|---|---|---|---|
FP16 | TensorRT | 1 | 270.0 | 3.813 | 3.846 | 4.007 | 3.692 |
FP16 | TensorRT | 2 | 500.4 | 4.282 | 4.332 | 4.709 | 3.980 |
FP16 | TensorRT | 4 | 831.2 | 5.141 | 5.242 | 5.569 | 4.797 |
FP16 | TensorRT | 8 | 1128.0 | 7.788 | 7.949 | 8.255 | 7.089 |
FP16 | TensorRT | 16 | 1363.2 | 12.702 | 12.993 | 13.507 | 11.738 |
FP16 | TensorRT | 32 | 1529.6 | 22.495 | 22.817 | 24.634 | 20.901 |

Precision | Backend Accelerator | Client Batch Size | Sequences/second | P90 Latency (ms) | P95 Latency (ms) | P99 Latency (ms) | Avg Latency (ms) |
---|---|---|---|---|---|---|---|
FP16 | TensorRT | 1 | 288.8 | 3.494 | 3.524 | 3.608 | 3.462 |
FP16 | TensorRT | 2 | 494.0 | 4.083 | 4.110 | 4.208 | 4.047 |
FP16 | TensorRT | 4 | 758.4 | 5.327 | 5.359 | 5.460 | 5.273 |
FP16 | TensorRT | 8 | 1044.8 | 7.728 | 7.770 | 7.949 | 7.658 |
FP16 | TensorRT | 16 | 1267.2 | 12.742 | 12.810 | 13.883 | 12.647 |
FP16 | TensorRT | 32 | 1113.6 | 28.840 | 29.044 | 30.357 | 28.641 |
FP16 | TensorRT | 64 | 1100.8 | 58.512 | 58.642 | 59.967 | 58.251 |
FP16 | TensorRT | 128 | 1049.6 | 121.371 | 121.834 | 123.371 | 119.991 |

Precision | Backend Accelerator | Client Batch Size | Sequences/second | P90 Latency (ms) | P95 Latency (ms) | P99 Latency (ms) | Avg Latency (ms) |
---|---|---|---|---|---|---|---|
FP16 | TensorRT | 1 | 291.8 | 3.471 | 3.489 | 3.531 | 3.427 |
FP16 | TensorRT | 2 | 466.0 | 4.323 | 4.336 | 4.382 | 4.288 |
FP16 | TensorRT | 4 | 665.6 | 6.031 | 6.071 | 6.142 | 6.011 |
FP16 | TensorRT | 8 | 833.6 | 9.662 | 9.684 | 9.767 | 9.609 |
FP16 | TensorRT | 16 | 899.2 | 18.061 | 18.208 | 18.899 | 17.748 |
FP16 | TensorRT | 32 | 761.6 | 42.333 | 43.456 | 44.167 | 41.740 |
FP16 | TensorRT | 64 | 793.6 | 79.860 | 80.410 | 80.807 | 79.680 |
FP16 | TensorRT | 128 | 793.6 | 158.207 | 158.278 | 158.643 | 157.543 |