oneDNN performance
Ashok Bhat edited this page Oct 20, 2020 · 19 revisions
Observation | Details
---|---
Max inference throughput | Use 4 cores per instance with each instance doing 8-10 images per batch |
Max inference throughput (<7ms latency) | Use 24-28 cores per instance with each instance doing 8-12 images per batch |
INT8 vs FP32 | ~4x scaling (3.9 for TF, 3.7 for PyTorch) |
oneDNN vs Eigen on TF | 2x faster at batch size 1 and 128 |
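The per-instance recommendations above translate into one pinned process per group of cores. A minimal dry-run sketch (the `inference_app` binary and its `--batch` flag are hypothetical placeholders; `taskset` is the standard Linux CPU-affinity tool) that generates one launch command per 4-core instance on a 56-core system:

```python
# Dry run: print one core-pinned launch command per instance.
# "inference_app" and "--batch" are hypothetical placeholders.
def launch_commands(total_cores, cores_per_instance=4, batch=10):
    cmds = []
    for first in range(0, total_cores, cores_per_instance):
        last = first + cores_per_instance - 1
        cmds.append(f"taskset -c {first}-{last} ./inference_app --batch {batch}")
    return cmds

for cmd in launch_commands(56):
    print(cmd)
```

On a 2-socket 8280 (56 cores) this yields 14 instances, matching the table below.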
Max throughput (ResNet-50, INT8):

System | Cores | Images/sec | Instances | Cores per instance | Batch per instance
---|---|---|---|---|---|
Intel Xeon Platinum 8280 (2S) | 56 | 3689 | 14 | 4 | 10 |
Intel Xeon Platinum 9242 (2S) | 96 | 5943 | 24 | 4 | 8 |
Intel Xeon Platinum 9282 (2S) | 112 | 7736 | 28 | 4 | 8 |
- Configuration: INT8, MKL-DNN 0.17, 2-socket, HT/Turbo on, Intel-optimized Caffe, ResNet-50, tested on 3/04/2019
- Source: Intel's page
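As a sanity check on the table above, the instances exactly tile the machine (instances × cores-per-instance = total cores), and the per-instance throughput follows from dividing total images/sec by instance count. A small sketch using only the numbers in the table:

```python
# (system, total cores, imgs/sec, instances, cores per instance)
rows = [
    ("8280", 56, 3689, 14, 4),
    ("9242", 96, 5943, 24, 4),
    ("9282", 112, 7736, 28, 4),
]
for name, cores, ips, n, cpi in rows:
    assert n * cpi == cores            # instances tile all cores exactly
    per_instance = ips / n             # throughput each instance sustains
    print(f"Xeon {name}: {per_instance:.1f} imgs/sec per 4-core instance")
```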
Max throughput under a 7 ms latency bound (ResNet-50, INT8):

System | Cores | Images/sec | Latency | Instances | Cores per instance | Batch per instance
---|---|---|---|---|---|---|
Intel Xeon Platinum 8280 (2S) | 56 | 3248 | 6.16ms | 2 | 28 | 10 |
Intel Xeon Platinum 9242 (2S) | 96 | 4637 | 6.90ms | 4 | 24 | 8 |
Intel Xeon Platinum 9282 (2S) | 112 | 6950 | 6.91ms | 4 | 28 | 12 |
- Configuration: INT8, MKL-DNN 0.17, 2-socket, HT/Turbo on, Intel-optimized Caffe, ResNet-50, tested on 3/04/2019
- Source: Intel's page
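The reported latencies in the table above are consistent with the time one instance takes to process one batch: latency ≈ batch / (throughput / instances). A quick check, assuming only the numbers in the table:

```python
# (system, imgs/sec, reported latency ms, instances, batch per instance)
rows = [
    ("8280", 3248, 6.16, 2, 10),
    ("9242", 4637, 6.90, 4, 8),
    ("9282", 6950, 6.91, 4, 12),
]
for name, ips, reported_ms, n, batch in rows:
    per_instance = ips / n                       # imgs/sec per instance
    derived_ms = 1000.0 * batch / per_instance   # time for one batch
    print(f"Xeon {name}: derived {derived_ms:.2f} ms vs reported {reported_ms} ms")
    assert abs(derived_ms - reported_ms) < 0.05  # matches to rounding
```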
- Config: Tested on Feb 2019, Intel Xeon Platinum 8180 (Skylake, 28 core), TensorFlow
- Dataset: Inception V3: synthetic data. ResNet-50: synthetic data. NCF: MovieLens 1M. Transformer-LT: English-German. Mask R-CNN: MS COCO 2014. SSD-MobileNet: MS COCO 2017.
- Source: Intel's page
- Inference using FP32; batch size 128 for Caffe* GoogLeNet v1 and 256 for AlexNet.
- Tested by Intel as of 6/7/2018. Platform: 2-socket Intel® Xeon® Platinum 8180 processor, 2.50 GHz / 28 cores.
- Measured: 1449 imgs/sec
- 4 instances of the framework
- Intel optimized Caffe
- Topology: GoogLeNet v1
- Intel MKL-DNN version: 464c268e544bae26f9b85a2acb9122c766a4c396, no data layer.
- Tested by Intel as of 06/15/2018. Platform: 2S Intel® Xeon® processor E5-2699 v3, 2.30 GHz / 18 cores.
- HT: enabled
- Turbo: disabled; scaling governor set to “performance” via the intel_pstate driver
- CentOS Linux 7.5.1804 (Core), kernel 3.10.0-862.3.2.el7.x86_64
- Framework: Berkeley Vision and Learning Center (BVLC) Caffe (https://github.com/BVLC/caffe); inference & training measured with the “caffe time” command.
- For “ConvNet” topologies, a dummy dataset was used.
- For other topologies, data was stored on local storage and cached in memory before training.
- BVLC Caffe (http://github.com/BVLC/caffe), revision 2a1c552b66f026c7508d390b526f2495ed3be594.
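The “caffe time” measurement style referenced above amounts to averaging forward-pass wall time over many iterations after a warmup. A generic sketch of that methodology, not tied to Caffe (the workload lambda is an arbitrary stand-in):

```python
import time

def avg_ms(fn, iters=50, warmup=5):
    """Average wall-clock milliseconds per call to fn, after warmup runs."""
    for _ in range(warmup):
        fn()                          # warm caches / JIT / allocator
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return 1000.0 * (time.perf_counter() - t0) / iters

# Example usage on a trivial stand-in workload
print(f"{avg_ms(lambda: sum(range(10000))):.3f} ms per call")
```

Averaging over many iterations, as “caffe time” does, smooths out scheduler and frequency-scaling noise that a single measurement would capture.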