
oneDNN performance


oneDNN x86 performance

Observations for RESNET-50

| Observation | Guidance |
|---|---|
| Max inference throughput | Use 4 cores per instance, with each instance processing 8-10 images per batch |
| Max inference throughput (<7ms latency) | Use 24-28 cores per instance, with each instance processing 8-12 images per batch |
| INT8 vs FP32 | ~4x scaling (3.9x for TensorFlow, 3.7x for PyTorch) |
| oneDNN vs Eigen on TensorFlow | ~2x faster at batch sizes 1 and 128 |
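
As a rough sketch of the first observation, a single 4-core inference instance might be configured as follows. This assumes TensorFlow 2.x built with oneDNN (e.g. the intel-tensorflow package); the OpenMP settings follow Intel's general CPU-tuning guidance, and the exact values should be tuned per system:

```python
# Hypothetical per-instance setup for the "4 cores per instance" layout above.
import os

# OpenMP/KMP settings are read when the oneDNN-enabled TensorFlow build loads,
# so set them before importing TensorFlow.
os.environ["OMP_NUM_THREADS"] = "4"    # one thread per physical core
os.environ["KMP_BLOCKTIME"] = "1"      # ms a thread spins before sleeping
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf

tf.config.threading.set_intra_op_parallelism_threads(4)  # cores per instance
tf.config.threading.set_inter_op_parallelism_threads(1)  # single op-scheduling pool
```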

Performance metrics

Max Inference throughput (RESNET-50, INT8)

| System | Cores | Images/sec | Instances | Cores per instance | Batch per instance |
|---|---|---|---|---|---|
| Intel Xeon Platinum 8280 (2S) | 56 | 3689 | 14 | 4 | 10 |
| Intel Xeon Platinum 9242 (2S) | 96 | 5943 | 24 | 4 | 8 |
| Intel Xeon Platinum 9282 (2S) | 112 | 7736 | 28 | 4 | 8 |
  • Configuration: INT8, MKLDNN 0.17, 2-socket, HT/Turbo On, Intel's Caffe, RESNET-50, Tested on 3/04/2019
  • Source: Intel's page
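
The 14-instance × 4-core layout measured on the 56-core 8280 above can be reproduced with a small launcher. A minimal sketch, assuming Linux (for taskset) and a hypothetical single-instance script infer.py that applies the per-instance settings shown earlier:

```python
# Hypothetical launcher: one inference process pinned to each 4-core slice.
import subprocess

CORES_PER_INSTANCE = 4
TOTAL_CORES = 56  # 2-socket Xeon 8280

procs = []
for i in range(TOTAL_CORES // CORES_PER_INSTANCE):  # 14 instances
    first = i * CORES_PER_INSTANCE
    last = first + CORES_PER_INSTANCE - 1
    # Pin each process to its own core range so instances don't contend.
    procs.append(subprocess.Popen(
        ["taskset", "-c", f"{first}-{last}", "python", "infer.py"]))

for p in procs:
    p.wait()
```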

Max Inference throughput (RESNET-50, INT8) (Latency <7ms)

| System | Cores | Images/sec | Latency | Instances | Cores per instance | Batch per instance |
|---|---|---|---|---|---|---|
| Intel Xeon Platinum 8280 (2S) | 56 | 3248 | 6.16ms | 2 | 28 | 10 |
| Intel Xeon Platinum 9242 (2S) | 96 | 4637 | 6.90ms | 4 | 24 | 8 |
| Intel Xeon Platinum 9282 (2S) | 112 | 6950 | 6.91ms | 4 | 28 | 12 |
  • Configuration: INT8, MKLDNN 0.17, 2-socket, HT/Turbo On, Intel's Caffe, RESNET-50, Tested on 3/04/2019
  • Source: Intel's page
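
The latency column follows directly from the throughput figures: per-instance latency ≈ batch size ÷ (total images/sec ÷ instances). A quick check against the table above:

```python
# Reproduce the latency column from throughput, instance count, and batch size.
rows = [
    ("Xeon Platinum 8280 (2S)", 3248, 2, 10),  # imgs/sec, instances, batch
    ("Xeon Platinum 9242 (2S)", 4637, 4, 8),
    ("Xeon Platinum 9282 (2S)", 6950, 4, 12),
]
for name, imgs_per_sec, instances, batch in rows:
    latency_ms = 1000 * batch / (imgs_per_sec / instances)
    print(f"{name}: {latency_ms:.2f} ms")  # 6.16, 6.90, 6.91 -- matches the table
```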

Inference using different frameworks (INT8 vs FP32)
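
The ~4x INT8-over-FP32 scaling noted above requires the model to be quantized first. A minimal post-training static quantization sketch for ResNet-50 in PyTorch (eager-mode quantization API of this page's era, x86 fbgemm backend; the random calibration batch is a stand-in for real representative data):

```python
import torch
from torchvision.models.quantization import resnet50

# Quantizable ResNet-50; fuse conv+bn+relu blocks so they quantize as one op.
model = resnet50(pretrained=True, quantize=False).eval()
model.fuse_model()

# Post-training static quantization with the x86 (fbgemm) backend.
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
torch.quantization.prepare(model, inplace=True)

# Calibrate activation observers (random data here as a stand-in).
with torch.no_grad():
    model(torch.randn(8, 3, 224, 224))

torch.quantization.convert(model, inplace=True)  # weights/activations -> INT8
```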

Comparison of oneDNN vs Eigen

Latency performance of TensorFlow inference (Batch size 1)

  • Source: Intel's page
  • Config: Tested on Feb 2019, Intel Xeon Platinum 8180 (Skylake, 28 core), TensorFlow
  • Dataset: Inception V3: synthetic data. ResNet-50: synthetic data. NCF: MovieLens 1M. Transformer-LT: English-German. Mask R-CNN: MS COCO 2014. SSD-MobileNet: MS COCO 2017.

Throughput performance of TensorFlow inference (Batch size > 1)

  • Source: Intel's page
  • Config: Tested on Feb 2019, Intel Xeon Platinum 8180 (Skylake, 28 core), TensorFlow
  • Dataset: Inception V3: synthetic data. ResNet-50: synthetic data. NCF: MovieLens 1M. Transformer-LT: English-German. Mask R-CNN: MS COCO 2014. SSD-MobileNet: MS COCO 2017.

Inference throughput (Skylake vs Haswell)

Source: https://software.intel.com/content/www/us/en/develop/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference.html

  • Inference using FP32. Batch sizes: Caffe* GoogLeNet v1: 128, AlexNet: 256.

  • Tested by Intel as of 6/7/2018. Platform: 2-socket Intel® Xeon® Platinum 8180 processor, 2.50GHz / 28 cores

    • Measured: 1449 imgs/sec
    • 4 instances of the framework
    • Intel-optimized Caffe
    • Topology: GoogLeNet v1
    • Intel MKL-DNN version: 464c268e544bae26f9b85a2acb9122c766a4c396, no data layer
  • Tested by Intel as of 06/15/2018. Platform: 2S Intel® Xeon® processor E5-2699 v3, 2.30GHz / 18 cores

    • HT: enabled
    • Turbo: disabled, scaling governor set to "performance" via intel_pstate driver
    • CentOS Linux 7.5.1804 (Core), kernel 3.10.0-862.3.2.el7.x86_64
    • Framework: Berkeley Vision and Learning Center (BVLC) Caffe (https://github.com/BVLC/caffe), revision 2a1c552b66f026c7508d390b526f2495ed3be594
    • Inference & training measured with the "caffe time" command
    • For "ConvNet" topologies, a dummy dataset was used
    • For other topologies, data was stored on local storage and cached in memory before training