
oneDNN performance


oneDNN x86 performance

Observations for RESNET-50

| Observation | Guidance |
|---|---|
| Max inference throughput | Use 4 cores per instance, with each instance processing 8-10 images per batch |
| Max inference throughput (<7ms latency) | Use 24-28 cores per instance, with each instance processing 8-12 images per batch |
| INT8 vs FP32 | ~4x scaling (3.9x for TensorFlow, 3.7x for PyTorch) |
| oneDNN vs Eigen on TensorFlow | ~2x faster at batch sizes 1 and 128 |
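
As a rough sketch of the first observation, a single 4-core inference instance might be configured as follows. This assumes TensorFlow 2.x built with oneDNN (e.g. the intel-tensorflow package); the OpenMP settings follow Intel's general CPU-tuning guidance, and the exact values should be tuned per system:

```python
# Hypothetical per-instance setup for the "4 cores per instance" layout above.
import os

# OpenMP/KMP settings are read when the oneDNN-enabled TensorFlow build loads,
# so set them before importing TensorFlow.
os.environ["OMP_NUM_THREADS"] = "4"    # one thread per physical core
os.environ["KMP_BLOCKTIME"] = "1"      # ms a thread spins before sleeping
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf

tf.config.threading.set_intra_op_parallelism_threads(4)  # cores per instance
tf.config.threading.set_inter_op_parallelism_threads(1)  # single op-scheduling pool
```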

Performance metrics

Max Inference throughput (RESNET-50, INT8)

| System | Cores | Images/sec | Instances | Cores per instance | Batch per instance |
|---|---|---|---|---|---|
| Intel Xeon Platinum 8280 (2S) | 56 | 3689 | 14 | 4 | 10 |
| Intel Xeon Platinum 9242 (2S) | 96 | 5943 | 24 | 4 | 8 |
| Intel Xeon Platinum 9282 (2S) | 112 | 7736 | 28 | 4 | 8 |
  • Configuration: INT8, MKLDNN 0.17, 2-socket, HT/Turbo On, Intel's Caffe, RESNET-50, Tested on 3/04/2019
  • Source: Intel's page
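
The 14-instance × 4-core layout measured on the 56-core 8280 above can be reproduced with a small launcher. A minimal sketch, assuming Linux (for taskset) and a hypothetical single-instance script infer.py that applies the per-instance settings shown earlier:

```python
# Hypothetical launcher: one inference process pinned to each 4-core slice.
import subprocess

CORES_PER_INSTANCE = 4
TOTAL_CORES = 56  # 2-socket Xeon 8280

procs = []
for i in range(TOTAL_CORES // CORES_PER_INSTANCE):  # 14 instances
    first = i * CORES_PER_INSTANCE
    last = first + CORES_PER_INSTANCE - 1
    # Pin each process to its own core range so instances don't contend.
    procs.append(subprocess.Popen(
        ["taskset", "-c", f"{first}-{last}", "python", "infer.py"]))

for p in procs:
    p.wait()
```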

Max Inference throughput (RESNET-50, INT8) (Latency <7ms)

| System | Cores | Images/sec | Latency | Instances | Cores per instance | Batch per instance |
|---|---|---|---|---|---|---|
| Intel Xeon Platinum 8280 (2S) | 56 | 3248 | 6.16ms | 2 | 28 | 10 |
| Intel Xeon Platinum 9242 (2S) | 96 | 4637 | 6.90ms | 4 | 24 | 8 |
| Intel Xeon Platinum 9282 (2S) | 112 | 6950 | 6.91ms | 4 | 28 | 12 |
  • Configuration: INT8, MKLDNN 0.17, 2-socket, HT/Turbo On, Intel's Caffe, RESNET-50, Tested on 3/04/2019
  • Source: Intel's page
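
The latency column follows directly from the throughput figures: per-instance latency ≈ batch size ÷ (total images/sec ÷ instances). A quick check against the table above:

```python
# Reproduce the latency column from throughput, instance count, and batch size.
rows = [
    ("Xeon Platinum 8280 (2S)", 3248, 2, 10),  # imgs/sec, instances, batch
    ("Xeon Platinum 9242 (2S)", 4637, 4, 8),
    ("Xeon Platinum 9282 (2S)", 6950, 4, 12),
]
for name, imgs_per_sec, instances, batch in rows:
    latency_ms = 1000 * batch / (imgs_per_sec / instances)
    print(f"{name}: {latency_ms:.2f} ms")  # 6.16, 6.90, 6.91 -- matches the table
```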

Inference using different frameworks (INT8 vs FP32)
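
The ~4x INT8-over-FP32 scaling noted above requires the model to be quantized first. A minimal post-training static quantization sketch for ResNet-50 in PyTorch (eager-mode quantization API of this page's era, x86 fbgemm backend; the random calibration batch is a stand-in for real representative data):

```python
import torch
from torchvision.models.quantization import resnet50

# Quantizable ResNet-50; fuse conv+bn+relu blocks so they quantize as one op.
model = resnet50(pretrained=True, quantize=False).eval()
model.fuse_model()

# Post-training static quantization with the x86 (fbgemm) backend.
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
torch.quantization.prepare(model, inplace=True)

# Calibrate activation observers (random data here as a stand-in).
with torch.no_grad():
    model(torch.randn(8, 3, 224, 224))

torch.quantization.convert(model, inplace=True)  # weights/activations -> INT8
```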

Comparison of oneDNN vs Eigen

Latency performance of TensorFlow inference (Batch size 1)

  • Source: Intel's page
  • Config: Tested on Feb 2019, Intel Xeon Platinum 8180 (Skylake, 28 core), TensorFlow
  • Dataset: Inception V3: synthetic data. ResNet-50: synthetic data. NCF: MovieLens 1M. Transformer-LT: English-German. Mask R-CNN: MS COCO 2014. SSD-MobileNet: MS COCO 2017.

Throughput performance of TensorFlow inference (Batch size > 1)

  • Source: Intel's page
  • Config: Tested on Feb 2019, Intel Xeon Platinum 8180 (Skylake, 28 core), TensorFlow
  • Dataset: Inception V3: synthetic data. ResNet-50: synthetic data. NCF: MovieLens 1M. Transformer-LT: English-German. Mask R-CNN: MS COCO 2014. SSD-MobileNet: MS COCO 2017.

Inference throughput (Skylake vs Haswell)

Source: https://software.intel.com/content/www/us/en/develop/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference.html

  • Inference using FP32. Batch sizes: Caffe* GoogLeNet v1: 128, AlexNet: 256.

  • Tested by Intel as of 6/7/2018. Platform: 2-socket Intel® Xeon® Platinum 8180 processor, 2.50GHz / 28 cores

    • Measured: 1449 imgs/sec
    • 4 instances of the framework
    • Intel-optimized Caffe
    • Topology: GoogLeNet v1
    • Intel MKL-DNN version: 464c268e544bae26f9b85a2acb9122c766a4c396, no data layer
  • Tested by Intel as of 06/15/2018. Platform: 2S Intel® Xeon® processor E5-2699 v3, 2.30GHz / 18 cores

    • HT: enabled
    • Turbo: disabled, scaling governor set to "performance" via intel_pstate driver
    • CentOS Linux 7.5.1804 (Core), kernel 3.10.0-862.3.2.el7.x86_64
    • Framework: Berkeley Vision and Learning Center (BVLC) Caffe (https://github.com/BVLC/caffe), revision 2a1c552b66f026c7508d390b526f2495ed3be594
    • Inference & training measured with the "caffe time" command
    • For "ConvNet" topologies, a dummy dataset was used
    • For other topologies, data was stored on local storage and cached in memory before training