This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Upgrading MKLDNN to 1.0 causes performance regression. #16891

Status: Closed
leleamol opened this issue Nov 22, 2019 · 36 comments

@leleamol (Contributor)

Description

The change that upgraded MKLDNN to 1.0 caused training throughput (images/sec) to drop by roughly 200 images/sec.

Error Message

The throughput (images/sec) during training dropped to 1300 images/sec.
Prior to this change, the throughput was in the range of 1500-1530 images/sec.

To Reproduce

The attached tarball contains the training script that trains a resnet18_v2 network on the Cifar10 dataset:
image_classification.tar.gz
The above numbers were measured on a c5.18xlarge Ubuntu instance.

Steps to reproduce

  1. Build and install the mxnet-mkl pip wheel that contains the above changes on the test machine.
  2. Unzip the attached tarball on the test machine.
  3. Install psutil and gluoncv, and export the KMP_AFFINITY and OMP_NUM_THREADS variables as below:
pip install psutil gluoncv
export KMP_AFFINITY='granularity=fine,compact,1,0' && export OMP_NUM_THREADS=36
  4. Run the following command to start the training:
python deeplearning-benchmark/image_classification/image_classification.py --model resnet18_v2 --dataset cifar10 --mode symbolic --gpus 0 --epochs 25 --log-interval 50 --kvstore local --dtype='float32' --batch-size=64
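For convenience, the same steps as a single shell sketch (the wheel filename on the first line is an assumption; substitute the wheel built in step 1):

pip install mxnet_mkl-*.whl
tar -xzf image_classification.tar.gz
pip install psutil gluoncv
export KMP_AFFINITY='granularity=fine,compact,1,0'
export OMP_NUM_THREADS=36
python deeplearning-benchmark/image_classification/image_classification.py \
    --model resnet18_v2 --dataset cifar10 --mode symbolic --gpus 0 \
    --epochs 25 --log-interval 50 --kvstore local --dtype='float32' --batch-size=64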

The sample output looks like this:

/usr/local/lib/python2.7/dist-packages/mxnet/numpy_op_signature.py:61: UserWarning: Some mxnet.numpy operator signatures may not be displayed consistently with their counterparts in the official NumPy package due to too-low Python version 2.7.12 (default, Oct  8 2019, 14:14:10)
[GCC 5.4.0 20160609]. Python >= 3.5 is required to make the signatures display correctly.
  .format(str(sys.version)))
Namespace(batch_norm=False, batch_size=64, benchmark=False, dataset='cifar10', dtype='float32', epochs=25, gpus=0, kvstore='local', log_interval=50, lr=0.01, mode='symbolic', model='resnet18_v2', seed=123, use_pretrained=False, use_thumbnail=False, wd=0.0001)
[01:23:04] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[01:23:04] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[01:23:04] src/executor/graph_executor.cc:1936: Subgraph backend MKLDNN is activated.
INFO:root:Epoch[0] Batch [0-50] Speed: 892.55 samples/sec       accuracy=0.288909
INFO:root:Epoch[0] Batch [50-100]       Speed: 1390.86 samples/sec      accuracy=0.390625
INFO:root:Epoch[0] Batch [100-150]      Speed: 987.58 samples/sec       accuracy=0.421250
INFO:root:Epoch[0] Batch [150-200]      Speed: 1407.58 samples/sec      accuracy=0.440312
INFO:root:Epoch[0] Batch [200-250]      Speed: 1310.79 samples/sec      accuracy=0.468438
INFO:root:Epoch[0] Batch [250-300]      Speed: 1331.61 samples/sec      accuracy=0.500313
INFO:root:Epoch[0] Batch [300-350]      Speed: 1420.91 samples/sec      accuracy=0.522500
INFO:root:Epoch[0] Batch [350-400]      Speed: 1469.40 samples/sec      accuracy=0.527813
INFO:root:Epoch[0] Batch [400-450]      Speed: 1195.95 samples/sec      accuracy=0.550312
INFO:root:Epoch[0] Batch [450-500]      Speed: 1146.35 samples/sec      accuracy=0.573125
INFO:root:Epoch[0] Batch [500-550]      Speed: 1543.27 samples/sec      accuracy=0.568125
INFO:root:Epoch[0] Batch [550-600]      Speed: 1251.45 samples/sec      accuracy=0.574688
INFO:root:Epoch[0] Batch [600-650]      Speed: 1303.13 samples/sec      accuracy=0.602187
INFO:root:Epoch[0] Batch [650-700]      Speed: 1283.89 samples/sec      accuracy=0.618750
INFO:root:Epoch[0] Batch [700-750]      Speed: 955.70 samples/sec       accuracy=0.607187
INFO:root:Epoch[0] Train-accuracy=0.514007

Environment

  1. c5.18xlarge
  2. Ubuntu 14.04 LTS
leleamol added the Bug label on Nov 22, 2019
@leleamol (Contributor, Author)

#16555

@TaoLv @pengzhao-intel @zixuanweeei @samskalicky

@samskalicky (Contributor)

@mxnet-label-bot add [R1.6.0]

@TaoLv (Member) commented Nov 23, 2019

@leleamol How did you install the mxnet package, from source code or from the nightly build? If you built from source, could you please share the make command line as well? #16555 removed the libiomp5 library from the mxnet default build to comply with Apache License requirements. That could be the reason for this issue, but I still need to reproduce it to confirm. If possible, could you please try building mxnet with USE_BLAS=mkl? It will pull in the libiomp5 library. To install MKL BLAS, please refer to https://github.com/apache/incubator-mxnet/blob/master/ci/docker/install/ubuntu_mkl.sh. Thanks!
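For reference, a minimal sketch of that suggested build (the /opt/intel install path and the exact make flags are assumptions, based on the defaults used elsewhere in this thread):

sudo bash ci/docker/install/ubuntu_mkl.sh      # installs MKL BLAS under /opt/intel
make -j"$(nproc)" USE_MKLDNN=1 USE_BLAS=mkl USE_INTEL_PATH=/opt/intel/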

@pengzhao-intel (Contributor)

Our test results are in #16845 (comment).

@leleamol (Contributor, Author)

@TaoLv I built the mxnet package from source.

I followed the instructions mentioned in the README.md.

I just put them in script form for quicker execution, as below.

To build the mkl variant, invoke the following script with "mkl" as the command-line parameter.

#!/usr/bin/env bash
# Build an mxnet pip wheel; pass the variant (e.g. "mkl") as the first argument.

CURRENT_DIR=$(pwd)
echo "$CURRENT_DIR"
PIP_BUILD=$HOME/pip_build
MXNET_BUILD=$PIP_BUILD/mxnet-build
cd "$HOME"

mkdir "$PIP_BUILD"
mv "$HOME/incubator-mxnet" "$MXNET_BUILD"
cd "$MXNET_BUILD"
echo "Building mxnet."
source tools/staticbuild/build.sh "$1" pip

# Stage the pip packaging files next to the build output and build the wheel.
cd "$PIP_BUILD"
cp -r "$MXNET_BUILD/tools/pip/." .
export mxnet_variant=$1
python setup.py bdist_wheel
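Hypothetical usage, assuming the script above is saved as build_pip.sh (the script name and the wheel output path are assumptions; bdist_wheel writes into a dist/ subdirectory of the working directory):

./build_pip.sh mkl
pip install "$HOME"/pip_build/dist/mxnet*.whl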

@samskalicky (Contributor)

@zachgk assign [@apeforest ]

@rongzha1 (Contributor) commented Nov 26, 2019

I ran CPU tests on both v1.5.x and v1.6.x with MKLDNN + OpenBLAS, but no regression was found.
So can you try USE_BLAS=mkl, as @TaoLv suggested above, and test again?

I tried to use build.sh, but it failed with: CMake Error at simd/CMakeLists.txt:41 (enable_language):
No CMAKE_ASM_NASM_COMPILER could be found.
So for v1.5 and v1.6 I built with:
make -j USE_MKLDNN=1 USE_BLAS=openblas USE_GPERFTOOLS=0
and set the OpenBLAS include and lib directories (sketched below).
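For reference, a sketch of that make line with the OpenBLAS directories passed explicitly; the ADD_CFLAGS/ADD_LDFLAGS variables appear in the makefiles quoted later in this thread, but these paths are assumptions, so adjust to your system:

make -j USE_MKLDNN=1 USE_BLAS=openblas USE_GPERFTOOLS=0 \
     ADD_CFLAGS="-I/usr/include/openblas" ADD_LDFLAGS="-L/usr/lib64"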
platform: skx-8180
1.5:
[rongzha1@mlt-ace ds2_training_inference]$ cd mxnet_1.5/
[rongzha1@mlt-ace mxnet_1.5]$ ldd lib/libmxnet.so | grep open
libopenblas.so.0 => /lib64/libopenblas.so.0 (0x00007f8db5ff9000)
libopencv_highgui.so.2.4 => /lib64/libopencv_highgui.so.2.4 (0x00007f8dacdaf000)
libopencv_imgproc.so.2.4 => /lib64/libopencv_imgproc.so.2.4 (0x00007f8dac931000)
libopencv_core.so.2.4 => /lib64/libopencv_core.so.2.4 (0x00007f8dac4f7000)
[rongzha1@mlt-ace mxnet_1.5]$ ldd lib/libmxnet.so | grep mkl
libmklml_intel.so => /home/rongzha1/project/mxnet/ds2_training_inference/mxnet_1.5/lib/libmklml_intel.so (0x00007f9707c8d000)
libmkldnn.so.0 => /home/rongzha1/project/mxnet/ds2_training_inference/mxnet_1.5/lib/libmkldnn.so.0 (0x00007f970671d000)
(mxnet) [rongzha1@mlt-ace mxnet_1.5]$ ldd lib/libmxnet.so | grep omp
libiomp5.so => /home/rongzha1/project/mxnet/ds2_training_inference/mxnet_1.5/lib/libiomp5.so (0x00007f75cbc42000)
libXcomposite.so.1 => /lib64/libXcomposite.so.1 (0x00007f75c2647000)

1.6.x:
[rongzha1@mlt-skx141 perf_regression]$ ldd lib/libmxnet.so | grep open
libopenblas.so.0 => /usr/lib64/libopenblas.so.0 (0x00007fc101c03000)
libopencv_highgui.so.2.4 => /usr/lib64/libopencv_highgui.so.2.4 (0x00007fc1004cf000)
libopencv_imgproc.so.2.4 => /usr/lib64/libopencv_imgproc.so.2.4 (0x00007fc100051000)
libopencv_core.so.2.4 => /usr/lib64/libopencv_core.so.2.4 (0x00007fc0ffc18000)
[rongzha1@mlt-skx141 perf_regression]$ ldd lib/libmxnet.so | grep mkl
libmkldnn.so.1 => /home/rongzha1/project/mxnet/ds2_training_inference/perf_regression/lib/libmkldnn.so.1 (0x00007f8378240000)
[rongzha1@mlt-skx141 perf_regression]$ ldd lib/libmxnet.so | grep omp
libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x00007f1357b17000)
libXcomposite.so.1 => /usr/lib64/libXcomposite.so.1 (0x00007f13509a1000)

v1.5.x:
OMP=56
[21:43:26] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[21:43:26] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
INFO:root:Epoch[0] Batch [0-50] Speed: 1668.60 samples/sec accuracy=0.273897
INFO:root:Epoch[0] Batch [50-100] Speed: 1699.64 samples/sec accuracy=0.380312
INFO:root:Epoch[0] Batch [100-150] Speed: 1692.57 samples/sec accuracy=0.425000
INFO:root:Epoch[0] Batch [150-200] Speed: 1696.67 samples/sec accuracy=0.444063
INFO:root:Epoch[0] Batch [200-250] Speed: 1698.27 samples/sec accuracy=0.465000
INFO:root:Epoch[0] Batch [250-300] Speed: 1693.87 samples/sec accuracy=0.497812
INFO:root:Epoch[0] Batch [300-350] Speed: 1698.26 samples/sec accuracy=0.505625
INFO:root:Epoch[0] Batch [350-400] Speed: 1691.21 samples/sec accuracy=0.520000
INFO:root:Epoch[0] Batch [400-450] Speed: 1694.42 samples/sec accuracy=0.538750
INFO:root:Epoch[0] Batch [450-500] Speed: 1693.73 samples/sec accuracy=0.576875
INFO:root:Epoch[0] Batch [500-550] Speed: 1688.67 samples/sec accuracy=0.579063
INFO:root:Epoch[0] Batch [550-600] Speed: 1686.91 samples/sec accuracy=0.585313
INFO:root:Epoch[0] Batch [600-650] Speed: 1691.39 samples/sec accuracy=0.605313
INFO:root:Epoch[0] Batch [650-700] Speed: 1693.22 samples/sec accuracy=0.612812
INFO:root:Epoch[0] Batch [700-750] Speed: 1692.32 samples/sec accuracy=0.603750
INFO:root:Epoch[0] Train-accuracy=0.511549
INFO:root:Epoch[0] Time cost=29.955
INFO:root:Epoch[0] Validation-accuracy=0.642317

OMP=36
[22:10:31] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[22:10:31] src/io/iter_image_recordio_2.cc:172: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
INFO:root:Epoch[0] Batch [0-50] Speed: 1969.98 samples/sec accuracy=0.279412
INFO:root:Epoch[0] Batch [50-100] Speed: 2014.50 samples/sec accuracy=0.380937
INFO:root:Epoch[0] Batch [100-150] Speed: 2009.43 samples/sec accuracy=0.428125
INFO:root:Epoch[0] Batch [150-200] Speed: 2013.70 samples/sec accuracy=0.450313
INFO:root:Epoch[0] Batch [200-250] Speed: 2012.61 samples/sec accuracy=0.460625
INFO:root:Epoch[0] Batch [250-300] Speed: 2014.29 samples/sec accuracy=0.497812
INFO:root:Epoch[0] Batch [300-350] Speed: 2013.60 samples/sec accuracy=0.505000
INFO:root:Epoch[0] Batch [350-400] Speed: 2009.98 samples/sec accuracy=0.532500
INFO:root:Epoch[0] Batch [400-450] Speed: 2014.39 samples/sec accuracy=0.557500
INFO:root:Epoch[0] Batch [450-500] Speed: 2015.02 samples/sec accuracy=0.576250
INFO:root:Epoch[0] Batch [500-550] Speed: 2015.25 samples/sec accuracy=0.577187
INFO:root:Epoch[0] Batch [550-600] Speed: 2012.03 samples/sec accuracy=0.581250
INFO:root:Epoch[0] Batch [600-650] Speed: 2014.64 samples/sec accuracy=0.608437
INFO:root:Epoch[0] Batch [650-700] Speed: 2017.28 samples/sec accuracy=0.616563
INFO:root:Epoch[0] Batch [700-750] Speed: 2017.49 samples/sec accuracy=0.604688
INFO:root:Epoch[0] Train-accuracy=0.514086
INFO:root:Epoch[0] Time cost=24.895
INFO:root:Epoch[0] Validation-accuracy=0.635052

v1.6.x:
OMP = 36
[22:02:24] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[22:02:25] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[22:02:25] src/executor/graph_executor.cc:1979: Subgraph backend MKLDNN is activated.
/home/rongzha1/anaconda3/envs/mxnet/lib/python3.6/site-packages/scipy/__init__.py:115: UserWarning: Numpy 1.13.3 or above is required for this version of scipy (detected version 1.13.1)
UserWarning)
INFO:root:Epoch[0] Batch [0-50] Speed: 2119.74 samples/sec accuracy=0.280025
INFO:root:Epoch[0] Batch [50-100] Speed: 2161.65 samples/sec accuracy=0.392500
INFO:root:Epoch[0] Batch [100-150] Speed: 2145.79 samples/sec accuracy=0.425938
INFO:root:Epoch[0] Batch [150-200] Speed: 2145.72 samples/sec accuracy=0.448125
INFO:root:Epoch[0] Batch [200-250] Speed: 2158.03 samples/sec accuracy=0.461250
INFO:root:Epoch[0] Batch [250-300] Speed: 2151.47 samples/sec accuracy=0.498125
INFO:root:Epoch[0] Batch [300-350] Speed: 2157.60 samples/sec accuracy=0.515312
INFO:root:Epoch[0] Batch [350-400] Speed: 2133.91 samples/sec accuracy=0.530625
INFO:root:Epoch[0] Batch [400-450] Speed: 2143.35 samples/sec accuracy=0.545625
INFO:root:Epoch[0] Batch [450-500] Speed: 2153.24 samples/sec accuracy=0.577187
INFO:root:Epoch[0] Batch [500-550] Speed: 2154.20 samples/sec accuracy=0.577500
INFO:root:Epoch[0] Batch [550-600] Speed: 2151.89 samples/sec accuracy=0.580625
INFO:root:Epoch[0] Batch [600-650] Speed: 2162.29 samples/sec accuracy=0.596250
INFO:root:Epoch[0] Batch [650-700] Speed: 2161.74 samples/sec accuracy=0.609062
INFO:root:Epoch[0] Batch [700-750] Speed: 2156.80 samples/sec accuracy=0.597812
INFO:root:Epoch[0] Train-accuracy=0.512828
INFO:root:Epoch[0] Time cost=23.642
INFO:root:Epoch[0] Validation-accuracy=0.613455

@ptrendx (Member) commented Nov 27, 2019

Considering @rongzha1's comment, I don't consider this issue to be a blocker for the 1.6 release. Please comment if you disagree, @leleamol @samskalicky.

@samskalicky (Contributor)

@ptrendx @rongzha1 @PatricZhao thanks for looking into this, but the issue is not resolved until we verify by running the script @leleamol shared. build.sh is the script used to generate the pip wheels; using make doesn't follow the same steps and won't reproduce the problem.

If you can't reproduce the build using the same scripts, I can share a pre-built pip wheel with you separately.

@samskalicky (Contributor)

Regarding the following error:

No CMAKE_ASM_NASM_COMPILER could be found.

you can install it with: sudo apt-get install nasm

@rongzha1 (Contributor)

Hi @samskalicky, I used an AWS Deep Learning AMI on c5.18xlarge with Ubuntu 14.04, the same as yours,
and used the script @leleamol shared to build mxnet:

  1. mxnet 1.5:
    git checkout v1.5.x (commit c981848)
    During training, an error occurs:
    mxnet.base.MXNetError: [08:18:23] src/operator/nn/mkldnn/mkldnn_base.cc:372: Unknown MKLDNN format for 4 dimensions: 53
    So which version did you use? What's the commit id?

  2. mxnet 1.6:
    git checkout v1.6.x (commit 200f0ec)
    With both the script build and the make build, training speed is about 1700 samples/sec.

I cannot reproduce the performance regression issue.

Details:
Building mxnet with the script @leleamol shared hit 2 minor issues:

  1. Script error: sh cannot recognize the 'source' command in "source tools/staticbuild/build.sh $1 pip";
    removing 'source' makes it work.
  2. Link error: can't find /usr/lib/gcc/x86_64-linux-gnu/5/libgfortran.so.
    Linking against the gcc5 lib works:
    ln -s /usr/lib/gcc/x86_64-linux-gnu/5/libgfortran.so /usr/lib/gcc/x86_64-linux-gnu/4.8/libgfortran.so
    After the build: cd mxnet-build/python && python setup.py install
    then run the cifar training.

The result is as follows:
[08:45:29] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[08:45:29] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[08:45:29] src/executor/graph_executor.cc:1984: Subgraph backend MKLDNN is activated.
INFO:root:Epoch[0] Batch [0-50] Speed: 1444.97 samples/sec accuracy=0.267770
INFO:root:Epoch[0] Batch [50-100] Speed: 1657.16 samples/sec accuracy=0.381563
INFO:root:Epoch[0] Batch [100-150] Speed: 1629.53 samples/sec accuracy=0.423438
INFO:root:Epoch[0] Batch [150-200] Speed: 1686.67 samples/sec accuracy=0.441875
INFO:root:Epoch[0] Batch [200-250] Speed: 1671.42 samples/sec accuracy=0.462187
INFO:root:Epoch[0] Batch [250-300] Speed: 1723.94 samples/sec accuracy=0.510000
INFO:root:Epoch[0] Batch [300-350] Speed: 1699.66 samples/sec accuracy=0.507500
INFO:root:Epoch[0] Batch [350-400] Speed: 1665.39 samples/sec accuracy=0.523125
INFO:root:Epoch[0] Batch [400-450] Speed: 1724.03 samples/sec accuracy=0.531250
INFO:root:Epoch[0] Batch [450-500] Speed: 1723.66 samples/sec accuracy=0.577187
INFO:root:Epoch[0] Batch [500-550] Speed: 1724.53 samples/sec accuracy=0.574375
INFO:root:Epoch[0] Batch [550-600] Speed: 1721.45 samples/sec accuracy=0.581250
INFO:root:Epoch[0] Batch [600-650] Speed: 1658.77 samples/sec accuracy=0.607500
INFO:root:Epoch[0] Batch [650-700] Speed: 1725.24 samples/sec accuracy=0.606250
INFO:root:Epoch[0] Batch [700-750] Speed: 1726.21 samples/sec accuracy=0.606563

I also built with:
make -j USE_MKLDNN=1 USE_BLAS=openblas USE_GPERFTOOLS=0
cd python/ && python setup.py install
Results are as follows:
Archive: cifar10.zip
creating: cifar/
inflating: cifar/test.rec
inflating: cifar/test.lst
inflating: cifar/train.lst
inflating: cifar/train.rec
[07:38:12] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[07:38:12] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[07:38:12] src/executor/graph_executor.cc:1984: Subgraph backend MKLDNN is activated.
INFO:root:Epoch[0] Batch [0-50] Speed: 1416.12 samples/sec accuracy=0.278799
INFO:root:Epoch[0] Batch [50-100] Speed: 1673.98 samples/sec accuracy=0.385313
INFO:root:Epoch[0] Batch [100-150] Speed: 1624.87 samples/sec accuracy=0.424687
INFO:root:Epoch[0] Batch [150-200] Speed: 1668.53 samples/sec accuracy=0.438750
INFO:root:Epoch[0] Batch [200-250] Speed: 1664.30 samples/sec accuracy=0.478438
INFO:root:Epoch[0] Batch [250-300] Speed: 1696.48 samples/sec accuracy=0.511250
INFO:root:Epoch[0] Batch [300-350] Speed: 1701.83 samples/sec accuracy=0.517188
INFO:root:Epoch[0] Batch [350-400] Speed: 1616.46 samples/sec accuracy=0.545000
INFO:root:Epoch[0] Batch [400-450] Speed: 1697.75 samples/sec accuracy=0.556875
INFO:root:Epoch[0] Batch [450-500] Speed: 1703.83 samples/sec accuracy=0.575625
INFO:root:Epoch[0] Batch [500-550] Speed: 1703.13 samples/sec accuracy=0.572812
INFO:root:Epoch[0] Batch [550-600] Speed: 1699.32 samples/sec accuracy=0.587187
INFO:root:Epoch[0] Batch [600-650] Speed: 1682.87 samples/sec accuracy=0.604688
INFO:root:Epoch[0] Batch [650-700] Speed: 1671.12 samples/sec accuracy=0.612187
INFO:root:Epoch[0] Batch [700-750] Speed: 1705.85 samples/sec accuracy=0.611875
INFO:root:Epoch[0] Train-accuracy=0.516964
INFO:root:Epoch[0] Time cost=30.561
INFO:root:Epoch[0] Validation-accuracy=0.628085

(Screenshots attached: 1.6 make build and 1.6 script build results.)

@oavision7946

Hi @TaoLv, is there an ETA to have this issue fixed? It's causing quite some concern around here.

Thanks,

Omar

@larroy (Contributor) commented Dec 1, 2019

Added a script for easy repro:

http://ix.io/23fU

http://ix.io/23fV

To run:

piotr@34-215-197-42:130:~$ for i in 1 2 4 8 16 32 64 128 256 512 1024 2048; do ./imagenet.sh $i 2>&1 | tee run_$i.log; done
piotr@34-215-197-42:1:~$ ./table.py


@TaoLv (Member) commented Dec 1, 2019

@oorqueda @samskalicky @leleamol As mentioned in #16891 (comment), I suspect that the regression is caused by the removal of libiomp5.so. To verify, please try applying the patch below to make/pip/pip_linux_mkl.mk:

diff --git a/make/pip/pip_linux_mkl.mk b/make/pip/pip_linux_mkl.mk
index 1cf389ae4..dd23434fa 100644
--- a/make/pip/pip_linux_mkl.mk
+++ b/make/pip/pip_linux_mkl.mk
@@ -49,7 +49,7 @@ ADD_CFLAGS += -I$(DEPS_PATH)/include -ffunction-sections -fdata-sections
 # choose the version of blas you want to use
 # can be: mkl, blas, atlas, openblas
 # in default use atlas for linux while apple for osx
-USE_BLAS=openblas
+USE_BLAS=mkl

 # whether use opencv during compilation
 # you can disable it, however, you will not able to use
@@ -98,7 +98,7 @@ USE_LAPACK_PATH = $(DEPS_PATH)/lib

 # add path to intel library, you may need it for MKL, if you did not add the path
 # to environment variable
-USE_INTEL_PATH = NONE
+USE_INTEL_PATH = /opt/intel/

And then build MXNet with:

tools/staticbuild/build.sh mkl pip

If that's true, I don't think we have any way to avoid the regression in the pip packages, as removing libiomp5.so is a requirement from Apache. Please refer to #15544. Thanks!
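As a quick sanity check after rebuilding, the linked OpenMP runtime can be verified the same way as the ldd checks earlier in this thread (with USE_BLAS=mkl, libiomp5.so should appear instead of libgomp.so.1):

ldd lib/libmxnet.so | grep -E 'omp|mkl|openblas'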

@pengzhao-intel (Contributor)

@leleamol could you help confirm the current test status based on our feedback?
I don't want it to block the 1.6 release.

cc @samskalicky @apeforest

@NihalHarish commented Dec 6, 2019

(Quoting @TaoLv's suggestion above: apply the USE_BLAS=mkl / USE_INTEL_PATH patch to make/pip/pip_linux_mkl.mk and build with tools/staticbuild/build.sh mkl pip.)

Retried with this patch after installing MKL BLAS with https://github.com/apache/incubator-mxnet/blob/master/ci/docker/install/ubuntu_mkl.sh and got these results:

Average Throughput: 1663.49 samples/sec

INFO:root:Epoch[0] Batch [0-50]	Speed: 1414.31 samples/sec	accuracy=0.281863
INFO:root:Epoch[0] Batch [50-100]	Speed: 1610.74 samples/sec	accuracy=0.382500
INFO:root:Epoch[0] Batch [100-150]	Speed: 1625.33 samples/sec	accuracy=0.430000
INFO:root:Epoch[0] Batch [150-200]	Speed: 1649.23 samples/sec	accuracy=0.432500
INFO:root:Epoch[0] Batch [200-250]	Speed: 1663.87 samples/sec	accuracy=0.465000
INFO:root:Epoch[0] Batch [250-300]	Speed: 1640.63 samples/sec	accuracy=0.495625
INFO:root:Epoch[0] Batch [300-350]	Speed: 1671.83 samples/sec	accuracy=0.502500
INFO:root:Epoch[0] Batch [350-400]	Speed: 1669.90 samples/sec	accuracy=0.516563
INFO:root:Epoch[0] Batch [400-450]	Speed: 1600.49 samples/sec	accuracy=0.548125
INFO:root:Epoch[0] Batch [450-500]	Speed: 1669.11 samples/sec	accuracy=0.562500
INFO:root:Epoch[0] Batch [500-550]	Speed: 1671.51 samples/sec	accuracy=0.558750
INFO:root:Epoch[0] Batch [550-600]	Speed: 1667.67 samples/sec	accuracy=0.586875
INFO:root:Epoch[0] Batch [600-650]	Speed: 1670.19 samples/sec	accuracy=0.591562
INFO:root:Epoch[0] Batch [650-700]	Speed: 1652.81 samples/sec	accuracy=0.611250
INFO:root:Epoch[0] Batch [700-750]	Speed: 1630.58 samples/sec	accuracy=0.600000
INFO:root:Epoch[0] Train-accuracy=0.508252
INFO:root:Epoch[0] Time cost=30.680
INFO:root:Epoch[0] Validation-accuracy=0.632166
INFO:root:Epoch[1] Batch [0-50]	Speed: 1648.76 samples/sec	accuracy=0.625613
INFO:root:Epoch[1] Batch [50-100]	Speed: 1660.23 samples/sec	accuracy=0.629375
INFO:root:Epoch[1] Batch [100-150]	Speed: 1616.19 samples/sec	accuracy=0.640312
INFO:root:Epoch[1] Batch [150-200]	Speed: 1670.47 samples/sec	accuracy=0.643125
INFO:root:Epoch[1] Batch [200-250]	Speed: 1670.92 samples/sec	accuracy=0.657500
INFO:root:Epoch[1] Batch [250-300]	Speed: 1671.10 samples/sec	accuracy=0.655625
INFO:root:Epoch[1] Batch [300-350]	Speed: 1669.03 samples/sec	accuracy=0.651250
INFO:root:Epoch[1] Batch [350-400]	Speed: 1669.22 samples/sec	accuracy=0.655312
INFO:root:Epoch[1] Batch [400-450]	Speed: 1671.08 samples/sec	accuracy=0.672813
INFO:root:Epoch[1] Batch [450-500]	Speed: 1671.26 samples/sec	accuracy=0.673750
INFO:root:Epoch[1] Batch [500-550]	Speed: 1650.34 samples/sec	accuracy=0.682500
INFO:root:Epoch[1] Batch [550-600]	Speed: 1663.81 samples/sec	accuracy=0.681250
INFO:root:Epoch[1] Batch [600-650]	Speed: 1671.43 samples/sec	accuracy=0.695625
INFO:root:Epoch[1] Batch [650-700]	Speed: 1622.47 samples/sec	accuracy=0.698438
INFO:root:Epoch[1] Batch [700-750]	Speed: 1671.23 samples/sec	accuracy=0.687187
INFO:root:Epoch[1] Train-accuracy=0.664633
INFO:root:Epoch[1] Time cost=30.096
INFO:root:Epoch[1] Validation-accuracy=0.673878
INFO:root:Epoch[2] Batch [0-50]	Speed: 1668.44 samples/sec	accuracy=0.701900
INFO:root:Epoch[2] Batch [50-100]	Speed: 1673.86 samples/sec	accuracy=0.698750
INFO:root:Epoch[2] Batch [100-150]	Speed: 1669.55 samples/sec	accuracy=0.712500
INFO:root:Epoch[2] Batch [150-200]	Speed: 1673.31 samples/sec	accuracy=0.713750
INFO:root:Epoch[2] Batch [200-250]	Speed: 1673.31 samples/sec	accuracy=0.726562
INFO:root:Epoch[2] Batch [250-300]	Speed: 1672.89 samples/sec	accuracy=0.717187
INFO:root:Epoch[2] Batch [300-350]	Speed: 1651.81 samples/sec	accuracy=0.725938
INFO:root:Epoch[2] Batch [350-400]	Speed: 1623.66 samples/sec	accuracy=0.718750
INFO:root:Epoch[2] Batch [400-450]	Speed: 1672.81 samples/sec	accuracy=0.729688
INFO:root:Epoch[2] Batch [450-500]	Speed: 1672.86 samples/sec	accuracy=0.736563
INFO:root:Epoch[2] Batch [500-550]	Speed: 1669.99 samples/sec	accuracy=0.730625
INFO:root:Epoch[2] Batch [550-600]	Speed: 1670.90 samples/sec	accuracy=0.728750
INFO:root:Epoch[2] Batch [600-650]	Speed: 1673.84 samples/sec	accuracy=0.739375
INFO:root:Epoch[2] Batch [650-700]	Speed: 1675.46 samples/sec	accuracy=0.750313
INFO:root:Epoch[2] Batch [700-750]	Speed: 1675.23 samples/sec	accuracy=0.739062
INFO:root:Epoch[2] Train-accuracy=0.725112
INFO:root:Epoch[2] Time cost=29.959
INFO:root:Epoch[2] Validation-accuracy=0.699419
INFO:root:Epoch[3] Batch [0-50]	Speed: 1620.48 samples/sec	accuracy=0.747243
INFO:root:Epoch[3] Batch [50-100]	Speed: 1665.64 samples/sec	accuracy=0.747188
INFO:root:Epoch[3] Batch [100-150]	Speed: 1669.65 samples/sec	accuracy=0.744375
INFO:root:Epoch[3] Batch [150-200]	Speed: 1672.57 samples/sec	accuracy=0.756563
INFO:root:Epoch[3] Batch [200-250]	Speed: 1673.09 samples/sec	accuracy=0.755625
INFO:root:Epoch[3] Batch [250-300]	Speed: 1672.16 samples/sec	accuracy=0.757500
INFO:root:Epoch[3] Batch [300-350]	Speed: 1671.06 samples/sec	accuracy=0.757812
INFO:root:Epoch[3] Batch [350-400]	Speed: 1670.54 samples/sec	accuracy=0.754687
INFO:root:Epoch[3] Batch [400-450]	Speed: 1673.20 samples/sec	accuracy=0.774375
INFO:root:Epoch[3] Batch [450-500]	Speed: 1656.83 samples/sec	accuracy=0.768750
INFO:root:Epoch[3] Batch [500-550]	Speed: 1672.77 samples/sec	accuracy=0.772813
INFO:root:Epoch[3] Batch [550-600]	Speed: 1662.18 samples/sec	accuracy=0.770312
INFO:root:Epoch[3] Batch [600-650]	Speed: 1672.07 samples/sec	accuracy=0.770000
INFO:root:Epoch[3] Batch [650-700]	Speed: 1642.67 samples/sec	accuracy=0.780000
INFO:root:Epoch[3] Batch [700-750]	Speed: 1670.11 samples/sec	accuracy=0.776875
INFO:root:Epoch[3] Train-accuracy=0.762764
INFO:root:Epoch[3] Time cost=30.022
INFO:root:Epoch[3] Validation-accuracy=0.731771
INFO:root:Epoch[4] Batch [0-50]	Speed: 1667.95 samples/sec	accuracy=0.778493
INFO:root:Epoch[4] Batch [50-100]	Speed: 1672.75 samples/sec	accuracy=0.790312
INFO:root:Epoch[4] Batch [100-150]	Speed: 1669.29 samples/sec	accuracy=0.776875
INFO:root:Epoch[4] Batch [150-200]	Speed: 1673.50 samples/sec	accuracy=0.792500
INFO:root:Epoch[4] Batch [200-250]	Speed: 1672.97 samples/sec	accuracy=0.783438
INFO:root:Epoch[4] Batch [250-300]	Speed: 1672.72 samples/sec	accuracy=0.796250
INFO:root:Epoch[4] Batch [300-350]	Speed: 1658.90 samples/sec	accuracy=0.784687
INFO:root:Epoch[4] Batch [350-400]	Speed: 1669.21 samples/sec	accuracy=0.790937
INFO:root:Epoch[4] Batch [400-450]	Speed: 1664.05 samples/sec	accuracy=0.800312
INFO:root:Epoch[4] Batch [450-500]	Speed: 1637.17 samples/sec	accuracy=0.789375
INFO:root:Epoch[4] Batch [500-550]	Speed: 1665.37 samples/sec	accuracy=0.799687
INFO:root:Epoch[4] Batch [550-600]	Speed: 1668.98 samples/sec	accuracy=0.806562
INFO:root:Epoch[4] Batch [600-650]	Speed: 1672.85 samples/sec	accuracy=0.809375
INFO:root:Epoch[4] Batch [650-700]	Speed: 1674.14 samples/sec	accuracy=0.816562
INFO:root:Epoch[4] Batch [700-750]	Speed: 1674.87 samples/sec	accuracy=0.800000
INFO:root:Epoch[4] Train-accuracy=0.794457
INFO:root:Epoch[4] Time cost=29.996
INFO:root:Epoch[4] Validation-accuracy=0.741740
INFO:root:Epoch[5] Batch [0-50]	Speed: 1668.07 samples/sec	accuracy=0.809436
INFO:root:Epoch[5] Batch [50-100]	Speed: 1673.35 samples/sec	accuracy=0.810312
INFO:root:Epoch[5] Batch [100-150]	Speed: 1651.66 samples/sec	accuracy=0.807500
INFO:root:Epoch[5] Batch [150-200]	Speed: 1667.67 samples/sec	accuracy=0.809063
INFO:root:Epoch[5] Batch [200-250]	Speed: 1668.76 samples/sec	accuracy=0.808750
INFO:root:Epoch[5] Batch [250-300]	Speed: 1672.72 samples/sec	accuracy=0.810937
INFO:root:Epoch[5] Batch [300-350]	Speed: 1671.69 samples/sec	accuracy=0.816562
INFO:root:Epoch[5] Batch [350-400]	Speed: 1672.54 samples/sec	accuracy=0.818750
INFO:root:Epoch[5] Batch [400-450]	Speed: 1631.24 samples/sec	accuracy=0.822187
INFO:root:Epoch[5] Batch [450-500]	Speed: 1665.93 samples/sec	accuracy=0.815937
INFO:root:Epoch[5] Batch [500-550]	Speed: 1674.52 samples/sec	accuracy=0.819063
INFO:root:Epoch[5] Batch [550-600]	Speed: 1670.75 samples/sec	accuracy=0.812500
INFO:root:Epoch[5] Batch [600-650]	Speed: 1673.81 samples/sec	accuracy=0.825937
INFO:root:Epoch[5] Batch [650-700]	Speed: 1676.04 samples/sec	accuracy=0.827187
INFO:root:Epoch[5] Batch [700-750]	Speed: 1675.77 samples/sec	accuracy=0.817813
INFO:root:Epoch[5] Train-accuracy=0.815501
INFO:root:Epoch[5] Time cost=29.948
INFO:root:Epoch[5] Validation-accuracy=0.749399
INFO:root:Epoch[6] Batch [0-50]	Speed: 1669.17 samples/sec	accuracy=0.837623
INFO:root:Epoch[6] Batch [50-100]	Speed: 1661.24 samples/sec	accuracy=0.813750
INFO:root:Epoch[6] Batch [100-150]	Speed: 1667.14 samples/sec	accuracy=0.830313
INFO:root:Epoch[6] Batch [150-200]	Speed: 1667.80 samples/sec	accuracy=0.826250
INFO:root:Epoch[6] Batch [200-250]	Speed: 1673.15 samples/sec	accuracy=0.826562
INFO:root:Epoch[6] Batch [250-300]	Speed: 1646.27 samples/sec	accuracy=0.836875
INFO:root:Epoch[6] Batch [300-350]	Speed: 1666.01 samples/sec	accuracy=0.829375
INFO:root:Epoch[6] Batch [350-400]	Speed: 1672.95 samples/sec	accuracy=0.834688
INFO:root:Epoch[6] Batch [400-450]	Speed: 1673.64 samples/sec	accuracy=0.835625
INFO:root:Epoch[6] Batch [450-500]	Speed: 1675.71 samples/sec	accuracy=0.843437
INFO:root:Epoch[6] Batch [500-550]	Speed: 1674.81 samples/sec	accuracy=0.849688
INFO:root:Epoch[6] Batch [550-600]	Speed: 1670.66 samples/sec	accuracy=0.848750
INFO:root:Epoch[6] Batch [600-650]	Speed: 1674.67 samples/sec	accuracy=0.850000
INFO:root:Epoch[6] Batch [650-700]	Speed: 1676.15 samples/sec	accuracy=0.852187
INFO:root:Epoch[6] Batch [700-750]	Speed: 1662.28 samples/sec	accuracy=0.840625
INFO:root:Epoch[6] Train-accuracy=0.837408
INFO:root:Epoch[6] Time cost=29.926
INFO:root:Epoch[6] Validation-accuracy=0.755609
INFO:root:Epoch[7] Batch [0-50]	Speed: 1669.53 samples/sec	accuracy=0.851409
INFO:root:Epoch[7] Batch [50-100]	Speed: 1673.99 samples/sec	accuracy=0.851875
INFO:root:Epoch[7] Batch [100-150]	Speed: 1664.78 samples/sec	accuracy=0.845000
INFO:root:Epoch[7] Batch [150-200]	Speed: 1643.95 samples/sec	accuracy=0.848125
INFO:root:Epoch[7] Batch [200-250]	Speed: 1673.32 samples/sec	accuracy=0.846250
INFO:root:Epoch[7] Batch [250-300]	Speed: 1674.50 samples/sec	accuracy=0.854062
INFO:root:Epoch[7] Batch [300-350]	Speed: 1667.81 samples/sec	accuracy=0.868750
INFO:root:Epoch[7] Batch [350-400]	Speed: 1672.58 samples/sec	accuracy=0.856875
INFO:root:Epoch[7] Batch [400-450]	Speed: 1674.09 samples/sec	accuracy=0.856563
INFO:root:Epoch[7] Batch [450-500]	Speed: 1674.60 samples/sec	accuracy=0.855000
INFO:root:Epoch[7] Batch [500-550]	Speed: 1674.48 samples/sec	accuracy=0.868125
INFO:root:Epoch[7] Batch [550-600]	Speed: 1670.71 samples/sec	accuracy=0.854688
INFO:root:Epoch[7] Batch [600-650]	Speed: 1674.68 samples/sec	accuracy=0.859375
INFO:root:Epoch[7] Batch [650-700]	Speed: 1675.54 samples/sec	accuracy=0.867812
INFO:root:Epoch[7] Batch [700-750]	Speed: 1636.57 samples/sec	accuracy=0.861250
INFO:root:Epoch[7] Train-accuracy=0.856634
INFO:root:Epoch[7] Time cost=29.935
INFO:root:Epoch[7] Validation-accuracy=0.751202
INFO:root:Epoch[8] Batch [0-50]	Speed: 1666.25 samples/sec	accuracy=0.862745
INFO:root:Epoch[8] Batch [50-100]	Speed: 1667.20 samples/sec	accuracy=0.871563
INFO:root:Epoch[8] Batch [100-150]	Speed: 1638.39 samples/sec	accuracy=0.859688
INFO:root:Epoch[8] Batch [150-200]	Speed: 1668.52 samples/sec	accuracy=0.874687
INFO:root:Epoch[8] Batch [200-250]	Speed: 1664.86 samples/sec	accuracy=0.866875
INFO:root:Epoch[8] Batch [250-300]	Speed: 1670.59 samples/sec	accuracy=0.866250
INFO:root:Epoch[8] Batch [300-350]	Speed: 1672.36 samples/sec	accuracy=0.872500
INFO:root:Epoch[8] Batch [350-400]	Speed: 1667.79 samples/sec	accuracy=0.876250
INFO:root:Epoch[8] Batch [400-450]	Speed: 1672.58 samples/sec	accuracy=0.875938
INFO:root:Epoch[8] Batch [450-500]	Speed: 1672.51 samples/sec	accuracy=0.871250
INFO:root:Epoch[8] Batch [500-550]	Speed: 1671.49 samples/sec	accuracy=0.878750
INFO:root:Epoch[8] Batch [550-600]	Speed: 1668.27 samples/sec	accuracy=0.884062
INFO:root:Epoch[8] Batch [600-650]	Speed: 1656.65 samples/sec	accuracy=0.882812
INFO:root:Epoch[8] Batch [650-700]	Speed: 1671.64 samples/sec	accuracy=0.884062
INFO:root:Epoch[8] Batch [700-750]	Speed: 1673.34 samples/sec	accuracy=0.874687
INFO:root:Epoch[8] Train-accuracy=0.873581
INFO:root:Epoch[8] Time cost=30.010
INFO:root:Epoch[8] Validation-accuracy=0.766421
INFO:root:Epoch[9] Batch [0-50]	Speed: 1669.04 samples/sec	accuracy=0.879289
INFO:root:Epoch[9] Batch [50-100]	Speed: 1671.88 samples/sec	accuracy=0.887188
INFO:root:Epoch[9] Batch [100-150]	Speed: 1662.53 samples/sec	accuracy=0.867500
INFO:root:Epoch[9] Batch [150-200]	Speed: 1672.37 samples/sec	accuracy=0.881875
INFO:root:Epoch[9] Batch [200-250]	Speed: 1672.11 samples/sec	accuracy=0.886563
INFO:root:Epoch[9] Batch [250-300]	Speed: 1635.77 samples/sec	accuracy=0.870938
INFO:root:Epoch[9] Batch [300-350]	Speed: 1670.30 samples/sec	accuracy=0.884062
INFO:root:Epoch[9] Batch [350-400]	Speed: 1671.09 samples/sec	accuracy=0.879375
INFO:root:Epoch[9] Batch [400-450]	Speed: 1667.68 samples/sec	accuracy=0.883125
INFO:root:Epoch[9] Batch [450-500]	Speed: 1673.33 samples/sec	accuracy=0.885000
INFO:root:Epoch[9] Batch [500-550]	Speed: 1672.83 samples/sec	accuracy=0.883750
INFO:root:Epoch[9] Batch [550-600]	Speed: 1668.54 samples/sec	accuracy=0.887500
INFO:root:Epoch[9] Batch [600-650]	Speed: 1672.97 samples/sec	accuracy=0.890312
INFO:root:Epoch[9] Batch [650-700]	Speed: 1653.01 samples/sec	accuracy=0.889062
INFO:root:Epoch[9] Batch [700-750]	Speed: 1673.44 samples/sec	accuracy=0.889062
INFO:root:Epoch[9] Train-accuracy=0.883263
INFO:root:Epoch[9] Time cost=29.960
INFO:root:Epoch[9] Validation-accuracy=0.762520
INFO:root:Epoch[10] Batch [0-50]	Speed: 1666.71 samples/sec	accuracy=0.887868
INFO:root:Epoch[10] Batch [50-100]	Speed: 1672.06 samples/sec	accuracy=0.882500
INFO:root:Epoch[10] Batch [100-150]	Speed: 1668.15 samples/sec	accuracy=0.881250
INFO:root:Epoch[10] Batch [150-200]	Speed: 1667.18 samples/sec	accuracy=0.899062
INFO:root:Epoch[10] Batch [200-250]	Speed: 1670.72 samples/sec	accuracy=0.881563
INFO:root:Epoch[10] Batch [250-300]	Speed: 1671.63 samples/sec	accuracy=0.890000
INFO:root:Epoch[10] Batch [300-350]	Speed: 1669.62 samples/sec	accuracy=0.905625
INFO:root:Epoch[10] Batch [350-400]	Speed: 1664.69 samples/sec	accuracy=0.904375
INFO:root:Epoch[10] Batch [400-450]	Speed: 1671.13 samples/sec	accuracy=0.901250
INFO:root:Epoch[10] Batch [450-500]	Speed: 1666.08 samples/sec	accuracy=0.896250
INFO:root:Epoch[10] Batch [500-550]	Speed: 1670.59 samples/sec	accuracy=0.905312
INFO:root:Epoch[10] Batch [550-600]	Speed: 1667.69 samples/sec	accuracy=0.894687
INFO:root:Epoch[10] Batch [600-650]	Speed: 1671.95 samples/sec	accuracy=0.895938
INFO:root:Epoch[10] Batch [650-700]	Speed: 1672.98 samples/sec	accuracy=0.909375
INFO:root:Epoch[10] Batch [700-750]	Speed: 1624.72 samples/sec	accuracy=0.909375
INFO:root:Epoch[10] Train-accuracy=0.896667
INFO:root:Epoch[10] Time cost=29.974
INFO:root:Epoch[10] Validation-accuracy=0.764123
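For reference, the average throughput quoted above can be recomputed from a saved log with a one-liner like this (a sketch; assumes GNU grep and that the output is saved to train.log):

grep -oP 'Speed: \K[0-9.]+' train.log | awk '{s+=$1; n++} END {printf "%.2f samples/sec\n", s/n}'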

@ChaiBapchya (Contributor)

@NihalHarish thanks for verifying

@TaoLv
The patch doesn't seem to be merged on the master branch. Is there a reason it wasn't done along with the PR that bumped MKLDNN to v1.0, #16555?

(the same USE_BLAS=mkl / USE_INTEL_PATH patch quoted above)

If it was omitted by mistake and it is indeed required, I can push a PR for it.

Also, do you folks have any data about performance tests run AFTER this patch is applied?

Thanks.

@TaoLv (Member) commented Dec 8, 2019

@ChaiBapchya The file is used to build the mxnet-mkl pip package. If you want to change the configurations, I think you need to raise a proposal on dev@.

@ptrendx (Member) commented Dec 10, 2019

What is the status of this issue? From the conversation it seems to me that the Intel people think it is not an issue (or at least that it is unavoidable), while the Amazon people are concerned about it. Is that accurate? If so, how does it affect the 1.6 release: should I go ahead and make the RC despite this issue, or is there active work going on to fix it?

@samskalicky (Contributor) commented Dec 10, 2019

@TaoLv are you saying that we should keep the current config, where we build the mkl flavor with OpenBLAS?
master:
https://github.com/apache/incubator-mxnet/blob/7895f93e67dc3e9da360f7a9c667e3c0f1e76c0f/make/staticbuild/linux_mkl.mk#L52
1.6.x branch:
https://github.com/apache/incubator-mxnet/blob/a576531836c5a5c4fb6dfbc944de94b619d6ccfa/make/pip/pip_linux_mkl.mk#L52
Or are you proposing that it needs to be changed to build the mkl flavor with MKL BLAS instead of OpenBLAS?

@TaoLv (Member) commented Dec 11, 2019

The mkl flavor packages are always built with USE_BLAS=openblas. We can change that to MKL BLAS if we are allowed to include a dependency with a Category X license [1] in MXNet convenience releases.

[1] https://www.apache.org/legal/resolved.html#category-x

@samskalicky (Contributor)

Thanks @TaoLv

I was able to rebuild and reproduce Nihal's results:

$ python deeplearning-benchmark/image_classification/image_classification.py --model resnet18_v2 --dataset cifar10 --mode symbolic --gpus 0 --epochs 25 --log-interval 50 --kvstore local --dtype='float32' --batch-size=64
Namespace(batch_norm=False, batch_size=64, benchmark=False, dataset='cifar10', dtype='float32', epochs=25, gpus=0, kvstore='local', log_interval=50, lr=0.01, mode='symbolic', model='resnet18_v2', seed=123, use_pretrained=False, use_thumbnail=False, wd=0.0001)
Archive:  cifar10.zip
   creating: cifar/
  inflating: cifar/test.rec          
  inflating: cifar/test.lst          
  inflating: cifar/train.lst         
  inflating: cifar/train.rec         
[05:12:00] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/train.rec, use 4 threads for decoding..
[05:12:00] src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: data/cifar/test.rec, use 4 threads for decoding..
[05:12:00] src/executor/graph_executor.cc:1979: Subgraph backend MKLDNN is activated.
INFO:root:Epoch[0] Batch [0-50]	Speed: 1583.17 samples/sec	accuracy=0.285846
INFO:root:Epoch[0] Batch [50-100]	Speed: 1508.38 samples/sec	accuracy=0.388750
INFO:root:Epoch[0] Batch [100-150]	Speed: 1623.32 samples/sec	accuracy=0.433125
INFO:root:Epoch[0] Batch [150-200]	Speed: 1613.61 samples/sec	accuracy=0.443437
INFO:root:Epoch[0] Batch [200-250]	Speed: 1642.54 samples/sec	accuracy=0.455000
INFO:root:Epoch[0] Batch [250-300]	Speed: 1625.45 samples/sec	accuracy=0.506250
INFO:root:Epoch[0] Batch [300-350]	Speed: 1620.83 samples/sec	accuracy=0.515312
INFO:root:Epoch[0] Batch [350-400]	Speed: 1637.02 samples/sec	accuracy=0.537500
INFO:root:Epoch[0] Batch [400-450]	Speed: 1635.96 samples/sec	accuracy=0.550937
INFO:root:Epoch[0] Batch [450-500]	Speed: 1641.26 samples/sec	accuracy=0.574688
INFO:root:Epoch[0] Batch [500-550]	Speed: 1643.39 samples/sec	accuracy=0.569063
INFO:root:Epoch[0] Batch [550-600]	Speed: 1639.69 samples/sec	accuracy=0.573125
INFO:root:Epoch[0] Batch [600-650]	Speed: 1644.01 samples/sec	accuracy=0.598437
INFO:root:Epoch[0] Batch [650-700]	Speed: 1644.10 samples/sec	accuracy=0.614375
INFO:root:Epoch[0] Batch [700-750]	Speed: 1644.86 samples/sec	accuracy=0.601250

The root cause of this performance regression is the switch of BLAS library (from MKL BLAS to OpenBLAS) together with the removal of the libiomp5.so library.

Now the next step is to determine how we want to proceed. Do we continue with OpenBLAS and take the hit on performance, or, as @TaoLv mentioned, can we use the Category X licensed dependency?

@vpirogov

Hi @TaoLv, @samskalicky,

Intel MKL-DNN includes a GEMM implementation that is comparable in performance to Intel MKL's. Is using mkldnn_gemm an option here?

@samskalicky (Contributor) commented Dec 11, 2019

@TaoLv @pengzhao-intel Are there features in MXNet that require MKL as the BLAS library? I was able to find this line:
https://github.com/apache/incubator-mxnet/blob/c82af38211dbf8356a4f3b35f023632c5bf880ae/src/operator/quantization/quantized_fully_connected.cc#L291

I'm rereading the previous comment and now I'm confused:

@oorqueda @samskalicky @leleamol As mentioned in #16891 (comment), I suspect that the regression is caused by the removal of libiomp5.so.
...
If it's true, I don't think we have any choice to avoid the regression in pip packages as removing libiomp5.so is a requirement from Apache. Please refer to #15544. Thanks!

Is the performance difference coming from using Intel's OpenMP library (libiomp5), or from using the MKL BLAS library itself and some of its routines like GEMM (as @vpirogov mentions)?

@TaoLv (Member) commented Dec 12, 2019

@vpirogov @samskalicky Although MKL BLAS may also have a positive impact on the case demonstrated above, I think the main gap comes from the different OMP runtimes. Setting USE_BLAS=mkl will help pull in iomp5. Sure, I'm going to replace cblas_sgemm and cblas_sgemm_batch with the MatMul primitive from DNNL once it's released, but I don't think that will help close the gap between gomp and iomp5.

@samskalicky The code you referred to will not be called in the ResNet18 case. Most of the computation in ResNet18 should go through DNNL.

@vpirogov

@TaoLv, is anything preventing us from using the LLVM OpenMP runtime (libomp)? It is pretty much an open-source version of libiomp5.

@TaoLv (Member) commented Dec 12, 2019

@vpirogov We can do that. My only concern is its interoperability. Also, from the MXNet perspective, we would need to move the release process from make to cmake, which I don't think can be done within the schedule of the 1.6.0 release.

@vpirogov

What do you mean by interoperability exactly?

@ChaiBapchya (Contributor)

@TaoLv To get closure on this topic, would it be possible to move the discussion forward?
Thanks.

@TaoLv (Member) commented Dec 18, 2019

@vpirogov @ChaiBapchya By interoperability I mean:

  • how to pass the threading model to the dependencies of MXNet, e.g. openblas, lapack, opencv, dnnl, mkl;
  • how to cooperate with other tools, e.g. gomp-based numpy or pytorch (one common mitigation is sketched below).
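As an aside, one commonly used way to force every library in a process onto a single OpenMP runtime is to preload that runtime; a sketch, where the libiomp5 path, and whether preloading is acceptable here at all, are assumptions:

LD_PRELOAD=/opt/intel/lib/intel64/libiomp5.so python image_classification.py ...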

@vpirogov

@TaoLv,

You are right that when different OpenMP runtimes are used in the same application, there is potential for interoperability issues. For this particular discussion, it's important to note that the interoperability considerations are the same for libiomp5 and libomp. From that perspective, using libomp does not introduce any additional issues compared to what MXNet used before (i.e., libiomp5).

@TaoLv (Member) commented Dec 19, 2019

@vpirogov, yes, that's true. libomp and libiomp5 should have the same interoperability issues. From this perspective, the current release build solution (makefile + gomp) sounds like the safer choice, though it has relatively worse performance. I assume that gomp has better interoperability than the other two runtimes, but that may not be true.

@pengzhao-intel (Contributor)

@samskalicky and all,
The problem is very clear now. I think we need to make a decision and move forward.
Two possible paths are below:

  • Keep the build as-is with gomp
    pros: stable and mature now
    cons: a slight performance drop

  • Re-build with LLVM OpenMP via CMake
    pros: same performance as before
    cons: effort needed on the CMake path and potential interoperability issues

From my side, I prefer the first option. What's your opinion?

@apeforest (Contributor)

Hi @pengzhao-intel, in MXNet 2.0 Cmake is planned to be the only build system: https://github.com/apache/incubator-mxnet/projects/18#card-30594044

Would that address the cons in Option 2?

TaoLv mentioned this issue on Feb 25, 2020
@pengzhao-intel (Contributor)

(Replying to @apeforest's question above about CMake becoming the only build system in MXNet 2.0.)

It's a good chance to make the system clean :)

@pengzhao-intel (Contributor)

Closing, since the fix has already been included with the latest MKLDNN version update.
