MXNet 1.6.0 performance regression #16845
Comments
Thanks for the validation. Do you have a chance to try it on CPU? |
Hey @pengzhao-intel, I tested training on GPU and inference on CPU. For inference, the times came very close to each other so I did not see any noticeable regression there! |
Thanks a lot for your efforts. It's great to hear this. |
Use this script to build your own MXNet |
Seems like a significant drop. The p3 resnet152_v1 test should be fairly easy to reproduce. Has anyone else had a chance to verify this regression? Edit: I notice you're comparing a CUDA 10.1 binary to a CUDA 10.0 one. Have you tried compiling for 10.1? |
I will look into this today and see if I can repro it. |
@KellenSunderland So I noticed that the instances I was running on have CUDA 10.0 so I assumed that when running it would default to that version even though mxnet_cu101 was installed. I will rerun the tests with all the instances running on cu100 because I'm having issues building cu101mkl. |
Ok, so I looked into it, and I can kind of see 1.6 being slower, but on the other hand this script is really not a great way of testing GPU training performance. Because the kernels are tiny, the run is actually dominated by gaps in execution while the CPU is trying to launch the kernels (and the command line you gave does not even use hybridization to offset that; enabling hybridization improves performance by ~2x). Looking at the GPU kernel time I do not see any real difference, so the slowdown is most probably due to an increase in the time spent actually launching the ops. |
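For reference, a minimal sketch of what enabling hybridization looks like, using the stock mxnet.gluon.model_zoo ResNet as a stand-in for the training script's model (the input shape and context below are illustrative):
import mxnet as mx
from mxnet.gluon.model_zoo import vision

# Build the model and hybridize it so the symbolic graph is cached,
# cutting down the per-op launch overhead described above.
net = vision.resnet152_v1(pretrained=False)
net.initialize(ctx=mx.gpu(0))
net.hybridize(static_alloc=True, static_shape=True)

# Dummy batch just to trigger graph caching.
x = mx.nd.random.uniform(shape=(1, 3, 224, 224), ctx=mx.gpu(0))
y = net(x)
y.wait_to_read()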
@apeforest Could somebody do a bisection to find where this was introduced? |
@jonatan1626 is currently running the experiments again on multiple machines. Earlier he was running them on the same instance, and we suspect there might be some performance interference between runs. |
Is the performance worse if we turn on hybridization? |
@jonatan1626 Could you please update your latest performance comparison result here? |
Here are the results. @ptrendx was correct that the CIFAR-10 dataset is too small for GPU testing. It doesn't look like there is a regression between the two versions! Running ImageNet training on the p3.16xlarge GPUs with mxnet-cu100:
|
Running CIFAR-10 training on the c5.18xlarge CPUs with mxnet-cu100 (note: I just set the number of GPUs to 0 for the script, roughly as shown below):
|
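For clarity, the CPU runs above would have been launched along these lines (same train_cifar10.py script as the GPU runs, with the GPU count set to 0):
python train_cifar10.py --num-gpus 0 --model resnet152_v1 --num-epochs 40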
Hi @jonatan1626, out of curiosity, what does the "_LT" stand for (e.g. in "mxnet1.6_gpu_LT")? |
@ptrendx LT means Large Tensor |
@jonatan1626 Thanks for the update. Can you also share the results for the other few models? If there is no performance regression in the 1.6 release, I think we can close this issue. |
I find that the results have changed. Were they obtained by running the same script you provided in the first comment? |
@sxjscience Apologies, I forgot to mention that I am now using this script for ImageNet: |
@apeforest These are the results for the other model runs. These runs were done using cu101-mkl on p3.16xlarge machines with CUDA 10.0. The runs were executed sequentially, so I think the memory issue that you mentioned might be a reason why the 1.6.x numbers are slightly slower. I will rerun these tests to revalidate.
|
Could you run the CPU benchmark with mxnet-mkl or mxnet-cuXXmkl? |
@jonatan1626 Thanks for the detailed report. This looks great. Please run mxnet-mkl for the CPU performance test as @pengzhao-intel suggested. I guess we don't need to report mxnet1.6_LT since it's not an official release. It would be great if you could put your run script together with the logs in a repo and share it here so we can reproduce or track it later on. Thanks. Lin |
I remember we have a plan to make a dashboard to track the performance :) |
@pengzhao-intel The runs just finished; there was an error when running resnet50_v1, so I have restarted the job and will post the results when it is done! It does look like there is a regression between the mkl versions. @apeforest Let me compile and organize the data first, then I'll put it in a repo. I am also figuring out how to push the data to CloudWatch so we have a dashboard to track the performance (a rough sketch follows below)!
|
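A rough sketch of what pushing per-epoch times to CloudWatch with boto3 could look like (the namespace, dimension names, region, and sample values here are placeholders, not what the dashboard will actually use):
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")  # region is a placeholder

def publish_epoch_time(model, version, seconds):
    # One data point per epoch; namespace and dimensions are illustrative.
    cloudwatch.put_metric_data(
        Namespace="MXNetBenchmarks",
        MetricData=[{
            "MetricName": "EpochTime",
            "Dimensions": [
                {"Name": "Model", "Value": model},
                {"Name": "MXNetVersion", "Value": version},
            ],
            "Value": seconds,
            "Unit": "Seconds",
        }],
    )

publish_epoch_time("resnet152_v1", "1.6.x", 123.4)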
I have also uploaded the scripts here. Do let me know if there is anything wrong with how I'm running this! |
@rongzha1 please try to run the script and verify the CPU performance. |
cc @TaoLv |
There seems to be no regression on the MKL-DNN CPU platform.
|
What is the status of this issue? Based on the results gathered by @rongzha1, it seems we can close it? |
Description
I wanted to report an issue with the performance of MXNet 1.6.x compared to the previous version, MXNet 1.5.x.
To Reproduce
To install 1.5:
pip install mxnet-cu101mkl
To install 1.6.x:
Check out https://github.com/apeforest/mxnet-build-script
Change the version to v1.6.x in the Dockerfile
Follow the instructions to launch and build mxnet-cu100
Copy the pip wheel out of the container and pip install it (a sketch follows below)
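The copy step could look roughly like this (the container id and wheel path are placeholders; the actual location depends on the build script):
docker cp <container_id>:<path_to_wheel> .
pip install ./<wheel_file>.whl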
Gluon Model Zoo Site for Cifar 10
Link for the Cifar 10 Training Script
Launch nvidia-smi to get GPU Memory Usage:
nvidia-smi --query-gpu=index,memory.used --format=csv -l 30 -f <file_location>
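One possible way to summarize that log afterwards (this assumes the two-column index, memory.used [MiB] layout produced by the query above; the script name is made up):
# summarize_gpu_mem.py -- report peak memory.used per GPU from the nvidia-smi CSV log
import csv
import sys
from collections import defaultdict

peak = defaultdict(int)
with open(sys.argv[1]) as f:
    for row in csv.reader(f):
        if len(row) < 2 or not row[0].strip().isdigit():
            continue  # skip header lines
        gpu = int(row[0])
        mem_mib = int(row[1].strip().split()[0])  # e.g. "1234 MiB" -> 1234
        peak[gpu] = max(peak[gpu], mem_mib)

for gpu in sorted(peak):
    print(f"GPU {gpu}: peak {peak[gpu]} MiB")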
Launch model training:
python train_cifar10.py --num-gpus 1 --model resnet152_v1 --num-epochs 40
python train_cifar10.py --num-gpus 1 --model resnet101_v1 --num-epochs 40
The script will print out the time it took per epoch.
I used this regex to match and grab the per-epoch time after the first 3 epochs, treating those as warmup.
r'.*\[Epoch ([0-9]*).*\].* time: ([0-9]*\.?[0-9]*)'
I then took the average, min, and max values, roughly as in the sketch below.
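For reference, a sketch of that parsing step (the log filename comes from the command line; the first 3 epochs are skipped as warmup, as described above):
import re
import sys

pattern = re.compile(r'.*\[Epoch ([0-9]*).*\].* time: ([0-9]*\.?[0-9]*)')

times = []
with open(sys.argv[1]) as f:
    for line in f:
        m = pattern.match(line)
        if m:
            epoch, seconds = int(m.group(1)), float(m.group(2))
            if epoch >= 3:  # treat the first 3 epochs as warmup
                times.append(seconds)

print(f"avg={sum(times)/len(times):.2f}s  min={min(times):.2f}s  max={max(times):.2f}s")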