-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[MXNET-33] SSD example not working with mkl-dnn #10021
[MXNET-33] SSD example not working with mkl-dnn #10021
Conversation
There is a test case for max pooling with |
@@ -92,6 +92,8 @@ inline bool SupportMKLDNNPooling(const PoolingParam ¶m, | |||
|
|||
if (param.pooling_convention == pool_enum::kValid) | |||
return true; | |||
else | |||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes the later code unreachable. Are you sure this is what you want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@piiswrong the later code is being disabled for now until we support all "full" pooling convention cases. As @TaoLv mentioned, i will add a unit-test for this failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use "#if 0" to comment the checks below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a test that produces the crash
@marcoabreu @TaoLv @zheng-da Unit-tests currently do not seem to test backward/training mode for pooling very well. Below is the test for reproducing this crash.
This PR will fix the above crash. |
Feel free to develop and add an appropriate unit test yourself. Please don't hesitate to reach out to us in case you need assistance. |
@marcoabreu this issue only happens inside a particular model (SSD VGG16-reduced), i clarified in title/description. I'm not sure a special unittest just for this specific model is appropriate. I am not able to reproduce this issue on separately. please let me know if any other suggestions/feedback, thanks!. |
That's not an issue, we're already testing a variety of models and the bigger diversity, the better. We can re-visit this decision later on and reduce the test size if required. For now, it would be good to get a test since we would like to re-enable pooling for MKL-DNN at a later point in time - this test would give us a tool to verify that it's actually working again. |
tests/python/cpu/test_mkldnn.py
Outdated
""" | ||
|
||
import mxnet as mx | ||
model = os.path.join(os.path.dirname(os.path.realpath(__file__)), "test_model.json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you move this file to a subdirectory like tests/python/cpu/data/test_mkldnn_ test_mkldnn_model_model1.json? (format is %FILENAME%_%TESTNAME%_model%NUMBER%.json)?
tests/python/cpu/test_mkldnn.py
Outdated
"variable settings" | ||
|
||
|
||
def test_mkldnn_model(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain that this test has no asserts because it is only supposed to run successfully without failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marcoabreu i added assert for sanity check, and renamed file as you mentioned. any other suggestions , please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
|
||
// need to support pooling convention full |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a link to the issue as part of the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean , link to this PR ? or JIRA ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To jira
tests/python/cpu/test_mkldnn.py
Outdated
"variable settings" | ||
|
||
|
||
def test_mkldnn_model(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
773a5af
to
6ca171a
Compare
#10092 |
@ashokei @pengzhao-intel do you know when the Intel MKLDNN team will fix the bug in MKLDNN? |
tests/python/cpu/test_mkldnn.py
Outdated
@@ -0,0 +1,94 @@ | |||
# Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests in tests/python/cpu
are currently not run in CI. Please add it to the MKLDNN jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marcoabreu This PR only adds to existing test, no new test file is created. Can you be specific which CI jobs you are referring to ? If there is a gap in CI, a new JIRA item can be created to address it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to https://github.com/apache/incubator-mxnet/blob/master/Jenkinsfile#L439, which in the end points to https://github.com/apache/incubator-mxnet/blob/master/ci/docker/runtime_functions.sh#L387, we're not running the tests located in tests/python/cpu. Please rename tests/python/cpu to tests/python/mkl and add another function in the same style as https://github.com/apache/incubator-mxnet/blob/master/Jenkinsfile#L458.
@zheng-da This is not a bug in MKL-DNN, it is an issue with current MKL-DNN mxnet integration. This PR will use CPU fallback. |
what i mean is that mkldnn doesn't support "full" pooling convention. will mkldnn eventually support it? if it does, when will it support it? |
@zheng-da sorry for the confusion, mkl-dnn has more general pooling support, it allows us to implement any arbitrary convention we like. So it is not bug in mkl-dnn, it is just that we havent integrated this specific use-case in our mxnet mkldnn pooling implementation yet. |
oh, i see. do you mind implementing it? fallback is fine, but it'll be better if it can invoke the one in MKLDNN. |
yes, we plan to add new features for pooling implementation with mkl-dnn in future PR. |
@marcoabreu does this look ok, thanks. |
904b3fa
to
8dd8e35
Compare
7e63aa1
to
ed0f909
Compare
@marcoabreu fixed now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job, thanks for fixing the example :)
ed0f909
to
8d4171d
Compare
@ashokei can you please check if the CI failure is related to your change and retrigger it if not ? |
8d4171d
to
f6cbd76
Compare
@anirudh2290 done...thanks. |
@piiswrong is this good to merge ? |
@piiswrong this PR is to fix a bug in the MKLDNN integration. MKLDNN pooling can't handle full pooling convention correctly. |
f6cbd76
to
079fba2
Compare
079fba2
to
4fbee8d
Compare
@piiswrong can you please merge this, this PR also has mkldnn jenkins/unittest as requested by @marcoabreu |
* use mkl-dnn for 'valid' pooling_convention only * pooling convention full not supported by current mkl-dnn impl * disable unreachable code * add sample model test for mkldnn * fix review feedback * add jira link to comment * fix lint issue * rename python test for mkl * enable python tests for mkldnn in CI * use vgg16 with convention full * fix unittest
* use mkl-dnn for 'valid' pooling_convention only * pooling convention full not supported by current mkl-dnn impl * disable unreachable code * add sample model test for mkldnn * fix review feedback * add jira link to comment * fix lint issue * rename python test for mkl * enable python tests for mkldnn in CI * use vgg16 with convention full * fix unittest
* use mkl-dnn for 'valid' pooling_convention only * pooling convention full not supported by current mkl-dnn impl * disable unreachable code * add sample model test for mkldnn * fix review feedback * add jira link to comment * fix lint issue * rename python test for mkl * enable python tests for mkldnn in CI * use vgg16 with convention full * fix unittest
* use mkl-dnn for 'valid' pooling_convention only * pooling convention full not supported by current mkl-dnn impl * disable unreachable code * add sample model test for mkldnn * fix review feedback * add jira link to comment * fix lint issue * rename python test for mkl * enable python tests for mkldnn in CI * use vgg16 with convention full * fix unittest
Description
SSD VGG16-reduced example is not working if pooling convention is "full" with Max pool. This PR resolves the this issue by defaulting to CPU implementation when pooling convention is "full".
Checklist
Essentials
make lint
)Changes
Comments