Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[1.x] Backport Fix for duplicate subgraph inputs/outputs (#16131) #19112

Merged
merged 12 commits into from
Sep 15, 2020

Conversation

samskalicky
Copy link
Contributor

@samskalicky samskalicky commented Sep 10, 2020

Description

This PR started out as a backport #16131 to v1.x. But then ran into some issues with MKLDNN subgraphing. I wasnt able to enhance the MKLDNN subgraphing to work with deduplicated inputs/outputs since there were too many hard-coded magic numbers to decode. Instead I added a flag dedup_subgraph to enable deduplicating subgraph inputs/outputs for our needs, but disable it by default to keep current MKLDNN/TensorRT and any other partitioning flows.

So now we set an attribute on the Graph and check for it in build_subgraph.cc "dedup_subgraph":

bool dedup_subgraph = g->HasAttr("dedup_subgraph");
...
CutGraphInputs(input_entries, &orig_input_entries, &unique_orig_entries,
                 &unique_input_entries, false, dedup_subgraph);
...
if (dedup_subgraph)
      subg_prop->ConnectSubgraphInputs(n, &unique_input_entries, &unique_orig_entries);
else
      subg_prop->ConnectSubgraphInputs(n, &input_entries, &orig_input_entries);

We set this attribute on the Graph in MXOptimizeForBackend :

// set dedup option as attribute on graph to enable dedup during partitioning
if (options_map.count("dedup_subgraph") > 0 &&
      options_map.at("dedup_subgraph").compare("True") == 0)
g.attrs["dedup_subgraph"] = std::make_shared<nnvm::any>(std::string("True"));

And it is set by users as an argument to optimize_for:

part_sym = sym.optimize_for(subgraph_backend, arg_dict, aux_dict, dedup_subgraph=True)

* fix for duplicate inputs

* fixed error

* fixed whitespace

* Remove duplicate outputs from subgraphs

* changed subgraph to create map of outputs

* added static_cast

* changed map<int,v> to vector

* sanity fix

* sanity2

* updated backends with new connectSubgraphOutputs API

* fixed map creation logic

* added updates for reattach function

* creating node only if it is not an input to subgraph

* creating object based on var_name only

* updating ConnectSubgraphOutputs for mkldnn_elemwisemul_post_quantize_property.h

* add debug prints to debug error in CI

* remove prints

* added prints to debug in the CI

* revert changes

* reverted changes

* deduplicaated inputs to subgraph

* deduplicated subgraph inputs

* simplified inputs

* cleaned up

* deduplicate outputs

* cleand up

* added deduplication to subgraph node outputs

* fixed prev compare

* fixed issue with inputs and added test

* fixd whitespace, removed prints

Co-authored-by: Sam Skalicky <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Manu Seth <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
@mxnet-bot
Copy link

Hey @samskalicky , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [miscellaneous, windows-cpu, unix-cpu, edge, unix-gpu, website, windows-gpu, centos-gpu, centos-cpu, sanity, clang]


Note:
Only following 3 categories can trigger CI :PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@samskalicky
Copy link
Contributor Author

samskalicky commented Sep 11, 2020

@ZhennanQin @pengzhao-intel Im debugging a test failure with this PR:

[2020-09-10T19:49:11.883Z] ======================================================================
[2020-09-10T19:49:11.883Z] ERROR: test_subgraph.test_mobilenetv2_struct
[2020-09-10T19:49:11.883Z] ----------------------------------------------------------------------
[2020-09-10T19:49:11.883Z] Traceback (most recent call last):
[2020-09-10T19:49:11.883Z]   File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
[2020-09-10T19:49:11.883Z]     self.test(*self.arg)
[2020-09-10T19:49:11.883Z]   File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 215, in test_new
[2020-09-10T19:49:11.883Z]     orig_test(*args, **kwargs)
[2020-09-10T19:49:11.883Z]   File "/work/mxnet/tests/python/mkl/test_subgraph.py", line 815, in test_mobilenetv2_struct
[2020-09-10T19:49:11.883Z]     check_fusion(net, data_shape, attrs, out_types=['int8', 'auto'])
[2020-09-10T19:49:11.883Z]   File "/work/mxnet/tests/python/mkl/../unittest/common.py", line 215, in test_new
[2020-09-10T19:49:11.883Z]     orig_test(*args, **kwargs)
[2020-09-10T19:49:11.883Z]   File "/work/mxnet/tests/python/mkl/test_subgraph.py", line 271, in check_fusion
[2020-09-10T19:49:11.883Z]     exe = sym.bind(ctx=mx.current_context(), args=arg_array, aux_states=aux_array, grad_req='null')
[2020-09-10T19:49:11.883Z]   File "/work/mxnet/python/mxnet/symbol/symbol.py", line 2119, in bind
[2020-09-10T19:49:11.883Z]     ctypes.byref(handle)))
[2020-09-10T19:49:11.883Z]   File "/work/mxnet/python/mxnet/base.py", line 246, in check_call
[2020-09-10T19:49:11.883Z]     raise get_last_ffi_error()
[2020-09-10T19:49:11.883Z] mxnet.base.MXNetError: Traceback (most recent call last):
[2020-09-10T19:49:11.883Z]   [bt] (9) /usr/bin/python3(PyEval_EvalFrameEx+0x4eff) [0x53fe5f]
[2020-09-10T19:49:11.883Z]   [bt] (8) /usr/bin/python3(PyObject_Call+0x47) [0x5c59d7]
[2020-09-10T19:49:11.883Z]   [bt] (7) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(+0x9fcb) [0x7f7bac29efcb]
[2020-09-10T19:49:11.883Z]   [bt] (6) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(_ctypes_callproc+0x49a) [0x7f7bac2ab01a]
[2020-09-10T19:49:11.883Z]   [bt] (5) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7f7bac2b088b]
[2020-09-10T19:49:11.883Z]   [bt] (4) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7f7bac2b0e20]
[2020-09-10T19:49:11.883Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(MXExecutorBindEX+0xdcb) [0x7f7b0675858b]
[2020-09-10T19:49:11.883Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::Executor::Bind(nnvm::Symbol, mxnet::Context const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mxnet::Context, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, mxnet::Executor*)+0x39b) [0x7f7b05d9d67b]
[2020-09-10T19:49:11.883Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x58f27f1) [0x7f7b05d9b7f1]
[2020-09-10T19:49:11.883Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x61) [0x7f7b01151441]
[2020-09-10T19:49:11.883Z]   File "src/executor/graph_executor.cc", line 1892
[2020-09-10T19:49:11.883Z] MXNetError: Check failed: arg_names.size() == in_args_map.size() (8 vs. 7) : 

https://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-19112/runs/1/nodes/296/steps/781/log/?start=0
And I think I narrowed it down to this part of the mkldnn conv subgraph:
https://github.com/apache/incubator-mxnet/blob/2d077db1c92dbf7db979e15609fddf5f371277c0/src/operator/subgraph/mkldnn/mkldnn_conv_property.h#L273-L277
Commenting out the rotation seems to resolve the issue.

But now im getting a segfault:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffe60e4bc9 in nnvm::pass::(anonymous namespace)::MXAllocMemory(nnvm::Graph const&, nnvm::IndexedGraph const&, std::pair<unsigned int, unsigned int> const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*, std::vector<unsigned int, std::allocator<unsigned int> > const&, nnvm::pass::(anonymous namespace)::MXGraphAllocator*) () from /home/ubuntu/subgraph_fixv18/python/mxnet/../../lib/libmxnet.so
(gdb) bt
#0  0x00007fffe60e4bc9 in nnvm::pass::(anonymous namespace)::MXAllocMemory(nnvm::Graph const&, nnvm::IndexedGraph const&, std::pair<unsigned int, unsigned int> const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*, std::vector<unsigned int, std::allocator<unsigned int> > const&, nnvm::pass::(anonymous namespace)::MXGraphAllocator*) ()
   from /home/ubuntu/subgraph_fixv18/python/mxnet/../../lib/libmxnet.so
#1  0x00007fffe60e6884 in nnvm::pass::(anonymous namespace)::MXPlanMemory(nnvm::Graph) ()
   from /home/ubuntu/subgraph_fixv18/python/mxnet/../../lib/libmxnet.so
#2  0x00007fffe60ac9bc in std::_Function_handler<nnvm::Graph (nnvm::Graph), nnvm::Graph (*)(nnvm::Graph)>::_M_invoke(std::_Any_data const&, nnvm::Graph&&) () from /home/ubuntu/subgraph_fixv18/python/mxnet/../../lib/libmxnet.so
#3  0x00007fffe7f8a4ce in nnvm::ApplyPasses(nnvm::Graph, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) ()
   from /home/ubuntu/subgraph_fixv18/python/mxnet/../../lib/libmxnet.so
#4  0x00007fffe61318ad in nnvm::ApplyPass(nnvm::Graph, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /home/ubuntu/subgraph_fixv18/python/mxnet/../../lib/libmxnet.so
#5  0x00007fffe69e97c9 in mxnet::exec::GraphExecutor::FinishInitGraph(nnvm::Symbol, nnvm::Graph, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&) ()
   from /home/ubuntu/subgraph_fixv18/python/mxnet/../../lib/libmxnet.so
#6  0x00007fffe69eb1c6 in mxnet::exec::GraphExecutor::Init(nnvm::Symbol, mxnet::Context const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mxnet::Context, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, mxnet::Executor*, std::unordered_map<nnvm::NodeEntry, mxnet::NDArray, nnvm::NodeEntryHash, nnvm::NodeEntryEqual, std::allocator<std::pair<nnvm::NodeEntry const, mxnet::NDArray> > > const&) () from /home/ubuntu/subgraph_fixv18/python/mxnet/../../lib/libmxnet.so
#7  0x00007fffe69f7f86 in mxnet::Executor::Bind(nnvm::Symbol, mxnet::Context const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mxnet::Context, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, mxnet::Context> > > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, mxnet::Executor*) ()
   from /home/ubuntu/subgraph_fixv18/python/mxnet/../../lib/libmxnet.so

I debugged further and found the segfault was coming from here:
https://github.com/apache/incubator-mxnet/blob/664889a2c02705352061c0b43c2c9b46085bd868/src/operator/subgraph/mkldnn/mkldnn_conv.cc#L80-L83
I changed the return 2 +... to return 1 +... now that we removed extra inputs. Can you guys explain how this "magic calculation" works?

@samskalicky
Copy link
Contributor Author

There seems to be a lot of hard coded magic numbers in the subgraph/mkldnn. @HahTK I dont think we're going to be able to get this PR merged in time for the v1.8 code freeze without some help. Worst case we'll just get this later on master in 2.0.

@samskalicky
Copy link
Contributor Author

@HahTK ive added support for specifying dedup_subgraph=True to optimize_for to enable subgraph input/output deduplication for our usage. This will allow us to move forward in parallel while we work with the Intel folks to incorporate the improvement for MKLDNN later.

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review labels Sep 13, 2020
@samskalicky
Copy link
Contributor Author

@mseth10 @HahTK this PR needs another review since we ran into issues and had to tweak the features a bit. Please review, thanks!

@samskalicky samskalicky added the pr-awaiting-review PR is waiting for code review label Sep 13, 2020
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review labels Sep 13, 2020
@lanking520 lanking520 added pr-awaiting-review PR is waiting for code review pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-review PR is waiting for code review pr-awaiting-testing PR is reviewed and waiting CI build and test labels Sep 13, 2020
std::vector<nnvm::NodeEntry> *unique_orig_entries,
std::vector<nnvm::NodeEntry*> *unique_input_entries,
const bool skip_var = false,
const bool dedup = false) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment applies everywhere where we set the defaults for the dedup flag.

It seems like this is more a problem with MKLDNN.
I understand they have done that for master branch.
Ideally, they fix it at their end for 1.8 as well.

Failing that, we should go with default = true, except when compiled for MKLDNN and optimize_for() is called by CPU. All other paths would want to optimize away the dupe input.
This amounts to MKLDNN patching their fix.

GIven the timelines for MXNet 1.8, this may be too tight anyway.
More importantly, the MKLDNN issues are fixed in master.

I am ok going ahead with this as is

Copy link
Contributor

@mseth10 mseth10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@samskalicky samskalicky added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-review PR is waiting for code review labels Sep 14, 2020
@samskalicky samskalicky merged commit 9dfac79 into apache:v1.x Sep 15, 2020
@bgawrych
Copy link
Contributor

@samskalicky I've working on enabling MKLDNN partitioning on master and discovered that this PR broke this mobilenet_struct_v2 test. I haven't seen this PR before, so I'm sorry for late response:

image

I've described where problem is in the picture above.
I want to ask if you have some idea how this could be resolved.
I thought about providing additional parameter to MKLDNN fused operator, but I consider if there is some other better way?

@samskalicky
Copy link
Contributor Author

@samskalicky I've working on enabling MKLDNN partitioning on master and discovered that this PR broke this mobilenet_struct_v2 test. I haven't seen this PR before, so I'm sorry for late response:
I've described where problem is in the picture above.
I want to ask if you have some idea how this could be resolved.
I thought about providing additional parameter to MKLDNN fused operator, but I consider if there is some other better way?

Hi @bgawrych sorry looks like I tagged the wrong people, will start tagging you for MKLDNN related issues in the future.

There seems to be a lot of hard coded magic numbers in the subgraph/mkldnn

The problem that I ran into in v1.x was that the mkldnn partitioning flow was expecting the input to be duplicated, and was playing around with wiring up the the subgraph nodes. I wasnt able to decode the current flow and make it work with deduplicated inputs.

Here are two of the places where the code is expecting inputs to be duplicated:
https://github.com/apache/incubator-mxnet/blob/b9105784fa4d9ffc589c4d1010b3384a5420969d/src/operator/subgraph/mkldnn/mkldnn_conv.cc#L80-L83
https://github.com/apache/incubator-mxnet/blob/b9105784fa4d9ffc589c4d1010b3384a5420969d/src/operator/subgraph/mkldnn/mkldnn_conv.cc#L126-L129

Ideally, I would want to refactor the code so that there no baked in expectations. And however the subgraph is partitioned, will work with the subgraph op (ie. SgMKLDNNConvOperator).

@samskalicky
Copy link
Contributor Author

In the meantime, i'll work on porting the dedup_subgraph option to the master branch and we can work on improving the MKLDNN subgraphing there.

part_sym = sym.optimize_for(subgraph_backend, ..., dedup_subgraph=True)

samskalicky added a commit that referenced this pull request Oct 1, 2020
* initial commit

* update build_subgraph

* added test

* calling test

Co-authored-by: Ubuntu <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-merge Review and CI is complete. Ready to Merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants