
MXNet external operators #18904

Closed · wants to merge 49 commits

Conversation

samskalicky
Contributor

@samskalicky samskalicky commented Aug 11, 2020

Description

Hi MXNet community,

I would like to propose another feature in MXNet Extensions for "external ops". Currently we have the following categories of operators in MXNet:

  • builtin/backend operators are those defined in the src/operator directory and are compiled into libmxnet.so
  • Python custom operators are custom operators written in Python by users (not part of MXNet source code)
  • C++ custom operators are custom operators written in C++ by users (not part of MXNet source code) that use the MXNet Extensions (lib_api.h)

External operators will be builtin/backend operators but not compiled into libmxnet.so. Instead, they will be compiled into a separate shared object and dynamically loaded at runtime similar to C++ custom operators.

Motivation

Current C++ custom operators were designed to be simple to write and easy to add to MXNet. They were not intended to be feature-equivalent to builtin/backend operators. Many users start with a fork of MXNet, add some custom operators, and then try to use the C++ custom operator API to load them. But C++ custom operators were not designed to be used this way.

Current examples

  • [MXNET-1446] Quantization: intgemm matrix multiply wrappers #17559. The intgemm operators could be made available as a separate library of ops instead of being contributed directly to the MXNet repo for distribution to users
  • Prototyping of new operators (like those for BERT) can have a better flow:
    • start in custom fork, build, test
    • compile into separate library, shared with others, test
    • if the community rejects the new ops, or the new ops complicate the MXNet codebase, it is best to keep them separate
    • keep op source code in separate library, depend on MXNet as 3rdparty/submodule, build/distribute separate library of ops

Current prototype flow

  1. add your custom op files to src/operator directory
  2. compile MXNet normally
  3. find the my_op.cc.o file(s) in the build directory
  4. package *.o files into a shared object like libmy_op.so
  5. dynamically load shared library into MXNet via mx.library.load()

The example in example/extensions/lib_external_ops shows a minimal working example. In the min_ex-inl.h file we define the minimum required components of a backend operator in MXNet:

void MinExForward(const nnvm::NodeAttrs& attrs,
                  const OpContext& ctx,
                  const std::vector<TBlob>& inputs,
                  const std::vector<OpReqType>& req,
                  const std::vector<TBlob>& outputs) {
  // do nothing
}

inline bool MinExOpShape(const nnvm::NodeAttrs& attrs,
                         mxnet::ShapeVector* in_attrs,
                         mxnet::ShapeVector* out_attrs) {
  // do nothing
  return true;
}

inline bool MinExOpType(const nnvm::NodeAttrs& attrs,
                        std::vector<int> *in_attrs,
                        std::vector<int> *out_attrs) {
  // do nothing
  return true;
}

Then in the min_ex.cc file we register the operator:

NNVM_REGISTER_OP(min_ex)
.describe("some description")
.set_num_inputs(0)
.set_num_outputs(0)
.set_attr<mxnet::FInferShape>("FInferShape", MinExOpShape)
.set_attr<nnvm::FInferType>("FInferType", MinExOpType)
.set_attr<FCompute>("FCompute<cpu>", MinExForward);

While this operator doesn't actually do anything, it has enough parts to register successfully and execute. Putting these two files in src/operator and building MXNet will produce a min_ex.cc.o file somewhere in the build directory. After copying this file back into the example/extensions/lib_external_ops directory we can build a shared object with the operator implementation in it. The compilation command in this example is:

g++ -shared -fPIC -std=c++11 init_lib.cc min_ex.cc.o ../../../src/lib_api.cc -o libmin_ex.so -I../../../include -L../../../build -lmxnet

Notice that we also compile lib_api.cc and init_lib.cc into the library; these provide the APIs necessary to load it with the MXLoadLib API. We also link against libmxnet.so since the library depends on symbols in MXNet (after all, that's the whole point: for this operator to have access to all the internal goodies that regular operators have). After building libmin_ex.so we can load it into MXNet dynamically at runtime by using the existing library.load() API:

mx.library.load('libmin_ex.so')

During loading of the library (i.e., in dlopen) the operator will be registered directly into the MXNet operator registry. There are no additional overheads or lambda functions for these operators.
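
For reference, here is a rough sketch of what the init_lib.cc used above typically looks like for an MXNet extension library. The initialize hook and the MXReturnValue/MX_SUCCESS/MX_FAIL names come from lib_api.h; the version check shown is illustrative and may differ from the file in this PR:

// init_lib.cc: a minimal sketch assuming the standard lib_api.h extension hook.
// Exact type and macro names may differ between MXNet versions.
#include <iostream>
#include "lib_api.h"

// Called by MXNet (via MXLoadLib) right after dlopen to verify compatibility.
MXReturnValue initialize(int version) {
  if (version >= 10700) {  // illustrative minimum-version check
    return MX_SUCCESS;
  }
  std::cout << "MXNet version " << version << " not supported" << std::endl;
  return MX_FAIL;
}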

Other Considerations

  • Needed to refactor lib_api.h into lib_api.cc so that we can compile the necessary APIs into the shared object in order to load the library using the same MXLoadLib API
  • Needed to remove the dlclose on libraries loaded inside libmxnet.so. Since the operators exist in the library loaded by MXNet, MXNet needs to destruct and free all allocated memory before exiting, but the objects/functions in the library still need to be available during that destruction. So the custom library must outlive libmxnet.so. For Python, we can move the dlclose from libmxnet.so to the Python frontend without issue. But C/C++ users will need to call dlclose on their own (see the sketch below). So we had to change the signature of MXLoadLib to: MXNET_DLL int MXLoadLib(const char *path, unsigned verbose, void** lib); and return the void* pointer to the library.
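
As an illustration of the new signature, a C/C++ frontend might use it roughly like this. This is a sketch only, not code from this PR; it assumes MXLoadLib is declared in mxnet/c_api.h and uses the existing MXGetLastError API for error reporting:

// A minimal sketch of a C++ frontend owning the dlclose call (not code from this PR).
#include <dlfcn.h>
#include <cstdio>
#include <mxnet/c_api.h>  // assumed location of the MXLoadLib declaration

int main() {
  void* lib_handle = nullptr;
  // Load the external-op library; its operators register during dlopen.
  if (MXLoadLib("libmin_ex.so", 1 /*verbose*/, &lib_handle) != 0) {
    std::fprintf(stderr, "MXLoadLib failed: %s\n", MXGetLastError());
    return 1;
  }

  // ... build and run models that use the external operators ...

  // The caller, not libmxnet.so, closes the library, and only after MXNet has
  // finished tearing down everything that may still reference symbols in it.
  dlclose(lib_handle);
  return 0;
}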

Open Questions

  • What symbols do we need to make available in libmxnet.so for external operators that might be stripped out?
  • How to validate the version of MXNet that the custom ops were compiled with, versus the version of MXNet that the library is loaded into
  • How to validate build options used to compile ops, versus build options of MXNet that the library is loaded into

@mxnet-bot

Hey @samskalicky, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [clang, website, sanity, unix-gpu, windows-gpu, centos-gpu, windows-cpu, edge, unix-cpu, miscellaneous, centos-cpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@szha
Member

szha commented Aug 16, 2020

@samskalicky thanks for starting on this! Does it supersede the custom C++ ops?

@samskalicky
Contributor Author

@samskalicky thanks for starting on this! Does it supersede the custom C++ ops?

No, I think C++ custom ops are still useful for those not familiar with the MXNet backend or who don't want to build MXNet from source. C++ custom ops are an improvement on Python custom ops: easy to write, easy to use, almost as fast as built-in ops.

External ops will be for those willing to build MXNet from source, or who need the absolute highest performance for the op. In terms of performance we're talking about ops with a short compute time (i.e., less than 10ms) where the overhead of the C++ custom op is not tolerable. For ops with lots of computation, the perf will be bounded by the compute and the C++ custom op overhead won't be the bottleneck.

I'm working on a project (at some point soon I'll be able to share more details) using C++ custom ops for subgraphs, and am able to achieve performance improvements over built-in MXNet ops. So there's still motivation to keep them for their simplicity.

@samskalicky samskalicky requested a review from leezu as a code owner August 16, 2020 18:22
@samskalicky samskalicky changed the title [WIP] MXNet external operators [RFC] MXNet external operators Aug 16, 2020
@samskalicky
Contributor Author

Hi MXNet community,

I would like to propose another feature in MXNet Extensions for "external ops". Please check out the detailed summary [1].

Thanks!
Sam

[1] #18904

@szha szha added the RFC Post requesting for comments label Aug 17, 2020
@kpuatamazon
Contributor

What sort of binary compatibility guarantees does MXNet make for this interface? If the safe option is always to compile at the same time, I don't see much difference from including it in the build?

@samskalicky
Contributor Author

What sort of binary compatibility guarantees does MXNet make for this interface? If the safe option is always to compile at the same time, I don't see much difference from including it in the build?

This interface makes no binary compatibility guarantees. You would need to compile and link your operators against the version of MXNet you intend to load your library into. Further, you would need to build it in the same environment as MXNet (gcc, glibc, etc.).

The main benefits of external operators over built-in operators are:

  1. They do not need to be part of the MXNet codebase (no PR, no community approval, etc.). This also lets you iterate more rapidly on your custom operators, since you don't need them to be part of a formal MXNet release or go through the PR process. Bug fixes on your custom operators can happen in parallel outside of the MXNet community.
  2. You can distribute your custom ops separately from MXNet. For example, if you already have another Python package that you distribute (i.e., as a wheel) you can package your operators within that package and dynamically load them into MXNet.

Of course, you can still always build and distribute your own fork of MXNet with your custom operators in it.

This style of external operator is similar to how other DL frameworks support custom operators. So if you're familiar with writing custom operators for TF, PT, etc this is exactly the same.

@kpuatamazon
Contributor

kpuatamazon commented Aug 17, 2020

I think this would be much cleaner if it was a separate directory because:

  • Version control is much easier. Just update mxnet or rm -rf it without extra cruft
  • This reflects reality much better. Somebody else builds mxnet for pip without knowing about my stuff. MXNet ships a docker in which I can build my thing for binary compatibility. (I think half of this already exists for running tests.)
  • MXNet should be usable as a submodule.

So my ideal instructions look more like

  1. compile MXNet normally from a clean checkout
  2. cd into your own project, configure with -DMXNET=/path/to/mxnet and compile. Sample project provided and part of integration tests.
  3. my_op.so built by my build system
  4. dynamically load shared library into MXNet via mx.library.load()

@samskalicky
Contributor Author

samskalicky commented Aug 17, 2020

I think this would be much cleaner if it was a separate directory because:

  • Version control is much easier. Just update mxnet or rm -rf it without extra cruft
  • This reflects reality much better. Somebody else builds mxnet for pip without knowing about my stuff. MXNet ships a docker in which I can build my thing for binary compatibility.
  • MXNet should be usable as a submodule.

So my ideal instructions look more like

  1. compile MXNet normally from a clean checkout
  2. cd into your own project, configure with -DMXNET=/path/to/mxnet and compile. Sample project provided and part of integration tests.
  3. my_op.so built by my build system
  4. dynamically load shared library into MXNet via mx.library.load()

Agreed, if only the MXNet codebase were better organized. We have headers/includes scattered throughout the codebase, not just in include/mxnet. For example, mxnet_op.h is in src/operator, and include/mshadow/base.h includes mkl_blas.h which isn't in the MXNet codebase. Duplicating and reproducing the MXNet cmake flow for building custom operators is not worth the hassle/maintenance.

If others have ideas they wanna give a whirl, feel free to check out this branch and try building min_ex.cc in the lib_external_ops directory. Happy to collaborate on this.

I added a test target in the Makefile in lib_external_ops to try and compile/link as you suggest:
https://github.com/apache/incubator-mxnet/pull/18904/files#diff-8a8d486c6b362b11bec05cdd67b3c3bdR32-R33
But currently it's failing with:

In file included from ../../../include/mshadow/tensor.h:35:0,
                 from ../../../include/mxnet/base.h:33,
                 from ../../../src/operator/mxnet_op.h:30,
                 from min_ex-inl.h:26,
                 from min_ex.cc:20:
../../../include/mshadow/./base.h:173:12: fatal error: mkl_blas.h: No such file or directory
   #include <mkl_blas.h>
            ^~~~~~~~~~~~

I'm happy to be wrong, but I think this is an indication of more to come in terms of managing a complex set of includes/dependencies that will constantly be changing. And we haven't even gotten into the cmake options mapping to defines/compile options/etc.

config/linux_gpu.cmake (outdated review thread, resolved)
@samskalicky
Contributor Author

So my ideal instructions look more like

  1. compile MXNet normally from a clean checkout
  2. cd into your own project, configure with -DMXNET=/path/to/mxnet and compile. Sample project provided and part of integration tests.
  3. my_op.so built by my build system
  4. dynamically load shared library into MXNet via mx.library.load()

@kpuatamazon Here's another idea. In order to build files that used to be in src/operator, for example, we need to have all the defines, includes, etc. that are normally set in the MXNet build:

[ 98%] Building CXX object CMakeFiles/external_lib.dir/example/extensions/lib_external_ops/min_ex.cc.o
/usr/bin/ccache /usr/bin/c++  -DDMLC_CORE_USE_CMAKE -DDMLC_LOG_FATAL_THROW=1 -DDMLC_LOG_STACK_TRACE_SIZE=0 -DDMLC_MODERN_THREAD_LOCAL=0 -DDMLC_STRICT_CXX11 -DDMLC_USE_CXX11 -DDMLC_USE_CXX11=1 -DDMLC_USE_CXX14 -DMSHADOW_INT64_TENSOR_SIZE=0 -DMSHADOW_IN_CXX11 -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_CUDA=0 -DMSHADOW_USE_MKL=0 -DMSHADOW_USE_SSE -DMXNET_USE_BLAS_OPEN=1 -DMXNET_USE_LAPACK=1 -DMXNET_USE_LIBJPEG_TURBO=0 -DMXNET_USE_MKLDNN=1 -DMXNET_USE_OPENCV=1 -DMXNET_USE_OPENMP=1 -DMXNET_USE_OPERATOR_TUNING=1 -DMXNET_USE_SIGNAL_HANDLER=1 -DNDEBUG=1 -D__USE_XOPEN2K8 -Dexternal_lib_EXPORTS -I/home/ubuntu/external_ops/3rdparty/mkldnn/include -I/home/ubuntu/external_ops/build/3rdparty/mkldnn/include -I/home/ubuntu/external_ops/include -I/home/ubuntu/external_ops/src -I/home/ubuntu/external_ops/3rdparty/tvm/nnvm/include -I/home/ubuntu/external_ops/3rdparty/tvm/include -I/home/ubuntu/external_ops/3rdparty/dmlc-core/include -I/home/ubuntu/external_ops/3rdparty/dlpack/include -I/home/ubuntu/external_ops/include/mxnet -I/home/ubuntu/external_ops/3rdparty/mshadow -I/home/ubuntu/external_ops/3rdparty/mkldnn/src/../include -I/home/ubuntu/external_ops/build/3rdparty/dmlc-core/include -isystem /usr/local/include -isystem /usr/local/include/opencv4  -Wall -Wno-sign-compare -O3 -fopenmp -fPIC   -Wno-unused-parameter -Wno-unknown-pragmas -Wno-unused-local-typedefs -msse3 -mf16c -std=gnu++1z -o CMakeFiles/external_lib.dir/example/extensions/lib_external_ops/min_ex.cc.o -c /home/ubuntu/external_ops/example/extensions/lib_external_ops/min_ex.cc
[ 98%] Linking CXX shared library libexternal_lib.so
/usr/local/lib/python3.6/dist-packages/cmake/data/bin/cmake -E cmake_link_script CMakeFiles/external_lib.dir/link.txt --verbose=1
/usr/bin/c++ -fPIC   -Wall -Wno-sign-compare -O3 -fopenmp  -shared -Wl,-soname,libexternal_lib.so -o libexternal_lib.so CMakeFiles/external_lib.dir/example/extensions/lib_external_ops/init_lib.cc.o CMakeFiles/external_lib.dir/example/extensions/lib_external_ops/min_ex.cc.o CMakeFiles/external_lib.dir/src/lib_api.cc.o -Wl,-rpath,/home/ubuntu/external_ops/build:/usr/local/lib:/home/ubuntu/external_ops/build/3rdparty/openmp/runtime/src libmxnet.so 3rdparty/mkldnn/src/libdnnl.a /usr/local/lib/libopenblas.so /usr/lib/x86_64-linux-gnu/librt.so /usr/local/lib/libopencv_highgui.so.4.2.0 /usr/local/lib/libopencv_videoio.so.4.2.0 /usr/local/lib/libopencv_imgcodecs.so.4.2.0 /usr/local/lib/libopencv_imgproc.so.4.2.0 /usr/local/lib/libopencv_core.so.4.2.0 3rdparty/openmp/runtime/src/libomp.so -ldl -lpthread -lpthread -llapack 3rdparty/dmlc-core/libdmlc.a /usr/lib/gcc/x86_64-linux-gnu/7/libgomp.so -lpthread -lrt 

After all, the use case for this feature is to be able to use custom components with a publicly released build of MXNet (without those components compiled in statically). So initially, those components were built as part of an MXNet build at some point, and they will need to continue being built the same way. With all of MXNet's complicated layout (includes all over the place, not just in the include/mxnet directory) and 3rd-party submodules it's not currently possible to just build MXNet and link it against a custom library. The hardest part is figuring out all the defines/includes/etc. to compile a library with.

In this new idea, there's a build target called "external_lib" in MXNet's CMakeLists.txt (just like we already have for other extensions examples like "pass_lib") that compiles all *.cc and *.cu files in the example/extensions/lib_external_ops directory. After copying your custom files into the lib_external_ops directory you can just run cmake and make to build your custom library.

I've updated the README for the example as:
https://github.com/apache/incubator-mxnet/pull/18904/files#diff-70cbaa0e978356ecb01db8beb907ab48R31-R38

@samskalicky
Contributor Author

@kpuatamazon I had a follow-up idea: I moved all the library-specific build stuff into a CMakeLists.txt in the example/extensions/lib_external_ops directory. So now all of your library-specific build "stuff" is there and not in the main MXNet CMakeLists.txt. This way, building a custom library does not require changing the main MXNet CMakeLists.txt, only the small one specific to your custom library.

The main MXNet CMakeLists.txt just refers to the one in the lib_external_ops directory:
https://github.com/apache/incubator-mxnet/pull/18904/files#diff-af3b638bc2a3e6c650974192a53c7291R709

You still have to build MXNet, and then use MXNet's CMakeLists.txt to generate the Makefile for your custom library. But at least now the control is completely on the custom library. You can clone MXNet, build it, delete all the example files in example/extensions/lib_external_ops, drop in all of your files in the same directory, and build just the "external_lib" target.

@marcoabreu
Contributor

So that does mean a workflow where people pip install mxnet and then build their own operator on top will not be possible? They will still have to maintain a fork, although the separation is a bit clearer now?

I think we can draw a lot of benefits if one does not have to compile mxnet itself, as that would allow us to start working with optional features as separately distributed and maintained components. Right now there seems to be a very tight coupling and I understand where it comes from, but the question is whether we see the vision as valuable and if we can work out a plan that would enable it. I understand that the current build system is not built with that in mind, but mxnet 2.0 would give us the opportunity to break some things to pave the path.

@samskalicky
Contributor Author

So that does mean a workflow where people pip install mxnet and then build their own operator on top will not be possible? They will still have to maintain a fork, although the separation is a bit clearer now?

I think we can draw a lot of benefits if one does not have to compile mxnet itself, as that would allow us to start working with optional features as separately distributed and maintained components. Right now there seems to be a very tight coupling and I understand where it comes from, but the question is whether we see the vision as valuable and if we can work out a plan that would enable it. I understand that the current build system is not built with that in mind, but mxnet 2.0 would give us the opportunity to break some things to pave the path.

Yes, external operators will not be possible this way. If you don't want to build MXNet you can use C++ custom ops (but not external ops).

@marcoabreu
Contributor

How about components in general? Let's say we would like to externalize the onnx support or some other component? We would still always have to build these things together and specify the features set during compile time, right?

@samskalicky
Contributor Author

How about components in general? Let's say we would like to externalize the onnx support or some other component? We would still always have to build these things together and specify the features set during compile time, right?

It depends on what "externalize the onnx support" entails. Today the only components we register are: operators, subgraph properties, and graph passes (anything else?). The current onnx support is implemented entirely in Python.

Maybe we could externalize TensorRT as a custom subgraph property & custom op though.

You don't need to build them together as much as you need to reproduce the same build settings (defines, includes, etc.).

The big reason that you can't just compile your code and link against libmxnet.so is the same as why you can't build a custom application using the C++ API without compiling it with MXNet: it depends on the entire codebase (+ 3rd-party submodules).

@@ -24,7 +24,8 @@
 #
 # $ cp config/linux_gpu.cmake config.cmake
 #
-# Next modify the according entries, and then compile by
+# Next modify the entries in the config.cmake like MXNET_CUDA_ARCH to set the specific
+# GPU architecture, and then compile by
Contributor Author

Thanks @leezu for that feedback. How about this, where we point out that users might want to set MXNET_CUDA_ARCH when using the linux_gpu.cmake file, and then let them refer below for the specific details? This at least points out that they might need to set MXNET_CUDA_ARCH in order to build for GPU.

# - "All" for all available GPU architectures supported by the version of CUDA installed
# - "specific GPU architectures" by giving the compute capability number such as
# "7.0" or "7.0;7.5" (ie. sm_70 or sm_75) or you can specify the name like:
# "Volta" or "Volta;Turing".
Contributor Author

@leezu I tried to merge my version with what was there before. How's this?

@samskalicky
Contributor Author

Do you mean something like this: https://github.com/Zha0q1/incubator-mxnet/blob/static_openblas/cmake/exclude_openblas.ver?

Thanks @Zha0q1, @szha. For Windows the default is not to export any symbols (according to this: https://stackoverflow.com/q/29038914). So currently I'm getting this error:

in_ex.cc.obj : error LNK2019: unresolved external symbol "public: static class dmlc::Registry<class nnvm::Op> * __cdecl dmlc::Registry<class nnvm::Op>::Get(void)" (?Get@?$Registry@VOp@nnvm@@@dmlc@@SAPEAV12@XZ) referenced in function "void __cdecl mxnet::op::`dynamic initializer for '__make_NnvmOp_min_ex0''(void)" (??__E__make_NnvmOp_min_ex0@op@mxnet@@YAXXZ)

This seems to stem from missing exported symbols for nnvm. Do we want to expose an identical set of symbols between Windows and Linux?

If not, then we should have an identical library loading procedure (like C++ custom ops) where we re-register components from the library into MXNet's internal registry. But this will limit the types of components a user can put in their library by explicitly requiring MXNet to implement support for each (instead of just making the symbols exposed and allowing users to load libraries that just hook into those symbols).

@szha
Member

szha commented Aug 25, 2020

I think we should expose the MX/NN symbols, unless there's a downside to it.

@samskalicky
Contributor Author

I think we should expose the MX/NN symbols, unless there's a downside to it.

The complication is how we go about doing the exporting. @yajiedesign can maybe add some more details, but a cursory search shows that you can use the CMake property WINDOWS_EXPORT_ALL_SYMBOLS to export everything and make it in line with how gcc works. But this will bloat the already bloated Windows DLLs. The other way to do it would be to go to the source files and add __declspec(dllexport) to only the symbol declarations we want to export. This is probably not feasible since it would mean we need to modify the 3rd-party submodules as well. Plus, we'd need to do a bunch of work to figure out the required set of symbols manually.

So really the only option is to use the CMake property WINDOWS_EXPORT_ALL_SYMBOLS and accept the bloat, unless we drop Windows and just say that this "external components" feature is Linux-only (C++ custom operators can still be used on Windows).
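
For reference, the per-symbol approach mentioned above usually takes the form of a portable export macro like the following. This is a generic sketch, not MXNet code; the MY_LIB_API and MY_LIB_EXPORTS names are illustrative:

// A generic sketch of per-symbol exporting (illustrative names, not MXNet code).
// Only declarations tagged with MY_LIB_API become part of the library's ABI;
// everything else stays hidden, matching the Windows default.
#if defined(_WIN32) || defined(_WIN64)
  #ifdef MY_LIB_EXPORTS  // defined by the build system when building the DLL itself
    #define MY_LIB_API __declspec(dllexport)
  #else
    #define MY_LIB_API __declspec(dllimport)
  #endif
#else
  #define MY_LIB_API __attribute__((visibility("default")))
#endif

class MY_LIB_API OpRegistryHandle {
 public:
  static OpRegistryHandle* Get();  // hypothetical symbol an external op library would need
};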

@leezu
Contributor

leezu commented Aug 25, 2020

Being explicit about the exported / supported symbols and thus components is preferable, as otherwise it becomes unclear which symbols are tracked as part of semantic versioning and which aren't. In fact, instead of setting WINDOWS_EXPORT_ALL_SYMBOLS, the best practice is actually to make gcc follow the Windows default of hiding symbols by default and only exposing whitelisted symbols. You can refer to the CppCon talk: https://crascit.com/wp-content/uploads/2019/09/Deep-CMake-For-Library-Authors-Craig-Scott-CppCon-2019.pdf https://www.youtube.com/watch?v=m0DwB4OvDXk

Before proceeding here, I think we need to answer the question "What symbols do we need to make available in libmxnet.so for external operators?" @samskalicky, can you clarify which symbols are needed for the external operators you are interested in?

@samskalicky
Contributor Author

@samskalicky, can you clarify which symbols are needed for the external operators you are interested in?

Not with enough precision; anything that anybody uses to write an internal/backend operator could be anything in MXNet/NNVM/TVM/mshadow/etc., so if the goal of the feature is to enable any backend operator to be dynamically loaded, we have to export all symbols.

The biggest problem with individual symbol exposure is that many of the needed symbols aren't even defined in MXNet source code. One example is that the operator registration macro NNVM_REGISTER_OP resolves to something that calls ::dmlc::Registry<::nnvm::Op>::Get(), which is defined in 3rdparty/tvm/nnvm/src/core/op.cc. So unless we go and modify that 3rd-party code we're stuck.
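
To illustrate why that single symbol matters, here is a simplified sketch of the dmlc-style registry pattern that a macro like NNVM_REGISTER_OP expands into. The real implementation lives in dmlc-core and nnvm; all names below are illustrative:

// Simplified sketch of static registration through a registry singleton (illustrative only).
#include <string>
#include <unordered_map>

class Op {
 public:
  explicit Op(const std::string& n) : name(n) {}
  std::string name;
};

template <typename EntryType>
class Registry {
 public:
  // Get() is the symbol an external library needs at load time: if it is not
  // exported from libmxnet.so (or a libnnvm.so), static registration fails to link.
  static Registry* Get() {
    static Registry inst;
    return &inst;
  }
  EntryType& Register(const std::string& name) {
    return entries_.emplace(name, EntryType(name)).first->second;
  }
 private:
  std::unordered_map<std::string, EntryType> entries_;
};

// Roughly what a registration macro expands to: a file-scope static whose initializer
// runs during dlopen of the external library and calls into the registry singleton
// (the real macro generates names like __make_NnvmOp_min_ex0, as seen in the linker
// error earlier in this thread).
static Op& make_Op_min_ex = Registry<Op>::Get()->Register("min_ex");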

@samskalicky
Contributor Author

samskalicky commented Aug 26, 2020

Dear community members,

After further analysis and help from many people, I have decided that the best way forward is to put this work on hold while we spend some more time improving the C++ APIs of both MXNet and our dependencies (nnvm, dmlc-core). Once we have a stable, maintainable, versioned set of C++ APIs and a consistent process to build and link libmxnet.so with external libraries we'll revisit this proposal.

The major problem we ran into was that we couldn't agree to expose all symbols in libmxnet.so, so to be able to link a custom library with external operators in it we need to export only the specific symbols needed to build an operator (or other external components). And some of these symbols come from our dependencies like NNVM (i.e., NNVM_REGISTER_OP), so we will need to ensure these symbols are exported from those projects as well. This will mean building a libnnvm.so and linking it with libmxnet.so so that it can be more easily maintained. Custom libraries will then be able to link against libnnvm.so and libmxnet.so.

The other big problem was that in order to compile an external operator you really had to use the main MXNet CMakeLists.txt to generate all the defines/includes/etc. so that you could compile your library correctly. This was a huge pain; ideally we would have a small set of opaque includes (i.e., ones that don't further include all of MXNet's includes) to use instead.

So we'll work towards both of these goals (symbol handling and includes) to reconfigure MXNet and its dependencies to have a cleaner C++ API. Once we have that, it will be easier to dynamically load libraries with external ops.

In the meantime, some of the refactoring work will continue on in another PR #19016. I will close this PR for now. Thanks for all of your input on this proposal!

Sam

@samskalicky samskalicky changed the title [RFC] MXNet external operators MXNet external operators Aug 26, 2020
Labels
RFC Post requesting for comments

9 participants