This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MXNet Extensions enhancements2 #19016

Merged
samskalicky merged 51 commits into apache:master on Sep 1, 2020

Conversation

samskalicky
Contributor

@samskalicky commented Aug 26, 2020

Description

This PR contains a few enhancements for MXNet extensions:

  • Refactors MXLoadLib to return the handle to the loaded library; the user (or language binding) is expected to close the library later.
  • Refactors lib_api.h by moving the function definitions into lib_api.cc to simplify building extensions.
  • Refactors the relu_lib example by splitting it into multiple files, separating CUDA code from C++ code.
  • Updates all the examples in example/extensions to use the new lib_api.cc build flow, and replaces the Symbol tests with Gluon tests (now that the Symbol API is deprecated).

Other changes

Improves the instructions in config/linux_gpu.cmake for building for specific GPU architectures.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
      • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
      • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
      • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
      • For user-facing API changes, API doc string has been updated.
      • For new C++ functions in header files, their functionalities and arguments are documented.
      • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
      • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@mxnet-bot

Hey @samskalicky , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [edge, clang, centos-cpu, sanity, unix-cpu, windows-gpu, website, windows-cpu, unix-gpu, centos-gpu, miscellaneous]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@samskalicky
Contributor Author

@leezu for review of the cmake config file changes and @mseth10 and @rondogency for review of everything else

config/linux_gpu.cmake (outdated, resolved)
Comment on lines +703 to +708
add_library(customop_lib SHARED ${CMAKE_CURRENT_SOURCE_DIR}/example/extensions/lib_custom_op/gemm_lib.cc ${CMAKE_CURRENT_SOURCE_DIR}/src/lib_api.cc)
add_library(transposecsr_lib SHARED ${CMAKE_CURRENT_SOURCE_DIR}/example/extensions/lib_custom_op/transposecsr_lib.cc ${CMAKE_CURRENT_SOURCE_DIR}/src/lib_api.cc)
add_library(transposerowsp_lib SHARED ${CMAKE_CURRENT_SOURCE_DIR}/example/extensions/lib_custom_op/transposerowsp_lib.cc ${CMAKE_CURRENT_SOURCE_DIR}/src/lib_api.cc)
add_library(subgraph_lib SHARED ${CMAKE_CURRENT_SOURCE_DIR}/example/extensions/lib_subgraph/subgraph_lib.cc ${CMAKE_CURRENT_SOURCE_DIR}/src/lib_api.cc)
add_library(pass_lib SHARED ${CMAKE_CURRENT_SOURCE_DIR}/example/extensions/lib_pass/pass_lib.cc ${CMAKE_CURRENT_SOURCE_DIR}/src/lib_api.cc)

Contributor Author

Offline discussion with @leezu: in another PR we should move this build code into a CMakeLists.txt for each example, include it with add_subdirectory, and replace the current Makefiles so there's only one set of build steps for each example.

-LibraryInitializer::~LibraryInitializer() {
-  close_open_libs();
-}
+LibraryInitializer::~LibraryInitializer() = default;
Contributor

Can we add an assertion that checks whether all handles have been closed during shutdown of MXNet? That would help catch a possible leak.

Contributor Author

This was removed because there are cases (external ops, for example) where an object is registered in an MXNet data structure such as the operator registry. During shutdown MXNet attempts to destruct that object, but it points to an object in the loaded library, which ended up causing a segfault. By not closing open handles, we let loaded libraries live longer than libmxnet.so and allow it to shut down cleanly.

This is why we changed MXLoadLib to return the handle to the library and call dlclose on the handle in Python.

However, this still isn't an issue, since on process exit the OS closes the library anyway when it cleans up.

Contributor

Right, but hence an assertion or some other debug log to make the user aware that there are still unclosed handles. The idea is to give a hint during shutdown. If somebody then sees hundreds of unclosed handles, that could be a strong indicator of something being wrong - I'm not talking about automatic closing of the handle, just a message

Contributor Author

We can't check whether handles are closed while libmxnet.so is closing. It would have to be checked elsewhere so the loaded libraries can live during libmxnet.so shutdown. If we put in a check, it would print a message to the user every time, since the expectation is that the handles are closed after libmxnet.so.

We're not talking about hundreds of handles, we're talking about one or maybe two libraries loaded by the user explicitly.

Contributor

Okay, I was not aware that the goal was that these libraries have a longer lifecycle than libmxnet.so itself. Thanks for elaborating

Contributor

@mseth10 left a comment

LGTM

.setBackward(backwardGPU, "gpu");

MXReturnValue initialize(int version) {
if (version >= 20000) {
Contributor

Are you sure about this? gemm_lib is still at 10700.

Contributor Author

Yes, we should update all the examples to 20000 on master. I'll do that in the next PR.

Contributor

discussed offline and we will change example corresponding to master
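The values 10700 and 20000 in this thread can be read as packed version numbers. The sketch below assumes the encoding is major * 10000 + minor * 100 + patch, which is consistent with both values but is not stated in the thread; the function names are hypothetical and only mirror the `version >= 20000` check in the reviewed code.

```python
def decode_version(version: int) -> tuple:
    """Split a packed version integer into (major, minor, patch).

    Assumes the packing major * 10000 + minor * 100 + patch.
    """
    major, rest = divmod(version, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def meets_minimum(version: int, minimum: int = 20000) -> bool:
    """Mirror of the `version >= 20000` check in the extension's initialize()."""
    return version >= minimum


if __name__ == "__main__":
    for v in (10700, 20000):
        print(v, decode_version(v), meets_minimum(v))
```

Under that assumption, an example that checks for 20000 would refuse to load into a 1.7 build, which is why the examples on master and on the release branches need different thresholds.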

@@ -54,25 +54,25 @@
print("indices:", c.indices.asnumpy())
print("indptr:", c.indptr.asnumpy())

-print("--------start symbolic compute--------")
+print("--------start Gluon compute--------")
d = mx.sym.Variable('d')
Contributor

should this sym also get deprecated then?

Contributor Author

No, we still use Symbol class, but we removed all the usages like bind

        self.handle = handle
    def __del__(self):
        libdl = ctypes.CDLL("libdl.so")
        libdl.dlclose(self.handle)
Contributor

so it is to elaborate libdl.so lives longer than libmxnet.so?

Contributor Author

This is how we close the library we loaded before: by using libdl.so to call dlclose.
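A minimal sketch of the pattern described here, for illustration only and not the code merged in this PR. The `LibraryHandle` class and `open_main_program` helper are hypothetical names. The sketch assumes a POSIX system, looks up `dlclose` through `ctypes.CDLL(None)` rather than the "libdl.so" file name, and uses `dlopen(NULL)` as a stand-in for the handle that MXLoadLib would return. Declaring `argtypes` as `c_void_p` matters because ctypes otherwise treats integer arguments as C `int`, which may not carry a 64-bit handle correctly.

```python
import ctypes
import os


class LibraryHandle:
    """Hypothetical owner of a dlopen() handle that closes it at most once."""

    def __init__(self, handle):
        self.handle = handle
        self.closed = False

    def close(self):
        """Call dlclose() on the handle; return its status, or None if already closed."""
        if self.closed:
            return None
        # CDLL(None) exposes symbols already loaded into the process
        # (assumption: POSIX), so no library file name has to be resolved.
        libdl = ctypes.CDLL(None)
        libdl.dlclose.argtypes = [ctypes.c_void_p]
        libdl.dlclose.restype = ctypes.c_int
        status = libdl.dlclose(self.handle)
        self.closed = True
        return status

    def __del__(self):
        # Best effort: modules may be partially torn down at interpreter exit.
        try:
            self.close()
        except Exception:
            pass


def open_main_program():
    """Stand-in for MXLoadLib: dlopen(NULL) returns a handle for the running program."""
    libdl = ctypes.CDLL(None)
    libdl.dlopen.argtypes = [ctypes.c_char_p, ctypes.c_int]
    libdl.dlopen.restype = ctypes.c_void_p
    return libdl.dlopen(None, os.RTLD_NOW)


if __name__ == "__main__":
    lib = LibraryHandle(open_main_program())
    print("dlclose status:", lib.close())
```

Keeping the close in a Python object ties the library's lifetime to the interpreter rather than to libmxnet.so, which matches the ordering discussed earlier in the review.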

/*!
* Copyright (c) 2019 by Contributors
* \file lib_api.cc
* \brief APIs to interact with libraries
Contributor

can we say "extension API to load dynamic loaded custom libraries", and also say it will depend on lib_api.h (2-stop file instead of 1-stop header file)

Contributor Author

Sure, let's work on what the new text should be and update it in another PR.

@samskalicky merged commit 8379740 into apache:master on Sep 1, 2020
@samskalicky mentioned this pull request on Sep 1, 2020
samskalicky added a commit that referenced this pull request Sep 3, 2020
* initial commit

* fixed c++17 downgrade

* fixed stringstream

* fixed cast

* changed to use pointers for stringstream since not copyable

* fixed includes

* fixed makefile includes

* skipped lint for malloc/free for passing across C ABI

Co-authored-by: Ubuntu <[email protected]>
@ZiyueHuang
Member

Hi @samskalicky , it seems that now we need src/lib_api.cc to build the custom operator, how can a user access src/lib_api.cc if using pip install mxnet?

@samskalicky
Contributor Author

samskalicky commented Oct 13, 2020

> Hi @samskalicky , it seems that now we need src/lib_api.cc to build the custom operator, how can a user access src/lib_api.cc if using pip install mxnet?

Hi @ZiyueHuang, the lib_api.cc file can be downloaded directly from GitHub:
https://raw.githubusercontent.com/apache/incubator-mxnet/1.8.0.rc1/src/lib_api.cc
or
https://raw.githubusercontent.com/apache/incubator-mxnet/master/src/lib_api.cc
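As a rough sketch, the two URLs above follow one pattern that can be built from a git ref (tag or branch). The helper names below are hypothetical, and whether a given ref still resolves on the archived repository is an assumption to verify before relying on it.

```python
from urllib.request import urlretrieve

# Base of the raw-file URLs quoted in the comment above.
RAW_BASE = "https://raw.githubusercontent.com/apache/incubator-mxnet"


def lib_api_url(ref: str) -> str:
    """Return the raw GitHub URL of src/lib_api.cc for a tag or branch name."""
    return f"{RAW_BASE}/{ref}/src/lib_api.cc"


def download_lib_api(ref: str, dest: str = "lib_api.cc") -> str:
    """Download lib_api.cc for the given ref (needs network access); return the local path."""
    path, _headers = urlretrieve(lib_api_url(ref), dest)
    return path


if __name__ == "__main__":
    print(lib_api_url("1.8.0.rc1"))
    print(lib_api_url("master"))
```

The ref should match the installed MXNet version so that lib_api.cc agrees with the lib_api.h header shipped in the wheel.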

@ZiyueHuang
Member

If the user installs mxnet via pip, only the header files in mxnet/include and the Python files in mxnet/python are downloaded. Then, to use a custom operator, the user must additionally download the lib_api.cc of the specific mxnet version. I think it would be more convenient if the custom operator depended only on the header file.

@samskalicky
Contributor Author

> If the user install mxnet via pip, only the header files in mxnet/include and the python files in mxnet/python are downloaded. Then if the user wants to use the custom operator, the user must additionally download lib_api.cc of the specific mxnet version. I think it is more convenient if the custom operator only depends on the header file.

Hi @ZiyueHuang, thanks for that feedback. Initially I was thinking the same way; we only had the lib_api.h file for MXNet versions 1.6 and 1.7. But in 1.8 we realized that the single lib_api.h approach severely limited how users could structure their custom library. Because lib_api.h at that time included function definitions, not just declarations, it caused duplicate symbol errors if the file was included in multiple places.

So in 1.8 we moved the definitions from lib_api.h into lib_api.cc, so that the header file contains only declarations and all definitions are in lib_api.cc. This gives users more flexibility to organize their code across multiple files.

The pip wheel is organized as the easy path to install in order to run Python programs using MXNet, but any C/C++ compilation for MXNet requires cloning the whole repo. Building a library for custom ops only requires the lib_api.cc/h files, so it's an improvement in that direction. Maybe we could consider bundling the lib_api.cc file in the pip wheel too.

@samskalicky
Contributor Author

@ZiyueHuang in #19393 I propose adding the lib_api.cc file to the pip wheel so that extension libraries can be built easily without downloading from GitHub: 3d28436
Will you review/approve that PR?
