Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-244] Work around likely compiler bug on nested inlines and temporary acces… #13535

Merged
merged 3 commits into from
Jan 7, 2019

Conversation

larroy
Copy link
Contributor

@larroy larroy commented Dec 4, 2018

…s to stream

Description

This fixes a segfault in ARMv7 on throwing a fatal error on the lapack undefined functions.

Addresses #13342
We don't crash as before, the test fails which is expected.

I think this is a bug in the compiler due to the temporary LogMessageFatal created on unwrapping the macro https://github.com/apache/incubator-mxnet/blob/master/src/operator/c_lapack_api.h#L369 and too many inlined functions. The segfault is inside the ostream in libc, could be related to the lifetime of the temporary variable with a reference to the stream which is a very corner case of temporary variable lifetime.

Having the functions not inlined solves the problem.

qemu@qemu:~/mxnet/build/tests$ ./mxnet_unit_tests --gtest_filter="inv_khatri_rao.OneInputMatrixTransposed"

Note: Google Test filter = inv_khatri_rao.OneInputMatrixTransposed
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from inv_khatri_rao
[ RUN      ] inv_khatri_rao.OneInputMatrixTransposed
unknown file: Failure
C++ exception with description "[13:27:04] /work/mxnet/src/operator/c_lapack_api.cc:42: MXNet build without lapack. Function dposv is not available.

Stack trace returned 10 entries:
[bt] (0) ./mxnet_unit_tests(dmlc::StackTrace()+0x40) [0x1291afc]
[bt] (1) ./mxnet_unit_tests(dmlc::LogMessageFatal::~LogMessageFatal()+0x40) [0x1291e2c]
[bt] (2) ./mxnet_unit_tests(mxnet_lapack_dposv(...)+0x88) [0x19466dc]
[bt] (3) ./mxnet_unit_tests(int MXNET_LAPACK_posv<double>(int, char, int, int, double*, int, double*, int)+0x58) [0x13d7070]
[bt] (4) ./mxnet_unit_tests(void mxnet::op::inv_khatri_rao<double>(mshadow::Tensor<mshadow::cpu, 2, double>, std::vector<mshadow::Tensor<mshadow::cpu, 2, double>, std::allocator<mshadow::Tensor<mshadow::cpu, 2, double> > > const&, bool)+0x394) [0x13da000]
[bt] (5) ./mxnet_unit_tests(mxnet::op::inv_khatri_rao_OneInputMatrixTransposed_Test::TestBody()+0x238) [0x13cf0e8]
[bt] (6) ./mxnet_unit_tests(void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x74) [0x1491b50]
[bt] (7) ./mxnet_unit_tests(void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x4c) [0x148bb10]
[bt] (8) ./mxnet_unit_tests(testing::Test::Run()+0xe8) [0x146d9f8]
[bt] (9) ./mxnet_unit_tests(testing::TestInfo::Run()+0x138) [0x146e310]

" thrown in the test body.
[  FAILED  ] inv_khatri_rao.OneInputMatrixTransposed (70 ms)
[----------] 1 test from inv_khatri_rao (74 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (90 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] inv_khatri_rao.OneInputMatrixTransposed

 1 FAILED TEST

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@larroy larroy requested a review from anirudh2290 as a code owner December 4, 2018 18:33
@larroy larroy changed the title Work around likely compiler bug on nested inlines and temporary acces… [MXNET-244] Work around likely compiler bug on nested inlines and temporary acces… Dec 4, 2018
@larroy larroy force-pushed the armv7_undef_lapack_segfault branch from 7ceac60 to c4af669 Compare December 4, 2018 18:39
Copy link
Contributor

@jlcontreras jlcontreras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these files in the src/operator/ dir? (Seems strange to me, as the rest of the files in there belong to operators)

@larroy
Copy link
Contributor Author

larroy commented Dec 6, 2018

@jlcontreras good point, but that was there already, I think its because lapack is used in some operators. Do you have a better suggestion?

@nswamy nswamy added the pr-awaiting-review PR is waiting for code review label Dec 7, 2018
@jlcontreras
Copy link
Contributor

Difficult to say, maybe common/ ?

@larroy
Copy link
Contributor Author

larroy commented Dec 7, 2018

@jlcontreras do you think it makes sense to keep it in a different PR?

@@ -375,7 +364,10 @@ inline void flip(int m, int n, DType *b, int ldb, DType *a, int lda) {

MXNET_LAPACK_CWRAPPER3(ssyevd, float)
MXNET_LAPACK_CWRAPPER3(dsyevd, double)

#undef MXNET_LAPACK_CWRAPPER1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this is needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to avoid being defined again in the implementation

@@ -325,41 +325,30 @@ inline void flip(int m, int n, DType *b, int ldb, DType *a, int lda) {
#else

// use pragma message instead of warning
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why commenting out? Maybe it should go to c_lapack_api.cc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point

@larroy larroy force-pushed the armv7_undef_lapack_segfault branch from c4af669 to 2f7d5d0 Compare December 7, 2018 18:03
@asmushetzel
Copy link
Contributor

LGTM. Don't think that it makes sense to move the c_lapack_api to a different directory now. There are pros and cons about where it should sit, but no immediate benefit from moving.

@larroy
Copy link
Contributor Author

larroy commented Dec 11, 2018

@szha @zhreshold

@larroy
Copy link
Contributor Author

larroy commented Dec 11, 2018

@mxnet-label-bot [pr-awaiting-merge]

@larroy larroy force-pushed the armv7_undef_lapack_segfault branch 2 times, most recently from 1e5df93 to 609595a Compare December 17, 2018 14:15
@KellenSunderland
Copy link
Contributor

Any idea why the CI failed? Looks like a timeout right?

@larroy larroy force-pushed the armv7_undef_lapack_segfault branch from 609595a to 69adf38 Compare January 2, 2019 14:19
@larroy
Copy link
Contributor Author

larroy commented Jan 3, 2019

@KellenSunderland I guess we can merge now. Thank you.

@larroy
Copy link
Contributor Author

larroy commented Jan 7, 2019

@szha @KellenSunderland @tqchen could we please merge this? it's needed for fixing crashes on some scenarios. Thank you.

@szha szha merged commit 6dae0bf into apache:master Jan 7, 2019
rondogency pushed a commit to rondogency/incubator-mxnet that referenced this pull request Jan 9, 2019
…porary acces… (apache#13535)

* Work around likely compiler bug on nested inlines and temporary access to stream

* Don't compile khatri_rao tests if we don't have LAPACK

* Address CR comment
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
…porary acces… (apache#13535)

* Work around likely compiler bug on nested inlines and temporary access to stream

* Don't compile khatri_rao tests if we don't have LAPACK

* Address CR comment
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants