This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[CI][NightlyTestsForBinaries] Test Large Tensor: GPU Failing #14981

Open
perdasilva opened this issue May 17, 2019 · 6 comments · Fixed by #17450

@perdasilva
Contributor

perdasilva commented May 17, 2019

Description

The "Test Large Tensor: GPU" step is failing with:

======================================================================
ERROR: test_large_array.test_ndarray_random_randint
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/nightly/test_large_array.py", line 70, in test_ndarray_random_randint
    assert a.__gt__(low) & a.__lt__(high)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 336, in __gt__
    return greater(self, other)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 3376, in greater
    _internal._lesser_scalar)
  File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 2704, in _ufunc_helper
    return fn_array(lhs, rhs)
  File "<string>", line 46, in broadcast_greater
  File "/work/mxnet/python/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/work/mxnet/python/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [06:39:26] /work/mxnet/src/io/../operator/elemwise_op_common.h:135: Check failed: assign(&dattr, vec.at(i)): Incompatible attr in node  at 1-th input: expected int32, got int64
Stack trace:
  [bt] (0) /work/mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x3c) [0x7fa0e59e8b3c]
  [bt] (1) /work/mxnet/python/mxnet/../../build/libmxnet.so(bool mxnet::op::ElemwiseAttr<int, &mxnet::op::type_is_none, &mxnet::op::type_assign, true, &mxnet::op::type_string[abi:cxx11], -1l, -1l>(nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*, int const&)::{lambda(std::vector<int, std::allocator<int> > const&, unsigned long, char const*)#1}::operator()(std::vector<int, std::allocator<int> > const&, unsigned long, char const*) const+0x62d) [0x7fa0e8c6866d]
  [bt] (2) /work/mxnet/python/mxnet/../../build/libmxnet.so(bool mxnet::op::ElemwiseAttr<int, &mxnet::op::type_is_none, &mxnet::op::type_assign, true, &mxnet::op::type_string[abi:cxx11], -1l, -1l>(nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*, int const&)+0x2f3) [0x7fa0e8f963a3]
  [bt] (3) /work/mxnet/python/mxnet/../../build/libmxnet.so(bool mxnet::op::ElemwiseType<2l, 1l>(nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*)+0x34d) [0x7fa0e8f968ed]
  [bt] (4) /work/mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<bool (nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*), bool (*)(nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*, std::vector<int, std::allocator<int> >*)>::_M_invoke(std::_Any_data const&, nnvm::NodeAttrs const&, std::vector<int, std::allocator<int> >*&&, std::vector<int, std::allocator<int> >*&&)+0x1d) [0x7fa0e8bb909d]
  [bt] (5) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::imperative::SetShapeType(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, mxnet::DispatchMode*)+0x6a5) [0x7fa0e8c28e35]
  [bt] (6) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x10b) [0x7fa0e8c0f52b]
  [bt] (7) /work/mxnet/python/mxnet/../../build/libmxnet.so(MXImperativeInvokeImpl(void*, int, void**, int*, void***, int, char const**, char const**)+0x1c9) [0x7fa0e8a8a479]
  [bt] (8) /work/mxnet/python/mxnet/../../build/libmxnet.so(MXImperativeInvokeEx+0x8f) [0x7fa0e8a8a97f]


-------------------- >> begin captured logging << --------------------
tests.python.unittest.common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2073509752 to reproduce.
--------------------- >> end captured logging << ---------------------

See http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/312/pipeline/144 for the full log.
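
For readers unfamiliar with this class of error, here is a minimal sketch of the kind of check that produces the "expected int32, got int64" message. It is a simplified pure-Python model, not MXNet's actual ElemwiseAttr implementation; the names `DTypeMismatch` and `infer_elemwise_dtype` are hypothetical. It assumes, as the message suggests, that the first operand was int32 and the second int64.

```python
# Simplified model of an element-wise dtype check.
# NOT MXNet's code: DTypeMismatch and infer_elemwise_dtype are hypothetical names.


class DTypeMismatch(Exception):
    """Raised when the inputs of an element-wise op disagree on dtype."""


def infer_elemwise_dtype(input_dtypes):
    """Return the single dtype shared by all inputs, or raise on a mismatch.

    The first known dtype becomes the expected one; every later input
    must match it. None stands for a dtype that is not known yet.
    """
    expected = None
    for i, dtype in enumerate(input_dtypes):
        if dtype is None:
            continue  # unknown dtype: nothing to compare yet
        if expected is None:
            expected = dtype
        elif dtype != expected:
            raise DTypeMismatch(
                "Incompatible attr at %d-th input: expected %s, got %s"
                % (i, expected, dtype)
            )
    return expected


# Assumed operand dtypes for the failing comparison (int32 array vs int64 array).
try:
    infer_elemwise_dtype(["int32", "int64"])
except DTypeMismatch as err:
    print(err)
```

Under that assumption, casting both operands to the same dtype before the comparison would make such a check pass.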

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Test, CI

@vdantu
Contributor

vdantu commented May 19, 2019

@mxnet-label-bot add [test]
@apeforest

@roywei
Member

roywei commented May 21, 2019

@roywei
Member

roywei commented Jun 4, 2019

Actually, we can't close it yet: this test was fixed but went back to failing after #15059. There is a similar OOM issue in #14980.

@roywei
Member

roywei commented Jun 4, 2019

Currently, both the CPU and GPU tests are disabled due to the same memory issue. After a discussion with @access2rohit and @apeforest, we can try a few things:

  1. Change to P3 instances here: https://github.com/apache/incubator-mxnet/blob/master/tests/nightly/JenkinsfileForBinaries#L82
  2. Further increase shared memory to 50G.
  3. Stop running the large tensor test in parallel with other tests.

We are having problems testing the above solutions on CI machines that have multiple jobs running in parallel.
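
To get a feel for the memory involved, here is a rough back-of-the-envelope sketch. The shape below is hypothetical (the thread does not state the tensor sizes used by the test) and was chosen only so that the element count exceeds 2**32; the helper name `tensor_bytes` is also made up for illustration.

```python
# Rough memory estimate for a dense tensor; shape and helper name are hypothetical.

# Bytes per element for a few common dtypes.
ITEMSIZE = {"int32": 4, "int64": 8, "float32": 4, "float64": 8}


def tensor_bytes(shape, dtype):
    """Return the number of bytes a dense tensor of this shape and dtype occupies."""
    count = 1
    for dim in shape:
        count *= dim
    return count * ITEMSIZE[dtype]


# Hypothetical large-tensor shape: 5e9 elements, which is more than 2**32.
shape = (100_000_000, 50)

for dtype in ("int32", "int64"):
    size = tensor_bytes(shape, dtype)
    print("%s: %.1f GiB" % (dtype, size / 2**30))
```

Each intermediate result of the same shape (for example, the output of a comparison) needs its own buffer, so peak usage can be a multiple of a single tensor's size. That may be part of why running alongside other jobs makes the failure harder to pin down.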

@roywei
Member

roywei commented Jun 6, 2019

It still failed with 200G shared memory on P3.2x; we need another approach for testing large tensors.
