Belos test failing on ATDM waterman builds #3338

fryeguy52 · 2018-08-22T14:50:27Z

CC: @trilinos/belos , @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe

Next Action Status

PR #3363 was merged to 'develop' on 8/27/2018, and the test Belos_Tpetra_MVOPTester_complex_test_MPI_4 passed on 8/28/2018, but the test Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4 is still failing. PR #3454, merged on 9/19/2018, disabled this test in these waterman builds in commit cb9a9c9, and the test disappeared from the trilinos-atdm-waterman-*debug* builds after 2018-09-21 as shown here. Next: Fix these?

Description

As shown in this query the tests:

Belos_Tpetra_MVOPTester_complex_test_MPI_4
Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4

are failing in some of the Trilinos-atdm-waterman-* builds.

The test Belos_Tpetra_MVOPTester_complex_test_MPI_4 is failing on the two cuda-9.2 builds:

Trilinos-atdm-waterman-cuda-9.2-opt
Trilinos-atdm-waterman-cuda-9.2-debug

The test Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4 is failing on the two debug builds:

Trilinos-atdm-waterman-gnu-debug-openmp
Trilinos-atdm-waterman-cuda-9.2-debug

Steps to Reproduce

One should be able to reproduce this failure on the machine waterman as described in:

https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

More specifically, the commands given for the system waterman are provided at:

https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#waterman

The exact commands to reproduce this issue should be:


$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
  $TRILINOS_DIR

$ make NP=20

$ bsub -x -Is -n 20 ctest -j20

srajama1 · 2018-08-22T16:40:29Z

@hkthorn : Do you have time to take a look ?

hkthorn · 2018-08-22T17:45:31Z

@srajama1 : I will try to take a look by the end of the week.

hkthorn · 2018-08-28T15:54:52Z

@fryeguy52 @bartlettroscoe Please let me know if this is still an issue after the OpenMPI version is changed for the ATDM waterman testing.

bartlettroscoe · 2018-08-28T18:09:46Z

After the merge of PR #3363 last night, the test Belos_Tpetra_MVOPTester_complex_test_MPI_4 is no longer failing in any of the 4 'waterman' builds as shown here but the test Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4 is still failing.

hkthorn · 2018-10-02T18:51:39Z

Interestingly enough, the seg fault is happening in the condition estimation code. A quick grep of Belos shows that this test is the only test that includes, by default, the computation of a condition estimate:

grep -r 'Estimate Condition Number'
src/BelosPseudoBlockCGSolMgr.hpp: \note Only works if "Estimate Condition Number" is set on parameterlist
src/BelosPseudoBlockCGSolMgr.hpp: if (params->isParameter ("Estimate Condition Number")) {
src/BelosPseudoBlockCGSolMgr.hpp: genCondEst_ = params->get ("Estimate Condition Number", genCondEst_default_);
src/BelosPseudoBlockCGSolMgr.hpp: pl->set("Estimate Condition Number", static_cast<bool>(genCondEst_default_),
tpetra/test/BlockCG/test_pseudo_bl_cg_hb.cpp: belosList->set("Estimate Condition Number", true);

So, that narrows down the search. If condition estimation is turned off in this test, it passes without seg fault.

hkthorn · 2018-10-02T18:58:56Z

FYI, if I enable this option in the Epetra pseudo-block CG test, it will likewise seg fault on this platform.

mhoemmen · 2018-10-02T19:42:57Z

@hkthorn Does condition number estimation use Ifpack? That was always a bit funny.

hkthorn · 2018-10-02T20:12:12Z

@mhoemmen Nope, it directly calls LAPACK's _STEQR to compute the eigenvalues of the symmetric matrix. It's the call to LAPACK that is causing the issue. I have hijacked the call to _STEQR within the BlockCG solver to feed it two simple, small vectors that represent a diagonal matrix, and the call to LAPACK still seg faults.

hkthorn · 2018-10-02T20:13:41Z

@mhoemmen I did find that the STEQR templated wrapper was incorrectly defined in Teuchos_LAPACK.hpp, but the specialization for each of the four primary scalar types is correct. So, while that should be corrected, it is not the problem.

bartlettroscoe · 2018-10-02T20:17:39Z

@mhoemmen, @hkthorn,

FYI: We have been having problems with LAPACK on these IBM Power machines for some time. See #1208, #2454, #2410, etc.

It would be great to find someone to get to the bottom of #2410 especially and to beef up the acceptance tests for these functions. @jwillenbring, is this something Tech-X contractors could do?

mhoemmen · 2018-10-02T20:41:45Z

@hkthorn wrote:

I did find that the STEQR templated wrapper was incorrectly defined in Teuchos_LAPACK.hpp

How should it be defined?

hkthorn · 2018-10-02T21:08:06Z

@mhoemmen There are some arguments that are MagnitudeType, not ScalarType. The work vector and the diagonal and off diagonal should be MagnitudeType. I can change that, no problem, but that's not the issue. Actually, the Teuchos LAPACK test has been disabled on this platform:

ride/tweaks/CUDA-9.2-DEBUG-CUDA-POWER8-KEPLER37.cmake:ATDM_SET_ENABLE(TeuchosNumerics_LAPACK_test_MPI_1_DISABLE ON)
ride/tweaks/GNU-DEBUG-OPENMP-POWER8.cmake:ATDM_SET_ENABLE(TeuchosNumerics_LAPACK_test_MPI_1_DISABLE ON)

This test fails exactly because the call to _STEQR seg faults within the test. Thus, I would say that there is an issue with the LAPACK library on this machine, as @bartlettroscoe has mentioned. So, that means that condition estimate computations performed in Belos pseudo-block CG will not succeed on this platform until that is resolved.
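To make the signature point above concrete, here is a minimal sketch of which STEQR-style arguments are real-valued even for complex scalars. It follows the standard LAPACK xSTEQR argument list (D, E and WORK are real, Z has the scalar type). The names `MagnitudeOf` and `SteqrArgs` are hypothetical stand-ins invented for this illustration; they are not the Teuchos API, and this is not the actual fix made in Teuchos_LAPACK.hpp.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a "magnitude type" trait: the real type
// underlying a scalar (double for double, double for complex<double>).
template <typename Scalar>
struct MagnitudeOf { using type = Scalar; };

template <typename Real>
struct MagnitudeOf<std::complex<Real>> { using type = Real; };

// Storage for an xSTEQR-style call on an n-by-n symmetric tridiagonal
// matrix. Only the eigenvector matrix z uses the scalar type; the
// diagonal, off-diagonal and workspace are always real.
template <typename Scalar>
struct SteqrArgs {
  using Magnitude = typename MagnitudeOf<Scalar>::type;

  std::vector<Magnitude> d;     // diagonal, length n
  std::vector<Magnitude> e;     // off-diagonal, length n-1
  std::vector<Scalar> z;        // eigenvectors, n*n entries
  std::vector<Magnitude> work;  // workspace, length max(1, 2*(n-1))

  explicit SteqrArgs(std::size_t n)
      : d(n), e(n - 1), z(n * n),
        work(std::max<std::size_t>(1, 2 * (n - 1))) {
    assert(n > 0);  // this sketch assumes a non-empty matrix
  }
};
```

For example, `SteqrArgs<std::complex<double>> args(4);` gives `d`, `e` and `work` as vectors of `double` (sizes 4, 3 and 6) and `z` as 16 `std::complex<double>` entries.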

bartlettroscoe · 2018-10-02T21:15:41Z

@hkthorn, issues with the NetLIB LAPACK are also reported in #3542. We might need to go back to the beginning and more carefully look at these installations of BLAS and LAPACK and get the LAPACK tests in Trilinos to work (see #2410).

hkthorn · 2018-10-04T18:52:17Z

Interestingly enough, in trying to debug this issue it has become apparent that the person who implemented condition estimation for the CG classes in Belos did not finish the implementation for the single-vector BelosCGIter class. For the Tpetra test, where there is one right-hand side, the condition number comes back as NaN because the eigenvalues are all zeros.
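As an illustration of the NaN described above, here is a self-contained sketch of a CG-based condition estimate. It uses the textbook CG/Lanczos relation to build a symmetric tridiagonal matrix from the CG coefficients and takes the ratio of its extreme eigenvalues. Everything in it is an assumption for illustration: the function names are made up, the matrix is a toy diagonal one, and a Sturm-sequence bisection stands in for LAPACK's _STEQR so that the example needs no LAPACK. It is not the Belos implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Symmetric tridiagonal matrix: n diagonal entries, n-1 off-diagonal entries.
struct Tridiag { std::vector<double> diag, off; };

// Run `steps` iterations of unpreconditioned CG on A = diag(a), x0 = 0,
// recording alpha_k and beta_k = (r_{k+1}, r_{k+1}) / (r_k, r_k).
void cgCoefficients(const std::vector<double>& a, const std::vector<double>& b,
                    std::size_t steps, std::vector<double>& alpha,
                    std::vector<double>& beta) {
  std::vector<double> r = b, p = b;
  double rr = 0.0;
  for (double v : r) rr += v * v;
  for (std::size_t k = 0; k < steps; ++k) {
    double pAp = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) pAp += a[i] * p[i] * p[i];
    const double al = rr / pAp;
    double rrNew = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
      r[i] -= al * a[i] * p[i];
      rrNew += r[i] * r[i];
    }
    const double be = rrNew / rr;
    for (std::size_t i = 0; i < a.size(); ++i) p[i] = r[i] + be * p[i];
    alpha.push_back(al);
    beta.push_back(be);
    rr = rrNew;
  }
}

// Lanczos tridiagonal from CG coefficients:
//   T(0,0) = 1/alpha_0,  T(j,j) = 1/alpha_j + beta_{j-1}/alpha_{j-1},
//   T(j,j+1) = sqrt(beta_j)/alpha_j.
Tridiag lanczosFromCG(const std::vector<double>& alpha,
                      const std::vector<double>& beta) {
  Tridiag T;
  for (std::size_t j = 0; j < alpha.size(); ++j) {
    double d = 1.0 / alpha[j];
    if (j > 0) d += beta[j - 1] / alpha[j - 1];
    T.diag.push_back(d);
    if (j + 1 < alpha.size()) T.off.push_back(std::sqrt(beta[j]) / alpha[j]);
  }
  return T;
}

// Sturm count: number of eigenvalues of T below x.
int countBelow(const Tridiag& T, double x) {
  int count = 0;
  double q = 1.0;
  for (std::size_t i = 0; i < T.diag.size(); ++i) {
    const double e2 = (i == 0) ? 0.0 : T.off[i - 1] * T.off[i - 1];
    q = T.diag[i] - x - e2 / q;
    if (q == 0.0) q = -1e-300;  // avoid dividing by an exact zero pivot
    if (q < 0.0) ++count;
  }
  return count;
}

// k-th smallest eigenvalue (k = 1..n) by bisection on Gershgorin bounds.
double kthEigenvalue(const Tridiag& T, int k) {
  const std::size_t n = T.diag.size();
  double lo = T.diag[0], hi = T.diag[0];
  for (std::size_t i = 0; i < n; ++i) {
    double radius = 0.0;
    if (i > 0) radius += std::fabs(T.off[i - 1]);
    if (i + 1 < n) radius += std::fabs(T.off[i]);
    lo = std::min(lo, T.diag[i] - radius);
    hi = std::max(hi, T.diag[i] + radius);
  }
  for (int it = 0; it < 200; ++it) {
    const double mid = 0.5 * (lo + hi);
    if (countBelow(T, mid) >= k) hi = mid; else lo = mid;
  }
  return 0.5 * (lo + hi);
}

// Ratio of extreme eigenvalues. No guard on purpose: an all-zero T
// yields 0.0 / 0.0, which is NaN, matching the symptom described above.
double conditionEstimate(const Tridiag& T) {
  const int n = static_cast<int>(T.diag.size());
  return kthEigenvalue(T, n) / kthEigenvalue(T, 1);
}
```

Running three CG steps on A = diag(1, 2, 4) with b = (1, 1, 1) and passing the recorded coefficients through `lanczosFromCG` gives an estimate close to 4, the true condition number. If the diagonal and off-diagonal storage is never filled in and stays at zero, the same function returns NaN.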

hkthorn · 2018-10-04T19:02:44Z

Oh, yeah, and the condition number computation can easily be wrong for multiple right-hand sides due to failures in the logic.

bartlettroscoe · 2018-10-15T23:47:13Z

PR #3454, merged on 9/19/2018, disabled this test in these waterman builds in commit cb9a9c9, and the test disappeared from the trilinos-atdm-waterman-*debug* builds after 2018-09-21 as shown here.

I am marking this with "Disabled Tests" and leaving it open as per policy/process.

The issue trilinos#3338 is the failure of the Tpetra pseudo-block CG test on an important test platform. There are two issues: first, the _STEQR routine provided by the LAPACK libraries on that platform seg faults and, second, the condition estimate computation performed by single-vector/pseudo-block CG is generally wrong. The condition estimate computation is now implemented for the single-vector CG kernel, where it was missing before. Furthermore, it is corrected for the pseudo-block CG kernel, where it was wrongly storing information when there was more than one right-hand side. Tests have been fixed for the Tpetra pseudo-block CG, so that they output information, including the condition estimate. Tests have been augmented to perform condition estimation for Epetra and to perform an unpreconditioned pseudo-block CG solve using Epetra.

Fixes condition estimate computation to address issue #3338

hkthorn · 2018-10-19T21:40:30Z

This issue is fixed per the merged commit #3658, so I will re-enable the tests on waterman. Note that the commit that disabled this test in the waterman scripts says it was disabled because of #2466. That issue is related, but separate from the seg fault that was being observed on waterman.

bartlettroscoe · 2018-10-19T21:42:01Z

This issue is fixed per the merged commit #3658, so I will re-enable the tests on waterman.

@hkthorn, okay thanks. We just need to keep this issue open until we get confirmation that the test is passing on CDash.

hkthorn · 2018-10-19T21:42:43Z

@bartlettroscoe No problem.

The test failures for pseudo-block CG mostly stemmed from an incomplete implementation of the condition estimation code, which was only tested for Tpetra. This has been fixed, per issue trilinos#3338, so the tests should be enabled again.

Re-enable Belos pseudo-block CG testing per issue #3338

hkthorn · 2018-11-27T00:27:16Z

The reported tests have been re-enabled and no longer fail on waterman. This was not an intermittent issue, so there is no need to wait to see if the failure is randomly occurring. Marking closed.

fryeguy52 added the type: bug, pkg: Belos, and client: ATDM labels Aug 22, 2018

fryeguy52 added this to the Initial cleanup of new ATDM builds of Trilinos milestone Aug 22, 2018

mhoemmen mentioned this issue Oct 2, 2018

Stokhos tests failing in Trilinos-atdm-white-ride-cuda-9.2-debug-pt build #3542

Closed

hkthorn mentioned this issue Oct 2, 2018

Test TeuchosNumerics_LAPACK_test_MPI_1 fails in all 'debug' builds on power8 'ride' #2410

Closed

bartlettroscoe added the Disabled Tests label Oct 15, 2018

hkthorn mentioned this issue Oct 17, 2018

Fixes condition estimate computation to address issue #3338 #3658

Merged

9 tasks

hkthorn added a commit that referenced this issue Oct 18, 2018

Merge pull request #3658 from hkthorn/develop

b64a5a4

Fixes condition estimate computation to address issue #3338

hkthorn mentioned this issue Nov 14, 2018

Re-enable Belos pseudo-block CG testing per issue #3338 #3875

Merged

9 tasks

bartlettroscoe added a commit that referenced this issue Nov 14, 2018

Merge pull request #3875 from hkthorn/develop

3c3d9df

Re-enable Belos pseudo-block CG testing per issue #3338

hkthorn closed this as completed Nov 27, 2018

bartlettroscoe added the PA: Linear Solvers label and removed the Disabled Tests label Nov 27, 2018
