Belos test failing on ATDM waterman builds #3338

Closed
fryeguy52 opened this issue Aug 22, 2018 · 20 comments
Labels
client: ATDM Any issue primarily impacting the ATDM project PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area pkg: Belos type: bug The primary issue is a bug in Trilinos code or tests

Comments

@fryeguy52
Contributor

fryeguy52 commented Aug 22, 2018

CC: @trilinos/belos , @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe

Next Action Status

PR #3363 was merged to 'develop' on 8/27/2018, and the test Belos_Tpetra_MVOPTester_complex_test_MPI_4 passed on 8/28/2018, but the test Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4 is still failing. PR #3454, merged on 9/19/2018, disabled this test in these waterman builds in commit cb9a9c9, and the test disappeared from the trilinos-atdm-waterman-*debug* builds after 2018-09-21, as shown here. Next: Fix these?

Description

As shown in this query the tests:

  • Belos_Tpetra_MVOPTester_complex_test_MPI_4
  • Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4

are failing in some of the Trilinos-atdm-waterman-* builds.

The test Belos_Tpetra_MVOPTester_complex_test_MPI_4 is failing on the two cuda-9.2 builds:

  • Trilinos-atdm-waterman-cuda-9.2-opt
  • Trilinos-atdm-waterman-cuda-9.2-debug

The test Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4 is failing on the two debug builds:

  • Trilinos-atdm-waterman-gnu-debug-openmp
  • Trilinos-atdm-waterman-cuda-9.2-debug

Steps to Reproduce

One should be able to reproduce this failure on the machine waterman as described in:

More specifically, the commands given for the system waterman are provided at:

The exact commands to reproduce this issue should be:


$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Belos=ON \
  $TRILINOS_DIR

$ make NP=20

$ bsub -x -Is -n 20 ctest -j20
@fryeguy52 fryeguy52 added type: bug The primary issue is a bug in Trilinos code or tests pkg: Belos client: ATDM Any issue primarily impacting the ATDM project labels Aug 22, 2018
@srajama1
Contributor

@hkthorn : Do you have time to take a look?

@hkthorn
Contributor

hkthorn commented Aug 22, 2018

@srajama1 : I will try to take a look by the end of the week.

@hkthorn
Contributor

hkthorn commented Aug 28, 2018

@fryeguy52 @bartlettroscoe Please let me know if this is still an issue after the OpenMPI version is changed for the ATDM waterman testing.

@bartlettroscoe
Member

After the merge of PR #3363 last night, the test Belos_Tpetra_MVOPTester_complex_test_MPI_4 is no longer failing in any of the 4 'waterman' builds as shown here but the test Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4 is still failing.

@hkthorn
Contributor

hkthorn commented Oct 2, 2018

Interestingly enough, the seg fault is happening in the condition estimation code. A quick grep of Belos shows that this test is the only test that includes, by default, the computation of a condition estimate:

grep -r 'Estimate Condition Number'
src/BelosPseudoBlockCGSolMgr.hpp: \note Only works if "Estimate Condition Number" is set on parameterlist
src/BelosPseudoBlockCGSolMgr.hpp: if (params->isParameter ("Estimate Condition Number")) {
src/BelosPseudoBlockCGSolMgr.hpp: genCondEst_ = params->get ("Estimate Condition Number", genCondEst_default_);
src/BelosPseudoBlockCGSolMgr.hpp: pl->set("Estimate Condition Number", static_cast<bool>(genCondEst_default_),
tpetra/test/BlockCG/test_pseudo_bl_cg_hb.cpp: belosList->set("Estimate Condition Number", true);

So, that narrows down the search. If condition estimation is turned off in this test, it passes without seg fault.

@hkthorn
Contributor

hkthorn commented Oct 2, 2018

FYI, if I enable this option in the Epetra pseudo-block CG test, it will likewise seg fault on this platform.

@mhoemmen
Contributor

mhoemmen commented Oct 2, 2018

@hkthorn Does condition number estimation use Ifpack? That was always a bit funny.

@hkthorn
Contributor

hkthorn commented Oct 2, 2018

@mhoemmen Nope, it directly calls LAPACK's _STEQR to compute the eigenvalues of the symmetric matrix. It's the call to LAPACK that is causing the issue. I have hijacked the call to _STEQR within the BlockCG solver to feed it two simple, small vectors that represent a diagonal matrix, and the call to LAPACK still seg faults.
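
For readers without a working LAPACK on hand, here is a minimal, LAPACK-free sketch of the kind of computation being discussed: the extreme eigenvalues of a symmetric tridiagonal matrix given by a diagonal vector `d` and an off-diagonal vector `e`, and their ratio as a condition estimate. This is not the Belos code and it does not call _STEQR. It swaps in plain bisection on Sturm counts, and every function name in it is hypothetical. It assumes the matrix is symmetric positive definite, so the smallest eigenvalue is positive.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Count the eigenvalues of the symmetric tridiagonal matrix (diagonal d,
// off-diagonal e) that lie below x, using the signs of the LDL^T pivots of
// T - x*I. An eigenvalue exactly equal to x may be included in the count.
int countEigenvaluesBelow(const std::vector<double>& d,
                          const std::vector<double>& e, double x) {
  int count = 0;
  double q = 1.0;
  for (std::size_t i = 0; i < d.size(); ++i) {
    const double off2 = (i == 0) ? 0.0 : e[i - 1] * e[i - 1];
    q = d[i] - x - off2 / q;
    if (q == 0.0) q = -1e-300;  // nudge an exactly zero pivot off zero
    if (q < 0.0) ++count;
  }
  return count;
}

// k-th smallest eigenvalue (k = 0 is the smallest), found by bisection
// between the Gershgorin bounds of the matrix.
double kthEigenvalue(const std::vector<double>& d,
                     const std::vector<double>& e, std::size_t k) {
  assert(!d.empty() && e.size() + 1 == d.size() && k < d.size());
  double lo = d[0], hi = d[0];
  for (std::size_t i = 0; i < d.size(); ++i) {
    const double left = (i > 0) ? std::abs(e[i - 1]) : 0.0;
    const double right = (i + 1 < d.size()) ? std::abs(e[i]) : 0.0;
    lo = std::min(lo, d[i] - left - right);
    hi = std::max(hi, d[i] + left + right);
  }
  for (int iter = 0; iter < 200; ++iter) {
    const double mid = 0.5 * (lo + hi);
    if (countEigenvaluesBelow(d, e, mid) > static_cast<int>(k)) {
      hi = mid;  // at least k+1 eigenvalues lie below mid
    } else {
      lo = mid;
    }
  }
  return 0.5 * (lo + hi);
}

// Ratio of the largest to the smallest eigenvalue.
double conditionEstimate(const std::vector<double>& d,
                         const std::vector<double>& e) {
  return kthEigenvalue(d, e, d.size() - 1) / kthEigenvalue(d, e, 0);
}
```

Calling `conditionEstimate({1.0, 2.0, 4.0}, {0.0, 0.0})` mirrors the "two small vectors that represent a diagonal matrix" experiment. The eigenvalues are the diagonal entries, so the estimate should come out close to 4.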

@hkthorn
Contributor

hkthorn commented Oct 2, 2018

@mhoemmen I did find that the STEQR templated wrapper was incorrectly defined in Teuchos_LAPACK.hpp, but the specialization for each of the four primary scalar types is correct. So, while that should be corrected, it is not the problem.

@bartlettroscoe
Member

@mhoemmen, @hkthorn,

FYI: We have been having problems with LAPACK on these IBM Power machines for some time. See #1208, #2454, #2410, etc.

It would be great to find someone to get to the bottom of #2410 especially and to beef up the acceptance tests for these functions. @jwillenbring, is this something Tech-X contractors could do?

@mhoemmen
Contributor

mhoemmen commented Oct 2, 2018

@hkthorn wrote:

I did find that the STEQR templated wrapper was incorrectly defined in Teuchos_LAPACK.hpp

How should it be defined?

@hkthorn
Contributor

hkthorn commented Oct 2, 2018

@mhoemmen There are some arguments that are MagnitudeType, not ScalarType. The work vector and the diagonal and off diagonal should be MagnitudeType. I can change that, no problem, but that's not the issue. Actually, the Teuchos LAPACK test has been disabled on this platform:

ride/tweaks/CUDA-9.2-DEBUG-CUDA-POWER8-KEPLER37.cmake:ATDM_SET_ENABLE(TeuchosNumerics_LAPACK_test_MPI_1_DISABLE ON)
ride/tweaks/GNU-DEBUG-OPENMP-POWER8.cmake:ATDM_SET_ENABLE(TeuchosNumerics_LAPACK_test_MPI_1_DISABLE ON)

This test fails exactly because the call to _STEQR seg faults within the test. Thus, I would say that there is an issue with the LAPACK library on this machine, as @bartlettroscoe has mentioned. So, that means that condition estimate computations performed in Belos pseudo-block CG will not succeed on this platform until that is resolved.
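
To illustrate the MagnitudeType point above: in LAPACK's complex variants of STEQR (for example ZSTEQR), the diagonal `D`, the off-diagonal `E` and the workspace `WORK` are real arrays, and only the eigenvector matrix `Z` is complex. The sketch below shows that split with a small traits class. `MagnitudeOf`, `SteqrBuffers` and `makeSteqrBuffers` are hypothetical names made up for this example and are not the Teuchos declarations. The workspace length `max(1, 2*n - 2)` follows the LAPACK documentation for the case where eigenvectors are requested.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a scalar-traits class (not the Teuchos one):
// the magnitude type of a real scalar is itself, and the magnitude type of
// std::complex<Real> is Real.
template <typename Scalar>
struct MagnitudeOf { using type = Scalar; };

template <typename Real>
struct MagnitudeOf<std::complex<Real>> { using type = Real; };

// Buffers for a STEQR-style call: d, e and work hold magnitudes, while only
// the eigenvector matrix z holds values of the full scalar type.
template <typename Scalar>
struct SteqrBuffers {
  using MagnitudeType = typename MagnitudeOf<Scalar>::type;
  std::vector<MagnitudeType> d;     // diagonal, length n
  std::vector<MagnitudeType> e;     // off-diagonal, length n - 1
  std::vector<MagnitudeType> work;  // workspace, length max(1, 2n - 2)
  std::vector<Scalar> z;            // eigenvectors, n * n, column-major
};

// Allocate zero-initialized buffers for an n-by-n tridiagonal problem.
template <typename Scalar>
SteqrBuffers<Scalar> makeSteqrBuffers(std::size_t n) {
  assert(n > 0);
  SteqrBuffers<Scalar> b;
  b.d.resize(n);
  b.e.resize(n - 1);
  b.work.resize(std::max<std::size_t>(1, 2 * n - 2));
  b.z.resize(n * n);
  return b;
}
```

With `Scalar = std::complex<double>`, the three real-valued buffers are `std::vector<double>`, which is the distinction a generic wrapper that types everything as `ScalarType` would lose.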

@bartlettroscoe
Member

@hkthorn, issues with the NetLIB LAPACK are also reported in #3542. We might need to go back to the beginning and more carefully look at these installations of BLAS and LAPACK and get the LAPACK tests in Trilinos to work (see #2410).

@hkthorn
Contributor

hkthorn commented Oct 4, 2018

Interestingly enough, in trying to debug this issue it has become apparent that the person who implemented condition estimation for the CG classes in Belos did not finish the implementation for the single-vector BelosCGIter class. For the Tpetra test, where there is one right-hand side, the condition number comes back as NaN because the eigenvalues are all zeros.
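
The NaN described above is what a plain largest-over-smallest ratio produces when every eigenvalue is zero, since 0/0 is NaN under IEEE 754 arithmetic. A minimal sketch follows, assuming the estimate is formed as that ratio. The function names are hypothetical and this is not the Belos implementation. The second function shows one possible guard that reports failure instead of returning NaN.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Unguarded estimate: largest eigenvalue divided by the smallest.
// With all-zero input this evaluates 0.0 / 0.0, which is NaN under IEEE 754.
double naiveConditionEstimate(const std::vector<double>& eigenvalues) {
  assert(!eigenvalues.empty());
  const auto mm = std::minmax_element(eigenvalues.begin(), eigenvalues.end());
  return *mm.second / *mm.first;
}

// Guarded estimate: returns false and leaves `estimate` untouched unless the
// smallest eigenvalue is strictly positive (this also rejects NaN input).
bool checkedConditionEstimate(const std::vector<double>& eigenvalues,
                              double& estimate) {
  if (eigenvalues.empty()) return false;
  const auto mm = std::minmax_element(eigenvalues.begin(), eigenvalues.end());
  if (!(*mm.first > 0.0)) return false;
  estimate = *mm.second / *mm.first;
  return true;
}
```

A caller can test the unguarded result with `std::isnan`, or use the guarded variant so that eigenvalues that were never filled in show up as an explicit failure rather than as a NaN in the solver output.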

@hkthorn
Contributor

hkthorn commented Oct 4, 2018

Oh, yeah, and the condition number computation can easily be wrong for multiple right-hand sides due to failures in the logic.

@bartlettroscoe
Member

PR #3454 merged on 9/19/2018 disabled this test in these waterman builds in commit cb9a9c9 and this test disappeared from the trilinos-atdm-waterman-*debug* builds after 2018-09-21 as shown here.

I am marking this with "Disabled Tests" and leaving it open as per policy/process.

@bartlettroscoe bartlettroscoe added the Disabled Tests Issue has been partially addressed by disabling *all* of the failing tests related to the issue label Oct 15, 2018
hkthorn added a commit to hkthorn/Trilinos that referenced this issue Oct 17, 2018
The issue trilinos#3338 is the failure of the Tpetra pseudo-block CG test on an important test platform.

There are two issues, one the _STEQR provided by the LAPACK libraries
on that platform is seg faulting and, two, the general condition estimate
computation performed by single-vector/pseudo-block CG is generally wrong.
The condition estimate computation is now implemented for the single-vector
CG kernel, when it wasn't before. Furthermore, it is corrected for the
pseudo-block CG kernel, where it was wrongly storing information when
there was more than one right-hand side.

Tests have been fixed for the Tpetra pseudo-block CG, so that they output
information, including the condition estimation.  Tests have been augmented
to perform condition estimation for Epetra and perform an unpreconditioned
pseudo-block CG solve, using Epetra.
hkthorn added a commit that referenced this issue Oct 18, 2018
Fixes condition estimate computation to address issue #3338
@hkthorn
Contributor

hkthorn commented Oct 19, 2018

This issue is fixed per the merged commit #3658, so I will re-enable the tests on waterman. Note, the commit that disabled this test in the waterman scripts notes that it is because of #2466. That is related to, but separate from, the seg fault that was being observed on waterman.

@bartlettroscoe
Member

This issue is fixed per the merged commit #3658, so I will re-enable the tests on waterman.

@hkthorn, okay thanks. We just need to keep this issue open until we get confirmation that the test is passing on CDash.

@hkthorn
Contributor

hkthorn commented Oct 19, 2018

@bartlettroscoe No problem.

hkthorn added a commit to hkthorn/Trilinos that referenced this issue Nov 14, 2018
The test failures for pseudo-block CG mostly stemmed from an incomplete
implementation of the condition estimation code, which was only tested
for Tpetra.  This has been fixed, per issue trilinos#3338, so the tests should
be enabled again.
bartlettroscoe added a commit that referenced this issue Nov 14, 2018
Re-enable Belos pseudo-block CG testing per issue #3338
@hkthorn
Contributor

hkthorn commented Nov 27, 2018

The reported tests have been re-enabled and no longer fail on waterman. This was not an intermittent issue, so there is no need to wait to see if the failure is randomly occurring. Marking closed.

@hkthorn hkthorn closed this as completed Nov 27, 2018
@bartlettroscoe bartlettroscoe added PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area and removed Disabled Tests Issue has been partially addressed by disabling *all* of the failing tests related to the issue labels Nov 27, 2018
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
The issue trilinos#3338 is the failure of the Tpetra pseudo-block CG test on an important test platform.

There are two issues, one the _STEQR provided by the LAPACK libraries
on that platform is seg faulting and, two, the general condition estimate
computation performed by single-vector/pseudo-block CG is generally wrong.
The condition estimate computation is now implemented for the single-vector
CG kernel, when it wasn't before. Furthermore, it is corrected for the
pseudo-block CG kernel, where it was wrongly storing information when
there was more than one right-hand side.

Tests have been fixed for the Tpetra pseudo-block CG, so that they output
information, including the condition estimation.  Tests have been augmented
to perform condition estimation for Epetra and perform an unpreconditioned
pseudo-block CG solve, using Epetra.
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
The test failures for pseudo-block CG mostly stemmed from an incomplete
implementation of the condition estimation code, which was only tested
for Tpetra.  This has been fixed, per issue trilinos#3338, so the tests should
be enabled again.