Belos test failing on ATDM waterman builds #3338
Comments
@hkthorn : Do you have time to take a look?
@srajama1 : I will try to take a look by the end of the week.
@fryeguy52 @bartlettroscoe Please let me know if this is still an issue after the OpenMPI version is changed for the ATDM waterman testing.
Interestingly enough, the seg fault is happening in the condition estimation code. A quick grep of Belos shows that this test is the only test that includes, by default, the computation of a condition estimate:
So, that narrows down the search. If condition estimation is turned off in this test, it passes without seg fault.
FYI, if I enable this option in the Epetra pseudo-block CG test, it will likewise seg fault on this platform.
@hkthorn Does condition number estimation use Ifpack? That was always a bit funny.
@mhoemmen Nope, it directly calls LAPACK's _STEQR to compute the eigenvalues of the symmetric matrix. It's the call to LAPACK that is causing the issue. I have hijacked the call to _STEQR within the BlockCG solver to feed it two simple, small vectors that represent a diagonal matrix, and the call to LAPACK still seg faults.
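For readers unfamiliar with how a CG solver can produce a condition estimate from a symmetric tridiagonal eigenvalue problem, here is a minimal sketch. It assumes the standard CG-Lanczos relation, in which the CG step lengths `alpha` and residual ratios `beta` define a tridiagonal matrix whose extreme eigenvalues approximate those of the operator. It is not the Belos implementation: the function names are hypothetical, and a simple Sturm-sequence bisection stands in for the LAPACK _STEQR call so that the example has no external dependencies.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Count the eigenvalues of the symmetric tridiagonal matrix (diagonal d,
// off-diagonal e) that lie strictly below x, from the pivot signs of T - x*I.
int countBelow(const std::vector<double>& d, const std::vector<double>& e, double x) {
  int count = 0;
  double q = 1.0;
  for (std::size_t i = 0; i < d.size(); ++i) {
    const double off2 = (i == 0) ? 0.0 : e[i - 1] * e[i - 1];
    q = d[i] - x - off2 / q;
    if (q == 0.0) q = 1e-300;  // nudge an exact zero pivot so the recurrence stays defined
    if (q < 0.0) ++count;
  }
  return count;
}

// k-th smallest eigenvalue (0-based), by bisection between Gershgorin bounds.
double kthEigenvalue(const std::vector<double>& d, const std::vector<double>& e, int k) {
  const std::size_t n = d.size();
  double lo = d[0], hi = d[0];
  for (std::size_t i = 0; i < n; ++i) {
    const double r = (i > 0 ? std::abs(e[i - 1]) : 0.0) +
                     (i + 1 < n ? std::abs(e[i]) : 0.0);
    lo = std::min(lo, d[i] - r);
    hi = std::max(hi, d[i] + r);
  }
  for (int it = 0; it < 200; ++it) {
    const double mid = 0.5 * (lo + hi);
    if (countBelow(d, e, mid) > k) hi = mid; else lo = mid;
  }
  return 0.5 * (lo + hi);
}

// Hypothetical helper: condition estimate from m CG step lengths alpha[0..m-1]
// and the residual ratios beta[0..m-2] (extra trailing beta entries are ignored).
double conditionEstimate(const std::vector<double>& alpha, const std::vector<double>& beta) {
  const std::size_t m = alpha.size();
  assert(m > 0 && beta.size() + 1 >= m);
  std::vector<double> d(m), e(m > 1 ? m - 1 : 0);
  for (std::size_t k = 0; k < m; ++k) {
    d[k] = 1.0 / alpha[k] + (k > 0 ? beta[k - 1] / alpha[k - 1] : 0.0);
    if (k + 1 < m) e[k] = std::sqrt(beta[k]) / alpha[k];
  }
  return kthEigenvalue(d, e, static_cast<int>(m) - 1) / kthEigenvalue(d, e, 0);
}
```

As a hand-checkable case, two CG steps on diag(1, 3) with right-hand side (1, 1) give alpha = {0.5, 2/3} and beta = {0.25}. The tridiagonal matrix is then [[2, 1], [1, 2]] with eigenvalues 1 and 3, so the estimate is 3, which matches the true condition number.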
@mhoemmen I did find that the STEQR templated wrapper was incorrectly defined in Teuchos_LAPACK.hpp, but the specialization for each of the four primary scalar types is correct. So, while that should be corrected, it is not the problem.
FYI: We have been having problems with LAPACK on these IBM Power machines for some time. See #1208, #2454, #2410, etc. It would be great to find someone to get to the bottom of #2410 especially and to beef up the acceptance tests for these functions. @jwillenbring, is this something Tech-X contractors could do?
@hkthorn wrote:
How should it be defined?
@mhoemmen Some of the arguments should be MagnitudeType, not ScalarType: the work vector, the diagonal, and the off-diagonal. I can change that, no problem, but that's not the issue. Actually, the Teuchos LAPACK test has been disabled on this platform:
ride/tweaks/CUDA-9.2-DEBUG-CUDA-POWER8-KEPLER37.cmake:ATDM_SET_ENABLE(TeuchosNumerics_LAPACK_test_MPI_1_DISABLE ON)
This test fails exactly because the call to _STEQR seg faults within the test. Thus, I would say that there is an issue with the LAPACK library on this machine, as @bartlettroscoe has mentioned. So, that means that condition estimate computations performed in Belos pseudo-block CG will not succeed on this platform until that is resolved.
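To illustrate the MagnitudeType versus ScalarType distinction, here is a small sketch. LAPACK's complex STEQR routines take real arrays for the diagonal, the off-diagonal, and the workspace, while only the eigenvector array is complex. The names `MagnitudeOf` and `SteqrArgs` below are hypothetical stand-ins, not Teuchos types, and the sketch only models the argument types rather than the actual wrapper signature.

```cpp
#include <complex>
#include <type_traits>
#include <vector>

// Hypothetical traits class: maps a scalar type to its real "magnitude" type.
template <typename Scalar>
struct MagnitudeOf { using type = Scalar; };

template <typename Real>
struct MagnitudeOf<std::complex<Real>> { using type = Real; };

// Hypothetical bundle of the array arguments of an STEQR-style call.
template <typename Scalar>
struct SteqrArgs {
  using Magnitude = typename MagnitudeOf<Scalar>::type;
  std::vector<Magnitude> diag;     // D: always real, even for complex Scalar
  std::vector<Magnitude> offdiag;  // E: always real
  std::vector<Magnitude> work;     // WORK: real workspace
  std::vector<Scalar> vectors;     // Z: eigenvectors, stored in the scalar type
};
```

For real scalar types the two types coincide, which may be why a generic definition written entirely in terms of ScalarType can go unnoticed while the per-type specializations are correct.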
Interestingly enough, in trying to debug this issue, it has become apparent that whoever implemented condition estimation for the CG classes in Belos did not finish the implementation for the single-vector BelosCGIter class. For the Tpetra test, where there is one right-hand side, the condition number comes back as NaN because the eigenvalues are all zeros.
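The NaN follows directly from floating-point arithmetic: if every computed eigenvalue is zero, the ratio of largest to smallest is 0.0 / 0.0, which is NaN under IEEE 754. The sketch below shows that behavior together with one possible guard. Both function names are hypothetical and are not taken from Belos.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Naive ratio of largest to smallest eigenvalue. With all-zero input this
// evaluates 0.0 / 0.0, which is NaN under IEEE 754 arithmetic.
double naiveConditionNumber(const std::vector<double>& eigs) {
  assert(!eigs.empty());
  const auto mm = std::minmax_element(eigs.begin(), eigs.end());
  return *mm.second / *mm.first;
}

// Hypothetical guard: report failure instead of returning a NaN or infinite
// estimate, so that the caller can tell "not computed" from a real value.
bool guardedConditionNumber(const std::vector<double>& eigs, double& cond) {
  if (eigs.empty()) return false;
  const auto mm = std::minmax_element(eigs.begin(), eigs.end());
  if (!(*mm.first > 0.0)) return false;  // zero, negative, or NaN smallest eigenvalue
  cond = *mm.second / *mm.first;
  return std::isfinite(cond);
}
```

A guard of this kind would not fix an unfinished implementation, but it would make an unfilled eigenvalue array show up as an explicit failure rather than a silent NaN in the test output.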
Oh, yeah, and the condition number computation can easily be wrong for multiple right-hand sides due to failures in the logic.
The issue trilinos#3338 is the failure of the Tpetra pseudo-block CG test on an important test platform. There are two issues: one, the _STEQR provided by the LAPACK libraries on that platform is seg faulting, and two, the condition estimate computation performed by single-vector/pseudo-block CG is generally wrong. The condition estimate computation is now implemented for the single-vector CG kernel, where it was missing before. Furthermore, it is corrected for the pseudo-block CG kernel, where it was wrongly storing information when there was more than one right-hand side. Tests have been fixed for the Tpetra pseudo-block CG, so that they output information, including the condition estimate. Tests have been augmented to perform condition estimation for Epetra and to perform an unpreconditioned pseudo-block CG solve using Epetra.
Fixes condition estimate computation to address issue #3338
@bartlettroscoe No problem.
The test failures for pseudo-block CG mostly stemmed from an incomplete implementation of the condition estimation code, which was only tested for Tpetra. This has been fixed, per issue trilinos#3338, so the tests should be enabled again.
Re-enable Belos pseudo-block CG testing per issue #3338
The reported tests have been re-enabled and no longer fail on waterman. This was not an intermittent issue, so there is no need to wait to see if the failure is randomly occurring. Marking closed.
CC: @trilinos/belos , @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe
Next Action Status
PR #3363 merged to 'develop' on 8/27/2018. The test Belos_Tpetra_MVOPTester_complex_test_MPI_4 passed on 8/28/2018, but the test Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4 is still failing. PR #3454, merged on 9/19/2018, disabled this test in these waterman builds in commit cb9a9c9, and this test disappeared from the trilinos-atdm-waterman-*debug* builds after 2018-09-21, as shown here. Next: Fix these?
Description
As shown in this query, tests are failing in some of the Trilinos-atdm-waterman-* builds:
The test Belos_Tpetra_MVOPTester_complex_test_MPI_4 is failing on the two cuda-9.2 builds.
The test Belos_Tpetra_PseudoBlockCG_hb_test_MPI_4 is failing on the two debug builds.
Steps to Reproduce
One should be able to reproduce this failure on the machine waterman as described in:
More specifically, the commands given for the system waterman are provided at:
The exact commands to reproduce this issue should be: