Test TeuchosNumerics_LAPACK_test_MPI_1 fails in all 'debug' builds on power8 'ride' #2410
Comments
What is strange about this is that we see only 1 failing Belos test and 2 failing Anasazi tests for the build and only the one Belos test looks to be segfaulting. But with the corresponding build there are 8 failing Belos tests and 50 failing Anasazi tests, many of which are segfaults, but this test Again, I suspect this might be related to #1208, which concluded that this compiler on this system is just broken when it comes to producing mixed-language programs.
Note that this test actually passes in all of the So that compiler defect analyzed in #1208 does not seem to be impacting this test when compiler optimizations are turned on.
Since this test is passing in all of the other builds, including all of the Do any of you @trilinos/teuchos developers have a problem with that? I will create a PR that disables this test for just the builds where it fails, which are:
I will create a PR for this.
…debug builds (trilinos#2410) This passes in the corresponding RELEASE builds so it is okay to disable this test in these DEBUG builds as we are not really losing any real testing. And note that almost all of the Belos and Anasazi and other tests that use this LAPACK interface pass in these DEBUG builds so disabling this one test is not too serious. But if someone wants to try to debug it and then turn it back on, more power to them. Note that this is likely related to the mixed-language compiler defect reported in trilinos#1208. That seems to be a strange compiler defect.
So we can see what tests are added or not and why. This was useful for checking on disabling some tests for trilinos#2410.
I created PR #2447 that disables this test for just these builds. Can some @trilinos/teuchos developer please approve this PR? Otherwise, since this is not really changing Trilinos at all (just the ATDM Trilinos test drivers), it can't possibly break Trilinos so I will merge in a few hours. I just want to see what the auto PR tester does for a PR like this. I suspect that because it touches a |
The PR #2447 has been merged. These failures should go away tomorrow. Putting in review.
The test and And this test is shown passing in all of the remaining builds in: Therefore, closing this as complete.
Closing as complete for real (I always seem to forget to click that "Close and comment" button).
As part of the updated process listed at: I assigned the labels "Disabled Tests" and "Stalled" so I am reopening this issue. That way, people can be reminded that this is still a problem (we just avoided it by disabling tests for the ATDM builds).
…g' builds on 'white'/'ride' (trilinos#2410) Trying to run again with updated NetLIB BLAS and LAPACK (see trilinos#2454).
Given the updated NetLIB BLAS and LAPACK being used on 'white' and 'ride' (see #2454) I tried re-enabling the test
that returned:
The failing tests were:
The failing test output for the build
That is identical to the failure output reported above when this Issue was first created. Also, I verified in the configure output that the new NetLIB BLAS and LAPACK installs are indeed being used as it showed:
So whatever is causing this test to fail did not get fixed by updating BLAS and LAPACK. Perhaps there is a real defect in the Teuchos LAPACK wrappers and/or the test code for this test? |
@bartlettroscoe - to re-comment as well, the OpenBLAS tests for the packages we install (fairly limited number) do also all pass and do have C to FORTRAN calling included. That’s what makes me wonder if this might be a deeper issue too. |
This test is also failing on both debug builds on waterman, shown here. Those builds are:
They fail with similar output:
Should we disable |
@fryeguy52 @bartlettroscoe Please let me know if the LAPACK test is still failing after the OpenMPI version is changed for the ATDM waterman testing. |
Replace working tridiagonal eigensolver with the one that causes the seg fault on ride/white/waterman.
This is the only function that continues to segfault on Power8 and Power9 builds using NETLIB LAPACK in debug builds. All of the Trilinos code downstream has removed a dependence on this LAPACK function. This will allow the rest of the LAPACK tests to be run in Power8 and Power9 'debug' builds.
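For context, STEQR() refers to the LAPACK routines (for example DSTEQR) that compute the eigenvalues, and optionally the eigenvectors, of a symmetric tridiagonal matrix with the implicit QL/QR method. The sketch below does not call Teuchos or LAPACK and is not the failing test. It is a minimal pure-Python illustration, assuming a small test matrix chosen here, of the kind of independent reference that a tridiagonal eigensolver result could be checked against. It uses bisection with Sturm counts rather than QL/QR, and all function names are hypothetical.

```python
import math


def count_eigenvalues_below(diag, offdiag, x, tiny=1e-30):
    """Sturm count for the symmetric tridiagonal matrix T.

    Counts the negative pivots of the LDL^T factorization of T - x*I, which
    equals the number of eigenvalues less than x. A pivot that is numerically
    zero is treated as slightly negative, so an eigenvalue exactly equal to x
    may be counted as well.
    """
    count = 0
    pivot = 1.0
    for i, d in enumerate(diag):
        coupling = offdiag[i - 1] ** 2 / pivot if i > 0 else 0.0
        pivot = (d - x) - coupling
        if abs(pivot) < tiny:
            pivot = -tiny  # avoid dividing by zero in the next step
        if pivot < 0.0:
            count += 1
    return count


def gershgorin_bounds(diag, offdiag):
    """Interval [lo, hi] guaranteed to contain every eigenvalue of T."""
    n = len(diag)
    lo, hi = float("inf"), float("-inf")
    for i in range(n):
        left = abs(offdiag[i - 1]) if i > 0 else 0.0
        right = abs(offdiag[i]) if i < n - 1 else 0.0
        lo = min(lo, diag[i] - left - right)
        hi = max(hi, diag[i] + left + right)
    return lo, hi


def tridiagonal_eigenvalues(diag, offdiag, tol=1e-12):
    """Eigenvalues in ascending order, each located by bisection."""
    lo, hi = gershgorin_bounds(diag, offdiag)
    eigenvalues = []
    for k in range(len(diag)):
        a, b = lo, hi
        while b - a > tol:
            mid = 0.5 * (a + b)
            if count_eigenvalues_below(diag, offdiag, mid) > k:
                b = mid  # the k-th eigenvalue is at or below mid
            else:
                a = mid  # the k-th eigenvalue is at or above mid
        eigenvalues.append(0.5 * (a + b))
    return eigenvalues


if __name__ == "__main__":
    # Assumed test matrix: tridiag(-1, 2, -1), whose eigenvalues are known
    # in closed form as 2 - 2*cos(k*pi/(n+1)) for k = 1..n.
    n = 5
    computed = tridiagonal_eigenvalues([2.0] * n, [-1.0] * (n - 1))
    exact = [2.0 - 2.0 * math.cos(k * math.pi / (n + 1)) for k in range(1, n + 1)]
    for c, e in zip(computed, exact):
        print(f"{c:.10f}  {e:.10f}")
```

A comparison like this only checks numerical results. It says nothing about why the real STEQR() call segfaults in the 'debug' builds discussed here.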
…EQR() test (trilinos#2410) At least this way we are running the Teuchos LAPACK tests for the LAPACK functions being used downstream in Trilinos.
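The approach described in this commit, running the whole test but skipping one unit test under a build flag, can be illustrated by analogy. The sketch below uses Python's unittest purely as an illustration; it is not how Teuchos implements the disable. The flag name `disable_steqr` stands in for an option such as TeuchosNumerics_DISABLE_STEQR_TEST=ON, and the test names and bodies are hypothetical.

```python
import unittest


def make_suite(disable_steqr, executed):
    """Build a small suite; `executed` records which test bodies actually ran."""

    class LapackLikeTests(unittest.TestCase):
        def test_other_routine(self):
            # Stand-in for the LAPACK tests that keep running in every build.
            executed.append("other_routine")
            self.assertAlmostEqual(1.0 + 3.0, 4.0)

        @unittest.skipIf(disable_steqr, "STEQR-like test disabled for this build")
        def test_steqr_like(self):
            # Stand-in for the single unit test that is switched off.
            executed.append("steqr_like")
            self.assertTrue(True)

    return unittest.defaultTestLoader.loadTestsFromTestCase(LapackLikeTests)


def run_suite(disable_steqr):
    """Run the suite and return (executed test names, skip count, success)."""
    executed = []
    result = unittest.TestResult()
    make_suite(disable_steqr, executed).run(result)
    return sorted(executed), len(result.skipped), result.wasSuccessful()


if __name__ == "__main__":
    print(run_suite(disable_steqr=True))
    print(run_suite(disable_steqr=False))
```

The point of the design is that the skip is scoped to one unit test, so the remaining checks still provide coverage in the builds where the one routine misbehaves.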
PR #4064 which enables the whole test Putting 'in review' awaiting results on CDash ... |
The test
Note As of right now, all of the 'debug' builds (but not the 'release-debug' builds) on 'white', 'ride', and 'waterman' have already posted to CDash and we can see this test running and passing in 32 of the ATDM Trilinos builds in this query. And there are no more disables for this test in any of the ATDM Trilinos builds as shown on 'develop' just now by the commands:
Therefore, I think this is sufficient evidence that this test is running and passing in every ATDM Trilinos build, including all of the 'debug' builds on 'white', 'ride', and 'waterman'. The only thing that is not running is the I have added the "Disabled Tests" label but I actually think now that we have got Trilinos to stop using this LAPACK
…s:develop' (19158f2). * trilinos-develop: Add back TeuchosNumerics_DISABLE_STEQR_TEST=ON (trilinos#2410, trilinos#6166) MueLu: fixed build error kokkos-kernels: update gcc check for c++14 workaround macro Ifpack2 ScaledDampedResidual: Cache vectors Tpetra/MueLu: switched performance tests to StackedTimer
…s:develop' (2bfd2c7). * trilinos-develop: (177 commits) Add a fix for a stk cmake file Promote atdm ats2 gnu+dbg and cuda+gnu+dbg to 'Specialized' (CDOFA-72) Intrepid2: remove unnecessary finalize calls in unit tests Disable STEQR() LAPACK test on ats2 deug builds (trilinos#2410, trilinos#6166) Disable some timing out ROL tests (trilinos#6124) Disable timing out Tempus tests on ats2 (trilinos#6009) fixed some broken teuchos unit tests and removed missed deprecated methods Promoting ats2+gnu+opt build which is 100% clean (CDOFA-27) removed deprecated overload of << in SerialDenseMatrix, SerialBandDenseMatrix, SerialSymDenseMatrix, and SerialDenseVector removed deprecated Teuchos::Comm helpers reduceAll and scan that take pointers to return arguments removed deprecated MPITraits class removed deprecated ArrayArg class removed deprecated LAPACK::GEBAL method that takes ilo and ihi by value removed deprecated LAPACK::POSVX and LAPACK::GESVX methods that take EQUED by value removed deprecated LAPACK::TREXC method that takes ifst and ilst by value removed deprecated count method in ArrayRCP, RCP, and RCPNode removed deprecated PerformanceMonitorBase::clearTimer methods Intrepid2: Temporarily disabling tests failing on some machines (Issue trilinos#6246) Remove misspelled RTop_HIDE_DEPRECATED_CODE (trilinos#6217) Disable/hide deprecated code (trilinos#6217) ...
…s:develop' (2bfd2c7). * trilinos-develop: (186 commits) zoltan2: upgrading testing for issues fixed in trilinos#6375 tpetra: disable kokkos warnings in initialize tests Tacho - disable matrix market reader/writer test to improve PR test stability. kokkos: cmake fixes for clang +/- cuda kokkos/cmake/kokkos_arch.cmake: Fix for clang + NO cuda Fix some scopes in nlnml_nonlinearlevel.cpp Zoltan2: fix reversal of Cuthill McKee ordering Add a fix for a stk cmake file Promote atdm ats2 gnu+dbg and cuda+gnu+dbg to 'Specialized' (CDOFA-72) Intrepid2: remove unnecessary finalize calls in unit tests Disable STEQR() LAPACK test on ats2 deug builds (trilinos#2410, trilinos#6166) Disable some timing out ROL tests (trilinos#6124) Disable timing out Tempus tests on ats2 (trilinos#6009) Intrepid2: reenabling JacobiLegendrePolynomial_Tests and Hierarchical_Basis_Tests. fixed some broken teuchos unit tests and removed missed deprecated methods Promoting ats2+gnu+opt build which is 100% clean (CDOFA-27) removed deprecated overload of << in SerialDenseMatrix, SerialBandDenseMatrix, SerialSymDenseMatrix, and SerialDenseVector removed deprecated Teuchos::Comm helpers reduceAll and scan that take pointers to return arguments removed deprecated MPITraits class removed deprecated ArrayArg class ...
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
This issue was closed due to inactivity for 395 days.
CC: @trilinos/teuchos
Next Action Status:
PR #2447, which disabled the test, was merged on 3/23/2018. PR #4064, which enables the whole test TeuchosNumerics_LAPACK_test_MPI_1 but disables the single unit test for STEQR(), was merged to 'develop' on 12/18/2018. Next: Watch for test running and passing (minus STEQR() unit test) on 'release-debug' and 'opt' builds on 'white', 'ride', and 'waterman' on 12/19/2018 ...
Description
The test TeuchosNumerics_LAPACK_test_MPI_1 segfaults on the 'debug' builds Trilinos-atdm-white-ride-cuda-debug and Trilinos-atdm-white-ride-gnu-debug-openmp on 'ride' and 'white' but passes in all of the 'opt' builds on these same machines as well as for all of the builds on hansen as shown this morning in:
The failing tests all show segfaults showing the output:
What is interesting is that this test only failed in all of the Trilinos builds that were done yesterday in the query:
Might this be the same error reported in #1208 that we basically gave up on?
Steps to Reproduce
Following the instructions at:
one can reproduce this failing test by enabling the Teuchos package for the builds gnu-debug-openmp or cuda-debug and running the failing test.
Related issues