Several ROL_example_PDE-OPT_XXX tests failing and timing out in 'waterman' CUDA builds starting 2019-10-17 #6124
Comments
@trilinos/rol, it should be pretty straightforward to get on 'waterman', revert the three PRs shown here one at a time, and see which one triggered these changes. As I mentioned above, it is possible that the ATDM Trilinos configuration changes in PRs #6104 and #6105 caused this, but I don't think so because the change in #6104 that sets
@trilinos/rol, could enabling Intrepid cause these tests to fail? If so, why only on 'waterman'? Why not also on 'ride'? |
Those tests have always depended on Intrepid. I believe they also use Amesos2. Were they passing before these commits? Were they even enabled before the commits? If these are timeouts, I would look for performance issues in the upstream dependencies, or hardware/test setup issues (proc numbers, etc.). |
@dridzal, you may be on to something. Looking at this query, the number of ROL tests in the build The new ROL tests are shown in this query and we see the following 13 new tests:
That is the same set of failing tests shown above. Note that the following 18 new passing tests got added as well:
Shoot, I just noticed that these are the same 13 failing tests that were identified to be failing on 'ride' in the build The reason these same tests are not showing up in the non-RDC CUDA builds on 'ride' is that the 'ride' configuration currently is not enabling the extra SPARC packages and TPLs. Therefore, as shown in this query, we only see, for example, the test So rather than setting Otherwise, this issue is really just a duplicate of #3543. |
Agreed, for now. A new branch of ROL works on GPUs, and will be merged fairly soon. |
Since these tests will not work on any CUDA build, might as well take them out globally. As part of this I went ahead and disabled some failing Zoltan (trilinos#4042) and TrilinosCouplings (trilinos#3749) tests that don't pass for the Primary Tested CUDA builds. NOTE: The file ATDMDisables.cmake is **not** included in a PT (primary tested) build, so these tests will still run there; therefore the PT builds show the true state of these different build configurations.
I tried using 'gnu-7.2.0' in the build name and it did not let me do it. So I cleaned up the handling of the parsing and matching some.
I created PR #6126 that disables these tests in all ATDM Trilinos non-PT CUDA builds. I also manually merged this branch to 'atdm-nightly' in commit b152209 so we can see the impact of this in nightly ATDM Trilinos testing (while the Trilinos PR tester is down). We just need to confirm that these test failures are gone on CDash, and then we can add the "Disabled Tests" label and get this off of our list. |
@dridzal, there are more new ROL failing tests than just these 13 on 'waterman'. For example, see the randomly failing test
Since SPARC is not using Intrepid, and GEMMA and EMPIRE are not using ROL, I would like to just set Okay? |
…s:develop' (429b3db). * trilinos-develop: (38 commits) stk snapshot as of 10/21/2019 Ifpack2: Adding Richardson support to Ifpack2::Relaxation PullRequestLinuxDriver.sh: remove ".sandia.gov" from the no_proxy environemt variable MueLu: Stuff for running w/o a pre-smoother Removing dependence on optipack (better guards; removed XML ) trilinos#4957 Clean up handling of compiler parsing matching (trilinos#6124) ATDM: Disable ROL tests that don't work with CUDA (trilinos#6124) ATDM: Move broad disables of individual tests to the ATDMDisables.cmake file ATDM: Remove unused files after switching to waterman/tweaks/Tweaks.cmake SQUASH AGAINST: ATDM: Add support for a single <system_name>/tweaks/Tweaks.cmake file ATDM: Disable two Intrepid2 tests for gnu+openmp+debug build (trilinos#6020) ATDM: Disable several Tempus tests in all waterman debug builds (trilinos#6009) ATDM: Switch to usage of Tweaks.cmake file for 'ride' builds ATDM: Disable several Tempus tests in all waterman debug builds (trilinos#6009) SEACAS: Allow for disable of explore exe (trilinos#6008) ATDM: Disable SEACAS explore build on cuda+rdc builds (trilinos#6008) ATDM: Switch to use of Tweaks.cmake file for waterman builds ATDM: Add support for a single <system_name>/tweaks/Tweaks.cmake file Tempus: Cleanup some warnings. rythmos: fixed printing of Teuchos::SerialDense* to remove deprecated behavior. Following pattern in trilinos#5374. ...
@trilinos/rol, @dridzal, Any desire to fix the test |
@bartlettroscoe , I don't have time to work on this. Let's disable. Once our new GPU-compatible stack is merged into ROL develop, we'll want to enable all disabled tests on these machines. |
@dridzal, okay, I will disable and post a PR. |
FYI: As shown in this query the tests:
both randomly hang and timeout in the build:
showing the error:
when they hang. Otherwise, these tests pass in less than 40 seconds. As shown in this query, these tests only throw and hang in this one build and in no other builds. Therefore, I will disable these two tests in this one build as well ... |
…man-disable-more-rol-tests Automatically Merged using Trilinos Pull Request AutoTester PR Title: ATDM: Disable randomly hanging ROL_example_PDE-OPT_ginzburg-landau_example_0[1|2]_MPI_4tets (#6124) PR Author: bartlettroscoe
…s:develop' (9be559a). * trilinos-develop: ATDM: Disable randomly hanging ROL tets (trilinos#6124)
As shown in the table below, I believe that all of the failing tests associated with this issue have now been disabled. I will now add the "Disabled Tests" label and remove our tracking of these tests from our cdash_analyze_and_report.py tool.
Missing tests for issue #6124 for testing day 2019-11-03
Tests with issue trackers Missing
|
Remove our tracking of these tests with the TrilinosATDMStatus commit:
|
@trilinos/rol, if there is no desire to fix these tests so they don't time out in these builds, then we should just close this issue. |
NOTE: As shown in this query, the tests:
are also timing out in the build:
on 'vortex':
Therefore, I will disable these tests in that build too. |
@bartlettroscoe , yes you can go ahead and disable them. A word of warning: These tests have very little to do with ROL @trilinos/rol performance. They reflect on the performance of our new linear algebra / solver stack, including Tpetra @trilinos/tpetra , Amesos2 @trilinos/amesos2 , etc. So if these tests ever passed on these platforms, but are now timing out, that would imply slowdowns in Trilinos' solver stack. If I were a developer for those packages, I would be concerned. If the tests never passed, then they can be ignored. Most tests that only use ROL components run in milliseconds. |
@dridzal, note that I have just disabled these timing out tests in debug builds so performance is not an issue in debug builds. |
These are close enough to being clean (especially after disabling tests associated with trilinos#6009 and trilinos#6124) that we can promote these to the 'Specialized' CDash group and monitor them for final cleanup. Some of the cuda+gnu+dbg failures are related to MPI_Init() and CUDA-Aware MPI (see trilinos#5855
…s:develop' (2bfd2c7). * trilinos-develop: (177 commits) Add a fix for a stk cmake file Promote atdm ats2 gnu+dbg and cuda+gnu+dbg to 'Specialized' (CDOFA-72) Intrepid2: remove unnecessary finalize calls in unit tests Disable STEQR() LAPACK test on ats2 deug builds (trilinos#2410, trilinos#6166) Disable some timing out ROL tests (trilinos#6124) Disable timing out Tempus tests on ats2 (trilinos#6009) fixed some broken teuchos unit tests and removed missed deprecated methods Promoting ats2+gnu+opt build which is 100% clean (CDOFA-27) removed deprecated overload of << in SerialDenseMatrix, SerialBandDenseMatrix, SerialSymDenseMatrix, and SerialDenseVector removed deprecated Teuchos::Comm helpers reduceAll and scan that take pointers to return arguments removed deprecated MPITraits class removed deprecated ArrayArg class removed deprecated LAPACK::GEBAL method that takes ilo and ihi by value removed deprecated LAPACK::POSVX and LAPACK::GESVX methods that take EQUED by value removed deprecated LAPACK::TREXC method that takes ifst and ilst by value removed deprecated count method in ArrayRCP, RCP, and RCPNode removed deprecated PerformanceMonitorBase::clearTimer methods Intrepid2: Temporarily disabling tests failing on some machines (Issue trilinos#6246) Remove misspelled RTop_HIDE_DEPRECATED_CODE (trilinos#6217) Disable/hide deprecated code (trilinos#6217) ...
…s:develop' (2bfd2c7). * trilinos-develop: (186 commits) zoltan2: upgrading testing for issues fixed in trilinos#6375 tpetra: disable kokkos warnings in initialize tests Tacho - disable matrix market reader/writer test to improve PR test stability. kokkos: cmake fixes for clang +/- cuda kokkos/cmake/kokkos_arch.cmake: Fix for clang + NO cuda Fix some scopes in nlnml_nonlinearlevel.cpp Zoltan2: fix reversal of Cuthill McKee ordering Add a fix for a stk cmake file Promote atdm ats2 gnu+dbg and cuda+gnu+dbg to 'Specialized' (CDOFA-72) Intrepid2: remove unnecessary finalize calls in unit tests Disable STEQR() LAPACK test on ats2 deug builds (trilinos#2410, trilinos#6166) Disable some timing out ROL tests (trilinos#6124) Disable timing out Tempus tests on ats2 (trilinos#6009) Intrepid2: reenabling JacobiLegendrePolynomial_Tests and Hierarchical_Basis_Tests. fixed some broken teuchos unit tests and removed missed deprecated methods Promoting ats2+gnu+opt build which is 100% clean (CDOFA-27) removed deprecated overload of << in SerialDenseMatrix, SerialBandDenseMatrix, SerialSymDenseMatrix, and SerialDenseVector removed deprecated Teuchos::Comm helpers reduceAll and scan that take pointers to return arguments removed deprecated MPITraits class removed deprecated ArrayArg class ...
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. |
This issue was closed due to inactivity for 395 days. |
CC: @trilinos/rol, @rppawlo (Trilinos Nonlinear Solvers Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
Next: Try setting
ROL_ENABLE_Intrepid=OFF
in the ATDM Trilinos configuration and see if the ROL test failures go away?
Description
As shown in this query, over the two testing days 2019-10-17 and 2019-10-18 there were 147 failing and timing out
ROL_example_PDE-OPT_
tests involving the following 13 tests:
ROL_example_PDE-OPT_0ld_adv-diff-react_example_01_MPI_4
ROL_example_PDE-OPT_0ld_adv-diff-react_example_02_MPI_4
ROL_example_PDE-OPT_0ld_poisson_example_01_MPI_4
ROL_example_PDE-OPT_0ld_stefan-boltzmann_example_03_MPI_4
ROL_example_PDE-OPT_helmholtz_example_02_MPI_1
ROL_example_PDE-OPT_navier-stokes_example_01_MPI_4
ROL_example_PDE-OPT_navier-stokes_example_02_MPI_4
ROL_example_PDE-OPT_nonlinear-elliptic_example_01_MPI_4
ROL_example_PDE-OPT_nonlinear-elliptic_example_02_MPI_4
ROL_example_PDE-OPT_obstacle_example_01_MPI_4
ROL_example_PDE-OPT_stefan-boltzmann_example_01_MPI_4
ROL_example_PDE-OPT_stefan-boltzmann_example_03_MPI_4
ROL_example_PDE-OPT_topo-opt_poisson_example_01_MPI_4
in the following 6 'waterman' builds:
Trilinos-atdm-waterman_cuda-9.2_fpic_static_opt
Trilinos-atdm-waterman_cuda-9.2_shared_opt
Trilinos-atdm-waterman-cuda-9.2-debug
Trilinos-atdm-waterman-cuda-9.2-opt
Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug
Trilinos-atdm-waterman-cuda-9.2-release-debug
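For anyone who wants to take the 13 tests above out of a local configure while investigating, here is a minimal sketch that prints one configure argument per test. It assumes the TriBITS per-test cache variable `<fullTestName>_DISABLE` is honored for these tests in the build being used. The helper name `disable_args` is hypothetical and is not part of Trilinos or TriBITS.

```python
# Sketch only: emit TriBITS-style "-D<fullTestName>_DISABLE=ON" arguments
# for the 13 tests listed in this issue. Assumes the <fullTestName>_DISABLE
# cache variable applies to these tests; disable_args is a hypothetical helper.

FAILING_TESTS = [
    "ROL_example_PDE-OPT_0ld_adv-diff-react_example_01_MPI_4",
    "ROL_example_PDE-OPT_0ld_adv-diff-react_example_02_MPI_4",
    "ROL_example_PDE-OPT_0ld_poisson_example_01_MPI_4",
    "ROL_example_PDE-OPT_0ld_stefan-boltzmann_example_03_MPI_4",
    "ROL_example_PDE-OPT_helmholtz_example_02_MPI_1",
    "ROL_example_PDE-OPT_navier-stokes_example_01_MPI_4",
    "ROL_example_PDE-OPT_navier-stokes_example_02_MPI_4",
    "ROL_example_PDE-OPT_nonlinear-elliptic_example_01_MPI_4",
    "ROL_example_PDE-OPT_nonlinear-elliptic_example_02_MPI_4",
    "ROL_example_PDE-OPT_obstacle_example_01_MPI_4",
    "ROL_example_PDE-OPT_stefan-boltzmann_example_01_MPI_4",
    "ROL_example_PDE-OPT_stefan-boltzmann_example_03_MPI_4",
    "ROL_example_PDE-OPT_topo-opt_poisson_example_01_MPI_4",
]


def disable_args(test_names):
    """Return one '-D<name>_DISABLE=ON' configure argument per test name."""
    return [f"-D{name}_DISABLE=ON" for name in test_names]


if __name__ == "__main__":
    # Print the arguments one per line so they can be pasted into a configure command.
    for arg in disable_args(FAILING_TESTS):
        print(arg)
```

The printed arguments would be appended to the usual `cmake` configure line for the build. This only hides the failures locally; it does not address why the tests fail on CUDA.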
As shown in this query, of the total 147 tests over those two days there were 15 timeouts which included the tests:
ROL_example_PDE-OPT_ginzburg-landau_example_01_MPI_4
(only on 2019-10-17)
ROL_example_PDE-OPT_helmholtz_example_02_MPI_1
(both days but only in build
Trilinos-atdm-waterman-cuda-9.2-debug
)
ROL_example_PDE-OPT_navier-stokes_example_01_MPI_4
(both days in all 6 'waterman' cuda builds)
The non-timing-out tests over the two days are shown in this query, which shows there were 132 failing non-timing-out tests. Many of those failing tests showed the error:
According to this query, all 132 of those test failures showed:
This query shows that most of the
ROL_example_PDE-OPT_XXX
tests pass on these 'waterman' CUDA builds. It is only the 13 tests listed above that are having problems.
The new commits that were pulled on the day these failures started, 2019-10-17, are shown, for example, here. That shows there were stk commits pulled in from PR #6098 and slight changes to the ATDM Trilinos configuration in PRs #6104 and #6105 that brought in the commits:
Current Status on CDash
Steps to Reproduce
One should be able to reproduce this failure on the machine 'waterman' as described in:
More specifically, the commands given for the system 'waterman' are provided at:
The exact commands to reproduce this issue, for the build
Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug
for example, should be: