Teko_testdriver_tpetra_MPI_4 randomly timing out/hanging in all CUDA 'waterman' builds starting 2019-11-14 #6463
Comments
For example, for the build
But compare this to the runtimes for the test
I wonder if this is due to the CPU and/or GPUs getting hot and throttling back speed? Steven Oliver suggested this is happening on these types of machines and it is mentioned in: Can such factors of 5x or more differences in runtime be caused by such issues on these machines? Is this just par-for-the-course when testing on these types of systems? |
Unless there is an objection, I will just disable this randomly timing out test in all of the ATDM Trilinos builds on 'waterman'. This test will still run on the builds on 'ride' and will also run on 'vortex' once we get those builds up and running. |
No objection
On Jan 6, 2020, at 9:24 AM, Roscoe A. Bartlett <[email protected]> wrote:
Unless there is an objection, I will just disable this randomly timing out test in all of the ATDM Trilinos builds on 'waterman'. This test will still run on the builds on 'ride' and will also run on 'vortex' once we get those builds up and running.
@eric-c-cyr, I created PR #6600 for this disable. Can you approve that PR? In any case, I manually merged that branch to the 'atdm-nightly' branch for tomorrow in commit |
…man-disable-teko-test Automatically Merged using Trilinos Pull Request AutoTester PR Title: Disable Teko_testdriver_tpetra_MPI_4 in all atdm 'waterman' builds (#6463) PR Author: bartlettroscoe
With the merge of #6600, we can add the "Disabled Tests" label and remove this issue from our main list of issues. @trilinos/teko, if there is no interest in trying to get this test running better on 'waterman' (and I don't know why there would be since it runs fine everywhere else including 'ride' and on 'sems-rhel7' builds with CUDA and GPUs), then can we just close this as "WONTFIX"? |
…s:develop' (d17489d). * trilinos-develop: SEACAS: go back to lib:fmt 6.0.0 until fix issue on vortex xl/cuda build Disable Teko_testdriver_tpetra_MPI_4 in all atdm 'waterman' builds (trilinos#6463) zoltan2: add missing include file for non-ETI builds Tpetra: Missed ifdef guard zoltan2: name change to prevent shadow warnings zoltan2: Change logic for determining gno types to use in tests Simplified now that Trilinos builds only one gno_t Tpetra: More stacked timer fixes Tpetra: Fix overflow for TpetraCore_MatrixMatrix_UnitTests tpetra: removed unused field from FixedHashTable This removes some warnings about calling host functions from host device functions trilinos#5698; E.g., warning: calling a __host__ function("std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string") from a __host__ __device__ function("Tpetra::Details::LocalMap<int, long long, ::Kokkos::Device< ::Kokkos::Serial, ::Kokkos::HostSpace> > ::~LocalMap [subobject]") is not allowed tpetra: when looking at trilinos#6158, I saw that, when creating a Vector from a MultiVector, we created subviews for comm buffers, but did not store them. This commit stores them. It also offsets the buffers by the vector j requested from the MultiVector.
…s:develop' (d17489d). * trilinos-develop: tpetra: In trilinos#6598, @mhoemmen recommended this change of offset SEACAS: go back to lib:fmt 6.0.0 until fix issue on vortex xl/cuda build Disable Teko_testdriver_tpetra_MPI_4 in all atdm 'waterman' builds (trilinos#6463) zoltan2: add missing include file for non-ETI builds Tpetra: Missed ifdef guard zoltan2: name change to prevent shadow warnings zoltan2: Change logic for determining gno types to use in tests Simplified now that Trilinos builds only one gno_t Tpetra: More stacked timer fixes Tpetra: Fix overflow for TpetraCore_MatrixMatrix_UnitTests tpetra: removed unused field from FixedHashTable This removes some warnings about calling host functions from host device functions trilinos#5698; E.g., warning: calling a __host__ function("std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string") from a __host__ __device__ function("Tpetra::Details::LocalMap<int, long long, ::Kokkos::Device< ::Kokkos::Serial, ::Kokkos::HostSpace> > ::~LocalMap [subobject]") is not allowed tpetra: when looking at trilinos#6158, I saw that, when creating a Vector from a MultiVector, we created subviews for comm buffers, but did not store them. This commit stores them. It also offsets the buffers by the vector j requested from the MultiVector.
…s:develop' (d17489d). * trilinos-develop: tpetra: In trilinos#6598, @mhoemmen recommended this change of offset Tpetra::CrsMatrix: Add Kokkos kernel labels; expose debug code Tpetra::CrsMatrix: Remove values2D_ Tpetra::CrsGraph: Remove gblInds2D_ Tpetra::CrsGraph: Remove lclInds2D_ Tpetra::CrsMatrix: Remove unused method allocateValues2D Tpetra: Use verbosePrintCountThreshold in copyOffsets Tpetra::Details::Behavior: Add longRowMinNumEntries Tpetra::Details::Behavior: Factor out size_t reading SEACAS: go back to lib:fmt 6.0.0 until fix issue on vortex xl/cuda build Disable Teko_testdriver_tpetra_MPI_4 in all atdm 'waterman' builds (trilinos#6463) zoltan2: add missing include file for non-ETI builds Tpetra: Missed ifdef guard zoltan2: name change to prevent shadow warnings zoltan2: Change logic for determining gno types to use in tests Simplified now that Trilinos builds only one gno_t Tpetra: More stacked timer fixes Tpetra: Fix overflow for TpetraCore_MatrixMatrix_UnitTests tpetra: removed unused field from FixedHashTable This removes some warnings about calling host functions from host device functions trilinos#5698; E.g., warning: calling a __host__ function("std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string") from a __host__ __device__ function("Tpetra::Details::LocalMap<int, long long, ::Kokkos::Device< ::Kokkos::Serial, ::Kokkos::HostSpace> > ::~LocalMap [subobject]") is not allowed Framework: updating the autotester env to remove dependency on atdm-env tpetra: when looking at trilinos#6158, I saw that, when creating a Vector from a MultiVector, we created subviews for comm buffers, but did not store them. This commit stores them. It also offsets the buffers by the vector j requested from the MultiVector.
…s:develop' (d17489d). * trilinos-develop: (29 commits) Tpetra: Correcting unecessary extraction of remotes for A matrix in Jacobi Tpetra: MMM Modifications to avoid remotemap construction in serial Ifpack2: pass correct "symmetric" flag to MTSGS setup MueLu: Adding timer granularity to SaP Fix const_cast in Tpetra, Amesos and Ifpack2 SEACAS: Remove locale setting Teuchos utils: fix another stacked timer plotting bug tpetra: In trilinos#6598, @mhoemmen recommended this change of offset Tpetra::CrsMatrix: Add Kokkos kernel labels; expose debug code Tpetra::CrsMatrix: Remove values2D_ Tpetra::CrsGraph: Remove gblInds2D_ Tpetra::CrsGraph: Remove lclInds2D_ Tpetra::CrsMatrix: Remove unused method allocateValues2D Tpetra: Use verbosePrintCountThreshold in copyOffsets Tpetra::Details::Behavior: Add longRowMinNumEntries Tpetra::Details::Behavior: Factor out size_t reading SEACAS: go back to lib:fmt 6.0.0 until fix issue on vortex xl/cuda build Disable Teko_testdriver_tpetra_MPI_4 in all atdm 'waterman' builds (trilinos#6463) zoltan2: add missing include file for non-ETI builds Tpetra: Missed ifdef guard ...
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. |
This issue was closed due to inactivity for 395 days. |
CC: @trilinos/teko, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
Description
As shown in this query and this query over the time period [2019-08-01, 2019-12-17], the test:
Teko_testdriver_tpetra_MPI_4
in the builds:
Trilinos-atdm-waterman_cuda-9.2_fpic_static_opt
Trilinos-atdm-waterman_cuda-9.2_shared_opt
Trilinos-atdm-waterman-cuda-9.2-debug
Trilinos-atdm-waterman-cuda-9.2-opt
Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug
Trilinos-atdm-waterman-cuda-9.2-release-debug
started randomly timing out (or hanging) on testing day 2019-11-14. It randomly timed out (or hung) a total of 29 times between 2019-11-14 and 2019-12-17 in those 6 builds. (Therefore, it times out only occasionally).
As shown in this query, the runtimes for this test over the time period [2019-11-10, 2019-12-17] are very erratic. When it passes, it can finish in just over a minute, but it can also take anywhere between that lowest time and the 10-minute timeout. Since these CUDA builds now all use
ctest -j4
(see #6051) and since this is an MPI-4 test, this test should be running by itself on its own node on its own GPU. So the big differences may be due to non-determinism in the code. There is no sense in looking for commits that could have triggered this, as this is a rarely, randomly failing test.
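One way to put a number on "very erratic" is to compare the slowest and fastest passing runtimes and count the timeouts separately. The following is only a minimal sketch: the record layout (status, seconds), the function name summarize_runtimes, and the sample values are all hypothetical and are not taken from the CDash queries referenced above. Only the 10-minute timeout comes from this issue.

```python
from statistics import median

TIMEOUT_SECONDS = 600.0  # the 10-minute limit mentioned in this issue


def summarize_runtimes(runs):
    """Summarize a list of (status, seconds) records for one test.

    The record layout is an assumption made for this sketch:
    status is either "Passed" or "Timeout".
    """
    passed = [secs for status, secs in runs if status == "Passed"]
    num_timeouts = sum(1 for status, _ in runs if status == "Timeout")
    return {
        "num_runs": len(runs),
        "num_timeouts": num_timeouts,
        "min_passed": min(passed) if passed else None,
        "median_passed": median(passed) if passed else None,
        "max_passed": max(passed) if passed else None,
        # Ratio of slowest to fastest passing run: a rough measure of
        # how erratic the runtime is when the test does pass.
        "max_over_min": max(passed) / min(passed) if passed else None,
    }


# Made-up values for illustration only (NOT real CDash data).
sample_runs = [
    ("Passed", 65.0),
    ("Passed", 130.0),
    ("Passed", 390.0),
    ("Timeout", TIMEOUT_SECONDS),
]

print(summarize_runtimes(sample_runs))
```

Keeping timeouts out of the min/max statistics avoids mixing the capped 600-second value with genuine runtimes, so the ratio reflects only runs that actually completed.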
Current Status on CDash
Steps to Reproduce
One should be able to reproduce this failure on the machine 'waterman' as described in:
More specifically, the commands given for the system 'waterman' are provided at:
The exact commands to reproduce this issue should be:
However, given this is a rarely randomly failing test, this could be hard to reproduce.
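Because the failure is rare, a single run is unlikely to show it, so one option is to run the test many times and tally the outcomes. The sketch below is a generic repeat-runner under stated assumptions: run_repeatedly is a hypothetical helper, and the command shown is a harmless stand-in, not the actual reproduction command for 'waterman' (which is not given above). In a real build directory one might pass something like ["ctest", "-R", "Teko_testdriver_tpetra_MPI_4"] instead, but that is an assumption about the local setup.

```python
import subprocess
import sys


def run_repeatedly(cmd, repeats, timeout_seconds):
    """Run `cmd` `repeats` times and classify each run.

    Returns a list containing "passed", "failed", or "timeout" per run.
    """
    outcomes = []
    for _ in range(repeats):
        try:
            result = subprocess.run(
                cmd,
                timeout=timeout_seconds,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            outcomes.append("passed" if result.returncode == 0 else "failed")
        except subprocess.TimeoutExpired:
            # subprocess.run kills the direct child on timeout; processes
            # spawned by an MPI launcher may need separate cleanup.
            outcomes.append("timeout")
    return outcomes


if __name__ == "__main__":
    # Stand-in command for illustration; replace with the real test command.
    stand_in_cmd = [sys.executable, "-c", "pass"]
    results = run_repeatedly(stand_in_cmd, repeats=5, timeout_seconds=600)
    print({kind: results.count(kind) for kind in ("passed", "failed", "timeout")})
```

With 29 timeouts spread over roughly a month of nightly runs in 6 builds, a large repeat count would likely be needed before a timeout shows up.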