PanzerAdaptersSTK_[Mixed]CurlLaplacianExample tests failing in Trilinos-atdm-waterman-cuda-9.2-release-debug build #3939
Comments
@mperego and @rppawlo, as of now, these are the only failures stopping us from promoting the build Also note that this is the only ATDM CUDA build of Trilinos that enables optimized compiler options and turns on some Kokkos debug-mode runtime checking. We are seeing failures like shown here:
and here showing:
and here showing:
and here showing:
So, several different types of failures. I will try setting up a 'release-debug' build on 'white'/'ride' to see what happens. |
So did this just start happening recently? I thought we had things cleaned up. There was a push to panzer this morning but I'm guessing these tests are from last night? |
@rppawlo, no, these have been failing for several days as shown in the table below. (Click the numbers and see the history.) As I noted above, this is a bit of a special build in that it uses optimized compiler options (
|
Just had a quick talk with @egphill . We think the best thing to do is disable these failing tests for all cuda builds for now. The same tests are passing for lower order and it looks like high order has a memory issue. We will try to address soon. |
@rppawlo, okay. So can you disable these on your end or do you want us to disable these for the ATDM CUDA builds as per: Or you can use that approach and post a PR. |
I talked with @fryeguy52 and he is going to create a PR to disable these tests in all CUDA builds for now using the approach documented at: |
Thanks @bartlettroscoe and @fryeguy52 ! |
issue trilinos#3939 Disabled: PanzerAdaptersSTK_MixedCurlLaplacianExample PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Tri-Order-1 PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Tri-Order-2 PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Quad-Order-3 PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-2
PR #3962 was merged to 'develop' on 11/29/2018 that should disable these tests in all CUDA builds. But due to TRIL-237, we have not gotten any results from CDash since 11/27/2018. But, we can see from other builds on 11/30/2018 on CDash like here that show these four tests as missing:
Therefore, I think we can assume these tests will be disabled in this CUDA build on waterman as well once we get that back running again. Therefore, I think we can add the "Disabled Tests" label. @rppawlo, but I would assume that we should not close this issue because you will want to fix these tests and re-enable them? |
yes - but there's no rush (or manpower) at the moment. Do you want to take it off the ATDM board? |
@rppawlo said:
I added the "Disabled Tests" label so this issue has disappeared from the links in the "Unresolved ATDM Trilinos GitHub Issues in your Trilinos Product Area" email that goes out every Monday morning. You can see the full set of ATDM Trilinos issues that have been addressed by disabling all of the tests related to that issue and have the label "Disabled Tests" in the link |
@rppawlo, does the fact that these tests are disabled mean that functionality used by EMPIRE is not being tested? |
This was done for empire - the commit that added these tests was actually done by an empire developer. However, it is experimental and they are just evaluating at this point. They don't plan to use on the gpu any time soon. So disabling for gpu builds is fine. If this pans out and they decide to use in production empire, then we will make fixing for gpu a high priority. |
@rppawlo, thanks for the explanation. Based on this I added the "ATDM Nonblocker" label as per ATDM Severity/Criticality Labels. |
…ebug-pt build (trilinos#2464, trilinos#3633) We really need to switch most of these 'debug' builds to 'release-debug' builds (see trilinos#3633). Also, the Trilinos CUDA PR build really needs to be a cuda-9.2-release-debug build since that runs more tests and catches more issues than either a cuda-9.2-opt or cuda-9.2-debug build (see trilinos#3939).
@rppawlo, for all CUDA builds like we did the others or just for this CUDA build? |
all cuda builds is fine for now. |
@rppawlo said:
@fryeguy52, I think you need to disable all of the Panzer tests in all CUDA builds on all platforms shown in this query. |
you could just disable in debug-release for now and leave opt-release enabled. The failures look to be triggered from debug checks. I'm hoping to get to this before the break. |
@rppawlo, does this mean that the code is fine and the debug checks are defective, or does it mean that the code has a defect but passes anyway when the debug checks are not run? |
If I could answer that, the fix would be done :) Hope to get to this soon. |
Indeed :-) @fryeguy52, for now please disable the new failing tests shown in this query just for this one 'release-debug' build. |
see issue trilinos#3939
This disables tests:
* PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-1
* PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-3
* PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-4
* PanzerAdaptersSTK_CurlLaplacianMultiblockExample-ConvTest-Quad-Order-1
* PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Quad-Order-1
* PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Quad-Order-2
* PanzerAdaptersSTK_MixedCurlLaplacianMultiblockExample-ConvTest-Quad-Order-1
in the build `Trilinos-atdm-waterman-cuda-9.2-release-debug`
This moves some panzer test disables that were affecting all cuda builds to only affect the waterman release-debug build (issue trilinos#3939)
see issue trilinos#3939
This disables tests:
* PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-1
* PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-3
* PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-4
* PanzerAdaptersSTK_CurlLaplacianMultiblockExample-ConvTest-Quad-Order-1
* PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Quad-Order-1
* PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Quad-Order-2
* PanzerAdaptersSTK_MixedCurlLaplacianMultiblockExample-ConvTest-Quad-Order-1
in the build `Trilinos-atdm-waterman-cuda-9.2-release-debug`
This also moves some panzer test disables that were affecting all cuda builds to only affect the waterman release-debug build
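For readers who want to try the same disables in a local configure, here is a minimal sketch. It assumes the TriBITS convention that a test can be switched off by setting a cache variable named `<fullTestName>_DISABLE` to `ON`; whether the full test names in a given build carry an extra suffix is not confirmed here. The helper name `disable_args` is hypothetical and is not part of Trilinos.

```python
# Sketch only: turn the test names from the commit message above into
# CMake cache arguments of the form -D<fullTestName>_DISABLE=ON.
# Assumption: the names listed in the commit are the full TriBITS test names.

DISABLED_TESTS = [
    "PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-1",
    "PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-3",
    "PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-4",
    "PanzerAdaptersSTK_CurlLaplacianMultiblockExample-ConvTest-Quad-Order-1",
    "PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Quad-Order-1",
    "PanzerAdaptersSTK_MixedCurlLaplacianExample-ConvTest-Quad-Order-2",
    "PanzerAdaptersSTK_MixedCurlLaplacianMultiblockExample-ConvTest-Quad-Order-1",
]


def disable_args(test_names):
    """Return one -D<name>_DISABLE=ON argument per test name (hypothetical helper)."""
    return [f"-D{name}_DISABLE=ON" for name in test_names]


if __name__ == "__main__":
    # Print the arguments so they can be pasted onto a cmake command line.
    for arg in disable_args(DISABLED_TESTS):
        print(arg)
```

The actual change in the repository was made in the ATDM build configuration rather than on the command line, so this is only a way to mirror the effect in a one-off local build.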
With the merge of PR #4079 to 'develop' on 12/19/2018, these tests should now be disabled in this build Actually, since we are still running the ATDM Trilinos builds against raw 'develop', the build |
The Panzer tests in the build
These were disabled along with the tests already disabled shown in the table below. Since all of the Panzer tests have passed for the last two days with all of these disables, I have added the "Disabled Tests" label to get this off of our main list of ATDM Trilinos bug issues. But I will leave this open as a reminder that these still need to be fixed. Tests with issue trackers Missing: twim=5 (On 2018-12-20)
|
So I spent some time on this and will need some help from the kokkos or testbeds teams. I can run the tests manually and they all consistently pass. I can run the tests with ctest -j1 and they consistently all pass. The only way I can trigger the failure is to get multiple executables running at the same time by using ctest -j16. This seems like multiple executables are corrupting memory on the cuda. @crtrott @nmhamster
|
Should also comment that cuda memcheck comes back clean when I run manually. |
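The serial-versus-parallel comparison described above can be scripted roughly as follows. This is a sketch under stated assumptions: the test-name regex `PanzerAdaptersSTK_` and the helper names `ctest_command` and `count_failing_runs` are illustrative and do not come from the issue; only the `ctest -j1` and `ctest -j16` settings are taken from the comment. The `-j`, `-R`, and `--output-on-failure` options are standard ctest flags.

```python
import shlex
import subprocess


def ctest_command(jobs, test_regex="PanzerAdaptersSTK_"):
    """Build a ctest command that runs tests matching test_regex with `jobs` parallel slots."""
    return ["ctest", "-j", str(jobs), "-R", test_regex, "--output-on-failure"]


def count_failing_runs(build_dir, jobs, repeats=5):
    """Run the matching tests `repeats` times in build_dir and count runs with any failure.

    ctest exits with a nonzero status when at least one test fails.
    """
    failing = 0
    for _ in range(repeats):
        result = subprocess.run(ctest_command(jobs), cwd=build_dir)
        if result.returncode != 0:
            failing += 1
    return failing


if __name__ == "__main__":
    # Only print the two command lines; call count_failing_runs() from an
    # actual build directory to execute them.
    for jobs in (1, 16):
        print(shlex.join(ctest_command(jobs)))
```

If the `-j1` runs stay clean while the `-j16` runs fail intermittently, that matches the observation that the failures appear only when several test executables share the GPU at the same time.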
@rppawlo, this might be a good use case to experiment with the improved binding approach being discussed in: In particular, see this comment. That looks very promising. |
…s:develop' (7db7806). * trilinos-develop: (23 commits) Fix cmake-file error in stk_balance that was making the m2n exe be a test. tpetra: minor fix; return the values Fix incorrect line length in copy_string change Automatic snapshot commit from seacas at f9bf59a SEACAS: cgns - support self-looping models Disable failing ROL test already known to fail in CUA builds (trilinos#3543) Disable known failing Panzer tests (trilinos#3632) Small formatting change to comment (trilinos#3939) Enable SPARC TPLs and packages on 'waterman' (ATDV-151) ShyLU/FROSch: Correct use of booleans for interface components Don't allow Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt to run on 'ride7' (ATDV-155) tpetra: minor additional deprecations trilinos#4839 MiniEM: Fix discrete gradient tpetra: changes to address Mark's comments on trilinos#4839 Xpetra: MueLu: fix issue 4038 ShyLU/FROSch: Use insertGlobalValues instead of insertLocalValues for GlobalCoarseMatrix stokhos: fix compilation error due to tpetra deprecation changes Thyra: fixed compilation error due to deprecation changes tpetra: More deprecations of function arguments involving Node. create*MapWithNode generate_miniFM_* Tpetra: removing Node from argument lists of functions Completed MatrixMarket_Tpetra functions (readSparse, readDense, etc.) Also removed a few compiler warnings reported in clang ...
…s:develop' (7db7806). * trilinos-develop: (30 commits) Fix cmake-file error in stk_balance that was making the m2n exe be a test. Tpetra: Global Ordinal validation tpetra: minor fix; return the values Fix incorrect line length in copy_string change Tpetra: Moved GORDS logic to right file this time, really. Tpetra: GORDS Deprecation Cleanup Tpetra: Relocated # GORDS validation logic to packages/tpetra/core/CMakeLists.txt Tpetra: clean up deprecation WIP tags Tpetra: Add deprecations for global ordinal types Automatic snapshot commit from seacas at f9bf59a SEACAS: cgns - support self-looping models Disable failing ROL test already known to fail in CUA builds (trilinos#3543) Disable known failing Panzer tests (trilinos#3632) Small formatting change to comment (trilinos#3939) Enable SPARC TPLs and packages on 'waterman' (ATDV-151) ShyLU/FROSch: Correct use of booleans for interface components Don't allow Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt to run on 'ride7' (ATDV-155) tpetra: minor additional deprecations trilinos#4839 Ifpack2 - fix issue 4858 MiniEM: Fix discrete gradient ...
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. |
This issue was closed due to inactivity for 395 days. |
CC: @trilinos/panzer, @mperego (Trilinos Discretizations Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
With the merge of PR #4079 to 'develop' on 12/19/2018, these tests should now be disabled in this build `Trilinos-atdm-waterman-cuda-9.2-release-debug`. All tests that should be disabled were disabled on 12/19/2018 and all of the Panzer tests in this build passed on 12/19/2018 and 12/20/2018.
Description
As shown in this query the tests:
are failing in the build:
Test names above link to the test output
Current Status on CDash
The current status of these tests/builds for the current testing day can be found here
Steps to Reproduce
One should be able to reproduce this failure on waterman as described in:
More specifically, the commands given for waterman are provided at:
The exact commands to reproduce this issue should be: