MueLu_ParameterListInterpreterXXX tests appear to be randomly failing with GCC 7.2.0 build #2311

Closed
bartlettroscoe opened this issue Feb 28, 2018 · 37 comments
Labels
client: ATDM Any issue primarily impacting the ATDM project Disabled Tests Issue has been partially addressed by disabling *all* of the failing tests related to the issue PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area pkg: MueLu resolved: wontfix The development team cannot or will not address this issue type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Feb 28, 2018

CC: @trilinos/muelu

Next Action Status

These tests were disabled in this build in commit e872708 merged to 'develop' on 6/25/2018 as part of PR #3011. Next: MueLu developers fix offline and then re-enable in this build if they desire ...

Description

The tests MueLu_ParameterListInterpreterTpetra_MPI_1 and MueLu_ParameterListInterpreterTpetraHeavy_MPI_1 appear to be failing randomly from the GCC build Trilinos-atdm-sems-gcc-7-2-0 as shown at:

(sort by "Build Name", then "Test Name", then "Build Time").

For example, the test MueLu_ParameterListInterpreterTpetra_MPI_1 failed as shown at:

It shows the failures:

Testing: MLParameterListInterpreter/MLsmoother1.xml
Binary files Output/MLsmoother1_tpetra.gold_filtered and Output/MLsmoother1_tpetra.out_filtered differ

...


Testing: MLParameterListInterpreter/MLunsmoothed1.xml
--- Output/MLunsmoothed1_tpetra.gold_filtered	2018-02-28 09:24:11.570595286 -0700
+++ Output/MLunsmoothed1_tpetra.out_filtered	2018-02-28 09:24:11.582595434 -0700
@@ -280,7 +280,7 @@
  matrixmatrix: kernel params -> 
   [empty list]
  
- Setup Smoother (MueLu::Amesos2Smoother{type = <ignored>})
+ Setup Smoother (MueLu::Amesos2Smoother{type = Klu})
  keep smoother data = 0   [default]
  PreSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
  PostSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
@@ -311,6 +311,6 @@
 
 Smoother (level 3) both : "Ifpack2::Relaxation": {Initialized: true, Computed: true, Type: Symmetric Gauss-Seidel, sweeps: 2, damping factor: 1, Global matrix dimensions: [371, 371], Global nnz: 1111}
 
-Smoother (level 4) pre  : <Direct> solver interface
+Smoother (level 4) pre  : KLU2 solver interface
 Smoother (level 4) post : no smoother
 
MLParameterListInterpreter/MLunsmoothed1.xml : failed
Testing: MLParameterListInterpreter/MLsmoother4.xml
--- Output/MLsmoother4_tpetra.gold_filtered	2018-02-28 09:24:12.279604031 -0700
+++ Output/MLsmoother4_tpetra.out_filtered	2018-02-28 09:24:12.277604006 -0700
@@ -284,7 +284,7 @@
  matrixmatrix: kernel params -> 
   [empty list]
  
- Setup Smoother (MueLu::Amesos2Smoother{type = <ignored>})
+ Setup Smoother (MueLu::Amesos2Smoother{type = Klu})
  keep smoother data = 0   [default]
  PreSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
  PostSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
MLParameterListInterpreter/MLsmoother4.xml : failed

Steps to Reproduce

Using the do-configure script:

#!/bin/bash
cmake \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/sems/atdm/SEMSATDMSettings.cmake,cmake/std/MpiReleaseDebugSharedPtSettings.cmake,cmake/std/BasicCiTestingSettings.cmake \
-DDART_TESTING_TIMEOUT:STRING=300.0 \
-DTrilinos_ENABLE_TESTS:BOOL=ON \
-DCTEST_BUILD_FLAGS=-j10 \
-DCTEST_PARALLEL_LEVEL=10 \
"$@" \
$TRILINOS_DIR

Anyone should be able to reproduce these builds and run these tests on any SNL COE RHEL6 machine as shown below:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/sems/atdm/load_atdm_7.2_dev_env.sh

$ ./do-configure -DTrilinos_ENABLE_MueLu=ON

$ make -j16

$ ctest -j16

However, given that tests seem to be randomly failing, it may be hard to reproduce these failures.
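
Since a single local run may well pass, one way to hunt for an intermittent failure is to rerun the test many times and stop at the first failure. The helper below is only a sketch: the function name repeat_until_fail and the run count are made up for illustration, and it does not reproduce the parallel load of the nightly builds. CTest also has a --repeat-until-fail option that serves a similar purpose.

```shell
# Run a command up to max_runs times and stop at the first failure.
# Usage: repeat_until_fail <max_runs> <command> [args...]
# (hypothetical helper, not part of Trilinos)
repeat_until_fail() {
  max_runs=$1
  shift
  run=1
  while [ "$run" -le "$max_runs" ]; do
    if ! "$@"; then
      echo "run $run of $max_runs FAILED: $*"
      return 1
    fi
    run=$((run + 1))
  done
  echo "all $max_runs runs passed: $*"
}

# Possible usage from the build directory (not executed here):
#   repeat_until_fail 50 ctest -R MueLu_ParameterListInterpreterTpetra_MPI_1
```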

Related Issues

@bartlettroscoe bartlettroscoe added pkg: MueLu client: ATDM Any issue primarily impacting the ATDM project labels Feb 28, 2018
@mhoemmen
Contributor

@trilinos/muelu This looks harmless -- it seems like the test expects certain log output, and is only different because the sparse direct factorization has changed, not necessarily because the test actually failed. We would just need to change the MueLu "gold" log output file.

@jhux2
Member

jhux2 commented Feb 28, 2018

I agree that this is harmless. MueLu's diff tool should already be ignoring the direct solver (notice the "<ignored>" in the gold output). I'm curious why it did not in this case.
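
For readers unfamiliar with this kind of filtering, here is a rough sketch of how solver names could be masked before the diff. The sed expressions and the function name filter_solver_names are assumptions written against the diff output quoted above; they are not MueLu's actual filter.

```shell
# Mask the direct-solver name so the comparison does not depend on which
# sparse direct solver the build happens to use.
# (illustrative patterns only, not the filter MueLu ships)
filter_solver_names() {
  sed -e 's/Amesos2Smoother{type = [^}]*}/Amesos2Smoother{type = <ignored>}/' \
      -e 's/^\(Smoother (level [0-9]*) pre  : \).* solver interface$/\1<Direct> solver interface/'
}
```

If a filter like this silently produced no output, the .out_filtered file would keep the raw "Klu" and "KLU2" strings or end up empty, and both would show up as a diff against the gold file.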

@bartlettroscoe
Member Author

If this is harmless, should we just disable this test for this automated build and only this automated build? If you do a local configure and run the tests, you would still see the failure though (which is good or bad depending on your primary concern).

@bartlettroscoe
Member Author

@jhux2,

Did a MueLu developer fix this?

@bartlettroscoe
Member Author

It looks like it got fixed:

The only new changes pulled today in that build shown at:

were

2797a9a:  Merge pull request #2308 from ibaned/issue-2306
Author: Dan Ibanez <[email protected]>
Date:   Wed Feb 28 10:37:21 2018 -0700

dded496:  Teuchos: put back Array, remove cpp scope guard
Author: Dan Ibanez <[email protected]>
Date:   Wed Feb 28 08:09:25 2018 -0700

M	packages/teuchos/parameterlist/src/Teuchos_YamlParser.cpp

29def6b:  Teuchos: throw exception at duplicate YAML names
Author: Dan Ibanez <[email protected]>
Date:   Tue Feb 27 14:40:16 2018 -0700

M	packages/teuchos/parameterlist/src/Teuchos_YamlParser.cpp
M	packages/teuchos/parameterlist/test/yaml/YamlParameterList.cpp

Could Dan's update of the YamlParser as part of #2306 and #2308 have accidentally fixed this test?

@jhux2
Member

jhux2 commented Mar 7, 2018

Did a MueLu developer fix this?

Not that I know of.

@bartlettroscoe
Member Author

bartlettroscoe commented Mar 13, 2018

Looking at the query:

which goes back to 1/30/2018, you can see that this test failed twice: once on 2/7/2018 and once on 2/28/2018. This fact, together with the error we saw in #2365 for the MueLu_ParameterListInterpreterTpetraHeavy_MPI_1 test failure that magically went away, seems to suggest that these MueLu_ParameterListInterpreterXXX tests might be a bit fragile and subject to non-deterministic failures.

If you look at this query:

and then sort by "Build Name", then "Test Name", then "Build Time", you can see that these two tests MueLu_ParameterListInterpreterTpetra_MPI_1 and MueLu_ParameterListInterpreterTpetraHeavy_MPI_1 do appear to be failing randomly with this GCC 7.2.0 build.

Therefore, I am going to change the title and scope of this story to match this.

@bartlettroscoe
Member Author

I went back and tried to reproduce the test failure MueLu_ParameterListInterpreterTpetraHeavy_MPI_1 shown at:

If you follow the links and look at the configure output, you will see this is for the version of Trilinos d2683f3. Therefore, I tried to reproduce this on the machine ceerws1113 by doing:

$ cd /scratch/rabartl/Trilinos.base/Trilinos/

$ git checkout d2683f3
...
HEAD is now at d2683f3... Tpetra: Adding import chaining utility (for use with MMM)

$ cd /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/

$ source /scratch/rabartl/Trilinos.base/Trilinos/cmake/std/sems/atdm/load_atdm_7.2_dev_env.sh

$ rm -r CMake*

$ ./do-configure -DTrilinos_ENABLE_MueLu=ON &> configure.out

$ make -j16

$ ctest -R MueLu_ParameterListInterpreterTpetraHeavy_MPI_1
...
1/1 Test #40: MueLu_ParameterListInterpreterTpetraHeavy_MPI_1 ...   Passed   67.51 sec

100% tests passed, 0 tests failed out of 1

Subproject Time Summary:
MueLu    =  67.51 sec*proc (1 test)

Total Test time (real) =  68.53 sec

(see do-configure script above in first comment).

The test passed so it looks like indeed this was a non-deterministic failure.

Could there be some file read/write race conditions in these tests? That is a common cause for randomly failing tests like this.

@bartlettroscoe bartlettroscoe changed the title MueLu_ParameterListInterpreterTpetra_MPI_1 newly failing in GCC 7.2.0 build starting 2/28/2018 MueLu_ParameterListInterpreterXXX tests appear to be randomly failing with GCC 7.2.0 build Mar 15, 2018
@bartlettroscoe bartlettroscoe added stage: in progress Work on the issue has started type: bug The primary issue is a bug in Trilinos code or tests and removed type: bug The primary issue is a bug in Trilinos code or tests labels Mar 15, 2018
@bartlettroscoe
Member Author

@trilinos/muelu developers,

Several of the MueLu_ParameterListInterpreterXXX tests continue to fail, seemingly at random, in the build Trilinos-atdm-sems-gcc-7-2-0 over the last several weeks as shown in the query:

I think these failures should be sending email to the muelu-regressions email list.

Therefore, to avoid spamming the MueLu developers (and myself) with these emails, we either need to fix these tests or disable them.

Please let me know what approach you would like to take. Simply letting them fail randomly is not an option.

@bartlettroscoe
Member Author

Looks like this test randomly failed in another ATDM build shown at:

which triggered a CDash email shown below.

@trilinos/muelu developers,

Please let me know how you want to address these seemingly randomly failing tests. I would just disable them in all of the ATDM builds but that might be avoiding testing some important functionality being used by ATDM applications.


From: CDash
Sent: Friday, April 06, 2018 3:56 AM
To: Bartlett, Roscoe A
Subject: FAILED (t=1): Trilinos/MueLu - Trilinos-atdm-white-ride-gnu-debug-openmp - ATDM

A submission to CDash for the project Trilinos has failing tests.
You have been identified as one of the authors who have checked in changes
that are part of this submission or you are listed in the default contact list.

Details on the submission can be found at
https://testing.sandia.gov/cdash/buildSummary.php?buildid=3485357

Project: Trilinos
SubProject: MueLu
Site: white
Build Name: Trilinos-atdm-white-ride-gnu-debug-openmp
Build Time: 2018-04-06T07:46:04 UTC
Type: ATDM
Tests failing: 1

Tests failing
MueLu_ParameterListInterpreterTpetra_MPI_1(https://testing.sandia.gov/cdash/testDetails.php?test=46282018&build=3485357)

-CDash on testing.sandia.gov

@mhoemmen
Contributor

mhoemmen commented Apr 7, 2018

@jhux2 (sometimes github is weird & doesn't get through e-mail filters)

@jhux2
Member

jhux2 commented Apr 9, 2018

Please let me know how you want to address these seemingly randomly failing tests. I would just disable them in all of the ATDM builds but that might be avoiding testing some important functionality being used by ATDM applications.

@bartlettroscoe I would love to fix these. However, it's unclear how we can do this ... since the failures are random. I am open to suggestions.

I do not want these tests to be disabled.

@bartlettroscoe
Member Author

@jhux2,

Do these tests involve file I/O where different tests could be reading and writing the same files? If so, that can cause races that result in random failures like this.

@jhux2
Member

jhux2 commented Apr 9, 2018

These tests involve file I/O, but the files are different for each test run. In other words, no two tests should be attempting to access the same file.

cgcgcg added a commit to cgcgcg/Trilinos that referenced this issue Apr 9, 2018
cgcgcg added a commit to cgcgcg/Trilinos that referenced this issue Apr 9, 2018
@bartlettroscoe
Member Author

These tests involve file I/O, but the files are different for each test run. In other words, no two tests should be attempting to access the same file.

@jhux2, okay thanks for the clarification on that.

I just saw the commits 38c73d2 and bb78b26 that might address this. If they do not, we will know in the following days.

The issue is that we can't have any randomly failing tests because that will defeat any automated processes that require 100% passing tests (like update of Trilinos for SPARC and/or EMPIRE).

cgcgcg added a commit that referenced this issue Apr 9, 2018
This might help debug issue #2311.

Build/Test Cases Summary
Enabled Packages: MueLu
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=503,notpassed=0 (79.33 min)
Other local commits for this build/test group: 9cf438d
@cgcgcg
Contributor

cgcgcg commented Apr 9, 2018

Just pushed a small change that checks the return value of the sed call; maybe this can help to check what's actually happening.

@bartlettroscoe
Member Author

Just pushed a small change that checks the return value of the sed call; maybe this can help to check what's actually happening.

Looks like that change did not address the problem. We got another one of these failures in the test MueLu_ParameterListInterpreterTpetraHeavy_MPI_1 for the build Trilinos-atdm-sems-gcc-7-2-0 shown at:

which showed the output:

...
MLParameterListInterpreter/MLcoarse4.xml : passed
Testing: MLParameterListInterpreter/MLcoarse5.xml
MLParameterListInterpreter/MLcoarse5.xml : passed
Testing: MLParameterListInterpreter/MLempty.xml
Warning: comparison file Output/MLempty_tpetra.gold not found.  Skipping test
Testing: MLParameterListInterpreter/MLpgamg1.xml
MLParameterListInterpreter/MLpgamg1.xml : passed
Testing: MLParameterListInterpreter/MLpgamg2.xml
Warning: comparison file Output/MLpgamg2_tpetra.gold not found.  Skipping test
Testing: MLParameterListInterpreter/MLrepartitioning1.xml
Binary files Output/MLrepartitioning1_tpetra.gold_filtered and Output/MLrepartitioning1_tpetra.out_filtered differ
MLParameterListInterpreter/MLrepartitioning1.xml : failed
Testing: MLParameterListInterpreter/MLrepartitioning2.xml
MLParameterListInterpreter/MLrepartitioning2.xml : passed
Testing: MLParameterListInterpreter/MLrepartitioning3.xml
MLParameterListInterpreter/MLrepartitioning3.xml : passed
Testing: MLParameterListInterpreter/MLsmoother1.xml
MLParameterListInterpreter/MLsmoother1.xml : passed
Testing: MLParameterListInterpreter/MLsmoother2.xml
MLParameterListInterpreter/MLsmoother2.xml : passed
Testing: MLParameterListInterpreter/MLsmoother3.xml
MLParameterListInterpreter/MLsmoother3.xml : passed
Testing: MLParameterListInterpreter/MLsmoother4.xml
--- Output/MLsmoother4_tpetra.gold_filtered	2018-04-11 05:37:19.446680766 -0600
+++ Output/MLsmoother4_tpetra.out_filtered	2018-04-11 05:37:19.453680853 -0600
@@ -1,320 +0,0 @@
-Level 0
- Setup Smoother (MueLu::Ifpack2Smoother{type = CHEBYSHEV})
- keep smoother data = 0   [default]
- PreSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
- PostSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
- smoother -> 
-  chebyshev: degree = 2
-  chebyshev: ratio eigenvalue = 20
-  A = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}
-  chebyshev: boost factor = 1.1   [unused]
-  chebyshev: min diagonal value = 2.22045e-16   [default]
-  chebyshev: eigenvalue max iterations = 10   [default]
-  chebyshev: zero starting solution = 1   [default]
-  chebyshev: assume matrix does not change = 0   [default]
- 
-Level 1
- Prolongator smoothing (MueLu::SaPFactory)
-  Build (MueLu::TentativePFactory)
-   Build (MueLu::UncoupledAggregationFactory)
-    Build (MueLu::CoalesceDropFactory)
-     Build (MueLu::AmalgamationFactory)
-     [empty list]
-     
-    aggregation: drop tol = 0   [default]
-    aggregation: Dirichlet threshold = 0   [default]
-    aggregation: drop scheme = classical   [default]
-    lightweight wrap = 1   [default]
-    
-   aggregation: max agg size = -1   [default]
-   aggregation: min agg size = 1   [unused]
-   aggregation: max selected neighbors = 0   [unused]
-   aggregation: ordering = natural   [unused]
-   aggregation: enable phase 1 = 1   [default]
-   aggregation: enable phase 2a = 1   [default]
-   aggregation: enable phase 2b = 1   [default]
-   aggregation: enable phase 3 = 1   [default]
-   aggregation: preserve Dirichlet points = 0   [unused]
-   aggregation: allow user-specified singletons = 0   [default]
-   aggregation: error on nodes with no on-rank neighbors = 0   [default]
-   OnePt aggregate map name =    [default]
-   OnePt aggregate map factory =    [default]
-   
-   Build (MueLu::CoarseMapFactory)
-   Striding info = {}   [default]
-   Strided block id = -1   [default]
-   Domain GID offsets = {0}   [default]
-   
-  tentative: calculate qr = 1   [default]
-  matrixmatrix: kernel params -> 
-   [empty list]
-  
- sa: damping factor = 1.33333
- sa: calculate eigenvalue estimate = 0   [default]
- sa: eigenvalue estimate num iterations = 10   [default]
- matrixmatrix: kernel params -> 
-  [empty list]
- 
- Transpose P (MueLu::TransPFactory)
- matrixmatrix: kernel params -> 
-  [empty list]
- 
- Computing Ac (MueLu::RAPFactory)
- transpose: use implicit = 0   [default]
- rap: triple product = 0   [default]
- rap: fix zero diagonals = 0   [default]
- CheckMainDiagonal = 0   [default]
- RepairMainDiagonal = 0
- matrixmatrix: kernel params -> 
-  [empty list]
- 
- Setup Smoother (MueLu::Ifpack2Smoother{type = CHEBYSHEV})
- keep smoother data = 0   [default]
- PreSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
- PostSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
- smoother -> 
-  chebyshev: degree = 2
-  chebyshev: ratio eigenvalue = 20
-  A = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}
-  chebyshev: boost factor = 1.1   [unused]
-  chebyshev: min diagonal value = 2.22045e-16   [default]
-  chebyshev: eigenvalue max iterations = 10   [default]
-  chebyshev: zero starting solution = 1   [default]
-  chebyshev: assume matrix does not change = 0   [default]
- 
-Level 2
- Prolongator smoothing (MueLu::SaPFactory)
-  Build (MueLu::TentativePFactory)
-   Build (MueLu::UncoupledAggregationFactory)
-    Build (MueLu::CoalesceDropFactory)
-     Build (MueLu::AmalgamationFactory)
-     [empty list]
-     
-    aggregation: drop tol = 0   [default]
-    aggregation: Dirichlet threshold = 0   [default]
-    aggregation: drop scheme = classical   [default]
-    lightweight wrap = 1   [default]
-    
-   aggregation: max agg size = -1   [default]
-   aggregation: min agg size = 1   [unused]
-   aggregation: max selected neighbors = 0   [unused]
-   aggregation: ordering = natural   [unused]
-   aggregation: enable phase 1 = 1   [default]
-   aggregation: enable phase 2a = 1   [default]
-   aggregation: enable phase 2b = 1   [default]
-   aggregation: enable phase 3 = 1   [default]
-   aggregation: preserve Dirichlet points = 0   [unused]
-   aggregation: allow user-specified singletons = 0   [default]
-   aggregation: error on nodes with no on-rank neighbors = 0   [default]
-   OnePt aggregate map name =    [default]
-   OnePt aggregate map factory =    [default]
-   
-   Nullspace factory (MueLu::NullspaceFactory)
-   Fine level nullspace = Nullspace
-   
-   Build (MueLu::CoarseMapFactory)
-   Striding info = {}   [default]
-   Strided block id = -1   [default]
-   Domain GID offsets = {0}   [default]
-   
-  tentative: calculate qr = 1   [default]
-  matrixmatrix: kernel params -> 
-   [empty list]
-  
- sa: damping factor = 1.33333
- sa: calculate eigenvalue estimate = 0   [default]
- sa: eigenvalue estimate num iterations = 10   [default]
- matrixmatrix: kernel params -> 
-  [empty list]
- 
- Transpose P (MueLu::TransPFactory)
- matrixmatrix: kernel params -> 
-  [empty list]
- 
- Computing Ac (MueLu::RAPFactory)
- transpose: use implicit = 0   [default]
- rap: triple product = 0   [default]
- rap: fix zero diagonals = 0   [default]
- CheckMainDiagonal = 0   [default]
- RepairMainDiagonal = 0
- matrixmatrix: kernel params -> 
-  [empty list]
- 
- Setup Smoother (MueLu::Ifpack2Smoother{type = CHEBYSHEV})
- keep smoother data = 0   [default]
- PreSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
- PostSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
- smoother -> 
-  chebyshev: degree = 2
-  chebyshev: ratio eigenvalue = 20
-  A = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}
-  chebyshev: boost factor = 1.1   [unused]
-  chebyshev: min diagonal value = 2.22045e-16   [default]
-  chebyshev: eigenvalue max iterations = 10   [default]
-  chebyshev: zero starting solution = 1   [default]
-  chebyshev: assume matrix does not change = 0   [default]
- 
-Level 3
- Prolongator smoothing (MueLu::SaPFactory)
-  Build (MueLu::TentativePFactory)
-   Build (MueLu::UncoupledAggregationFactory)
-    Build (MueLu::CoalesceDropFactory)
-     Build (MueLu::AmalgamationFactory)
-     [empty list]
-     
-    aggregation: drop tol = 0   [default]
-    aggregation: Dirichlet threshold = 0   [default]
-    aggregation: drop scheme = classical   [default]
-    lightweight wrap = 1   [default]
-    
-   aggregation: max agg size = -1   [default]
-   aggregation: min agg size = 1   [unused]
-   aggregation: max selected neighbors = 0   [unused]
-   aggregation: ordering = natural   [unused]
-   aggregation: enable phase 1 = 1   [default]
-   aggregation: enable phase 2a = 1   [default]
-   aggregation: enable phase 2b = 1   [default]
-   aggregation: enable phase 3 = 1   [default]
-   aggregation: preserve Dirichlet points = 0   [unused]
-   aggregation: allow user-specified singletons = 0   [default]
-   aggregation: error on nodes with no on-rank neighbors = 0   [default]
-   OnePt aggregate map name =    [default]
-   OnePt aggregate map factory =    [default]
-   
-   Nullspace factory (MueLu::NullspaceFactory)
-   Fine level nullspace = Nullspace
-   
-   Build (MueLu::CoarseMapFactory)
-   Striding info = {}   [default]
-   Strided block id = -1   [default]
-   Domain GID offsets = {0}   [default]
-   
-  tentative: calculate qr = 1   [default]
-  matrixmatrix: kernel params -> 
-   [empty list]
-  
- sa: damping factor = 1.33333
- sa: calculate eigenvalue estimate = 0   [default]
- sa: eigenvalue estimate num iterations = 10   [default]
- matrixmatrix: kernel params -> 
-  [empty list]
- 
- Transpose P (MueLu::TransPFactory)
- matrixmatrix: kernel params -> 
-  [empty list]
- 
- Computing Ac (MueLu::RAPFactory)
- transpose: use implicit = 0   [default]
- rap: triple product = 0   [default]
- rap: fix zero diagonals = 0   [default]
- CheckMainDiagonal = 0   [default]
- RepairMainDiagonal = 0
- matrixmatrix: kernel params -> 
-  [empty list]
- 
- Setup Smoother (MueLu::Ifpack2Smoother{type = CHEBYSHEV})
- keep smoother data = 0   [default]
- PreSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
- PostSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
- smoother -> 
-  chebyshev: degree = 2
-  chebyshev: ratio eigenvalue = 20
-  A = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}
-  chebyshev: boost factor = 1.1   [unused]
-  chebyshev: min diagonal value = 2.22045e-16   [default]
-  chebyshev: eigenvalue max iterations = 10   [default]
-  chebyshev: zero starting solution = 1   [default]
-  chebyshev: assume matrix does not change = 0   [default]
- 
-Level 4
- Prolongator smoothing (MueLu::SaPFactory)
-  Build (MueLu::TentativePFactory)
-   Build (MueLu::UncoupledAggregationFactory)
-    Build (MueLu::CoalesceDropFactory)
-     Build (MueLu::AmalgamationFactory)
-     [empty list]
-     
-    aggregation: drop tol = 0   [default]
-    aggregation: Dirichlet threshold = 0   [default]
-    aggregation: drop scheme = classical   [default]
-    lightweight wrap = 1   [default]
-    
-   aggregation: max agg size = -1   [default]
-   aggregation: min agg size = 1   [unused]
-   aggregation: max selected neighbors = 0   [unused]
-   aggregation: ordering = natural   [unused]
-   aggregation: enable phase 1 = 1   [default]
-   aggregation: enable phase 2a = 1   [default]
-   aggregation: enable phase 2b = 1   [default]
-   aggregation: enable phase 3 = 1   [default]
-   aggregation: preserve Dirichlet points = 0   [unused]
-   aggregation: allow user-specified singletons = 0   [default]
-   aggregation: error on nodes with no on-rank neighbors = 0   [default]
-   OnePt aggregate map name =    [default]
-   OnePt aggregate map factory =    [default]
-   
-   Nullspace factory (MueLu::NullspaceFactory)
-   Fine level nullspace = Nullspace
-   
-   Build (MueLu::CoarseMapFactory)
-   Striding info = {}   [default]
-   Strided block id = -1   [default]
-   Domain GID offsets = {0}   [default]
-   
-  tentative: calculate qr = 1   [default]
-  matrixmatrix: kernel params -> 
-   [empty list]
-  
- sa: damping factor = 1.33333
- sa: calculate eigenvalue estimate = 0   [default]
- sa: eigenvalue estimate num iterations = 10   [default]
- matrixmatrix: kernel params -> 
-  [empty list]
- 
- Transpose P (MueLu::TransPFactory)
- matrixmatrix: kernel params -> 
-  [empty list]
- 
- Computing Ac (MueLu::RAPFactory)
- transpose: use implicit = 0   [default]
- rap: triple product = 0   [default]
- rap: fix zero diagonals = 0   [default]
- CheckMainDiagonal = 0   [default]
- RepairMainDiagonal = 0
- matrixmatrix: kernel params -> 
-  [empty list]
- 
- Setup Smoother (MueLu::Amesos2Smoother{type = <ignored>})
- keep smoother data = 0   [default]
- PreSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
- PostSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
- smoother -> 
-  A = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}
- 
-
---------------------------------------------------------------------------------
----                            Multigrid Summary                             ---
---------------------------------------------------------------------------------
-Number of levels    = 5
-Operator complexity = 1.49
-Smoother complexity = 1.98
-Cycle type          = V
-
-level  rows    nnz  nnz/row  c ratio  procs
-  0    9999  29995     3.00               1
-  1    3333   9997     3.00     3.00      1
-  2    1111   3331     3.00     3.00      1
-  3     371   1111     2.99     2.99      1
-  4     124    370     2.98     2.99      1
-
-Smoother (level 0) both : "Ifpack2::Chebyshev": {Initialized: true, Computed: true, "Ifpack2::Details::Chebyshev":{degree: 2, lambdaMax = <ignored>, alpha: 20, lambdaMin = <ignored>, boost factor: 1.1}, Global matrix dimensions: [9999, 9999], Global nnz: 29995}
-
-Smoother (level 1) both : "Ifpack2::Chebyshev": {Initialized: true, Computed: true, "Ifpack2::Details::Chebyshev":{degree: 2, lambdaMax = <ignored>, alpha: 20, lambdaMin = <ignored>, boost factor: 1.1}, Global matrix dimensions: [3333, 3333], Global nnz: 9997}
-
-Smoother (level 2) both : "Ifpack2::Chebyshev": {Initialized: true, Computed: true, "Ifpack2::Details::Chebyshev":{degree: 2, lambdaMax = <ignored>, alpha: 20, lambdaMin = <ignored>, boost factor: 1.1}, Global matrix dimensions: [1111, 1111], Global nnz: 3331}
-
-Smoother (level 3) both : "Ifpack2::Chebyshev": {Initialized: true, Computed: true, "Ifpack2::Details::Chebyshev":{degree: 2, lambdaMax = <ignored>, alpha: 20, lambdaMin = <ignored>, boost factor: 1.1}, Global matrix dimensions: [371, 371], Global nnz: 1111}
-
-Smoother (level 4) pre  : <Direct> solver interface
-Smoother (level 4) post : no smoother
-
MLParameterListInterpreter/MLsmoother4.xml : failed
Testing: MLParameterListInterpreter/MLunsmoothed1.xml
MLParameterListInterpreter/MLunsmoothed1.xml : passed
...

@cgcgcg
Contributor

cgcgcg commented Apr 11, 2018

Huh. This seems to suggest that in fact the external calls to sed and diff did go through, but that we somehow sometimes produce an empty output file?
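
To tell these cases apart in a log, a check along the following lines could report whether the raw output was already empty before filtering, whether sed itself failed, or whether only the filtered file came out empty. This is a sketch under assumptions: the function name filter_checked, the return codes, and the single sed pattern are all hypothetical, and the real test drives sed from the test executable rather than from a shell function.

```shell
# Filter one raw output file and classify what went wrong, if anything.
# Returns 0 if ok, 1 if the raw output is missing or empty,
# 2 if sed itself failed, 3 if the filtered result is empty.
# (hypothetical helper; the pattern is a stand-in for the real filter)
filter_checked() {
  raw=$1
  filtered=$2
  if [ ! -s "$raw" ]; then
    echo "raw output $raw is missing or empty" >&2
    return 1
  fi
  if ! sed -e 's/type = Klu/type = <ignored>/' "$raw" > "$filtered"; then
    echo "sed failed on $raw" >&2
    return 2
  fi
  if [ ! -s "$filtered" ]; then
    echo "filtered file $filtered is empty" >&2
    return 3
  fi
}
```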

@bartlettroscoe
Member Author

Huh. This seems to suggest that in fact the external calls to sed and diff did go through, but that we somehow sometimes produce an empty output file?

@cgcgcg,

Do these tests write a file and then expect to immediately have that file read? If that is the case, I have seen cases where a file does not get completely written before it gets read in again. We were having random failures like this a few years ago with a CASL VERA test (TeuchosWrappersExt I think). I will have to dig some to see how I fixed that test, but it had something to do with committing one of the files so that there was not a race to read the file right after it got written.

@cgcgcg
Contributor

cgcgcg commented Apr 11, 2018

@bartlettroscoe Yes, we dump the output of a run to file, and then immediately run sed and diff over that
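
If the write-then-read hypothesis is right, one crude way to probe it would be to poll for the dumped file to become non-empty before running sed and diff on it. The sketch below is a diagnostic idea only, not a proposed fix: the function name wait_for_nonempty and the one-second poll interval are arbitrary choices, and a non-empty file is not necessarily a completely written one.

```shell
# Wait until a file exists and is non-empty, polling once per second.
# Returns 0 once the file has content, 1 after max_tries unsuccessful polls.
# (hypothetical helper for probing the race, not part of the MueLu tests)
wait_for_nonempty() {
  file=$1
  max_tries=$2
  try=0
  while [ ! -s "$file" ]; do
    if [ "$try" -ge "$max_tries" ]; then
      echo "gave up waiting for $file" >&2
      return 1
    fi
    try=$((try + 1))
    sleep 1
  done
}
```

If the failures disappeared with such a wait in place, that would point toward a visibility delay rather than a bug in how the output is produced.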

@aprokop
Contributor

aprokop commented Apr 11, 2018

Hmm, documentation says that filebuf should synchronize the contents to file on close(). There is also an explicit sync call. But I wonder if it's the OS that holds the content of the file for some time before writing it out. I wonder if we can find a call somewhere to explicitly flush things out. This may be the reason for the race condition, as the output seems to suggest that the second file is simply absent.

@mhoemmen
Contributor

Sierra tests also have seen this filesystem sync issue.

@cgcgcg Why not use a pipe? Dump output to stdout (for example), then pipe through sed.

@cgcgcg
Contributor

cgcgcg commented Apr 11, 2018

@mhoemmen The executable loops over a bunch of xml files, not just one. So I don't think we can use a pipe here.

@aprokop
Contributor

aprokop commented Apr 11, 2018

@mhoemmen Pipe would be fine for sed but then we also call diff on two files. I know that diff can do diff <file1 <file2 but not with piping.
Hmm, it actually may work:

cat file2 | sed 's/<a>/<b>/' | diff file1 -

So, as long as we don't need to pipe the first file too (just need to make sure it has <ignored> in proper places), Mark's suggestion may work. One thing to remember, though, is that we need to retain the option to dump the output content into a file, as we sometimes need to update the gold files.

@cgcgcg That's not a problem, we redirect output independently for each file.

@cgcgcg
Contributor

cgcgcg commented Apr 11, 2018

@aprokop Would something like diff existing.gold - work?

@mhoemmen
Contributor

@aprokop Dumping to a file independently is fine. Pipes are nice because they offer built-in synchronization guarantees. File systems more or less have no synch guarantees ;-) .

@aprokop
Contributor

aprokop commented Apr 11, 2018

@cgcgcg That's what I was thinking. The - would let diff take its input from the result of piping through sed.
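
Putting the pieces of this exchange together, the pipe-based comparison might look roughly like the sketch below. It assumes the gold file already contains <ignored> in the right places, uses tee to keep a raw copy on disk for regenerating gold files, and uses a single stand-in sed pattern. The function name compare_via_pipe and its argument order are made up for illustration.

```shell
# Compare a command's output against a gold file without re-reading a
# freshly written file: output flows through sed straight into diff.
# Usage: compare_via_pipe <gold_file> <raw_copy> <command> [args...]
# (hypothetical helper; the sed pattern stands in for the real filter)
compare_via_pipe() {
  gold=$1
  raw_copy=$2
  shift 2
  # tee saves the unfiltered output; diff reads the filtered stream via "-".
  "$@" | tee "$raw_copy" | sed -e 's/type = Klu/type = <ignored>/' | diff "$gold" -
}
```

The exit status is that of diff, so 0 means the filtered output matched the gold file. Note that a failure of the command itself is not reflected in that status, so it would need a separate check.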

@jhux2
Member

jhux2 commented Apr 16, 2018

@cgcgcg Are you working on this? If not, I can take.

@cgcgcg
Contributor

cgcgcg commented Apr 16, 2018

@jhux2 No, I am not. Happy if you want to take this!

@bartlettroscoe
Member Author

FYI:

As shown at:

the test MueLu_ParameterListInterpreterTpetraHeavy_MPI_1 failed again yesterday for the build
Trilinos-atdm-sems-gcc-7-2-0.

As shown at:

one of these MueLu_ParameterListInterpreterXXX tests has failed 12 times in just the "Trilinos-atdm-sems-gcc-7-2-0" builds since 2/7/2018. That is 12 times in about 3 months.

@bartlettroscoe
Member Author

FYI:

As shown at:

the test MueLu_ParameterListInterpreterTpetra_MPI_1 or MueLu_ParameterListInterpreterTpetraHeavy_MPI_1 failed four times in the last 15 days, just in this one Trilinos-atdm-sems-gcc-7-2-0 build.

And as shown at:

one of these tests also failed several times in the last 15 days in the builds Trilinos-atdm-white-ride-cuda-debug and Trilinos-atdm-white-ride-cuda-opt on white and ride.

This issue has been open for more than 2.5 months now. That is enough time to have fixed this issue, if it was going to be fixed. Therefore, it is time to disable these tests in the Trilinos-atdm-sems-gcc-7-2-0, Trilinos-atdm-white-ride-cuda-debug and Trilinos-atdm-white-ride-cuda-opt builds. And if this test is seen to fail in the same way in other builds, we will disable it in those builds as well.

As long as the auto PR builds run these tests and other tests cover the core functionality of MueLu, then disabling these tests should not be so bad. But if core functionality of MueLu is only covered in these tests, then these should be fixed.

@bartlettroscoe
Member Author

Last update before I pull the trigger on disabling the randomly failing tests MueLu_ParameterListInterpreterTpetra_MPI_1 and MueLu_ParameterListInterpreterTpetraHeavy_MPI_1 in a few builds ...

As shown in this query, since 4/1/2018, the MueLu "ParameterListInterpreter" tests MueLu_ParameterListInterpreterTpetra_MPI_1 or MueLu_ParameterListInterpreterTpetraHeavy_MPI_1 failed 24 times in the builds:

  • Trilinos-atdm-sems-gcc-7-2-0
  • Trilinos-atdm-white-ride-cuda-debug
  • Trilinos-atdm-white-ride-cuda-opt

So out of 139 days it failed 24 times. That is not a huge ratio of fail/pass, but if you combine several of these randomly failing tests then the probability of having a failure (and blocking an auto promotion of a version of Trilinos to an ATDM app) goes up. We just can't tolerate randomly failing tests in these ATDM Trilinos builds.
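To make the "probability goes up" point concrete, here is a rough sketch. It assumes (and this is only an assumption) that each flaky test fails independently on a given day with the same rate, and it treats the 24-out-of-139 count above as that per-day rate for one test. The function name `prob_any_failure` is hypothetical.

```python
def prob_any_failure(per_test_fail_rate, num_tests):
    """Probability that at least one of num_tests independent tests fails."""
    return 1.0 - (1.0 - per_test_fail_rate) ** num_tests


# Per-day failure rate taken from the counts quoted above (24 failures, 139 days).
observed_rate = 24 / 139

# How the chance of a blocked promotion grows with the number of flaky tests.
for n in (1, 3, 5):
    print(n, round(prob_any_failure(observed_rate, n), 3))
```

Under these assumptions a single such test blocks a promotion roughly one day in six, and three of them together block it on well over a third of days.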

I am going to disable the tests MueLu_ParameterListInterpreterTpetra_MPI_1 and MueLu_ParameterListInterpreterTpetraHeavy_MPI_1 in these builds now.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 22, 2018
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 23, 2018
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 23, 2018
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Jun 23, 2018
@bartlettroscoe
Member Author

CC: @srajama1 (Trilinos Linear Solvers Product Lead)

FYI: These tests were disabled in this build in commit e872708 merged to 'develop' on 6/25/2018 as part of PR #3011.

You can see that the tests MueLu_ParameterListInterpreterTpetra_MPI_1 and MueLu_ParameterListInterpreterTpetraHeavy_MPI_1 do not appear in the set of MueLu tests run on this build yesterday as shown in this query. Note that the "MPI_4" versions of these tests are running and passing, so it does not seem like much of a loss to not run the "MPI_1" versions. This is not even a production ATDM platform. It is just a GCC 7.2.0 build to try to help catch Trilinos issues with this compiler, independent of CUDA.

I am adding the label "Disabled Tests". MueLu developers can now fix this offline on their own schedule if they desire to do so.

@bartlettroscoe bartlettroscoe added Disabled Tests Issue has been partially addressed by disabling *all* of the failing tests related to the issue and removed stage: in progress Work on the issue has started labels Jul 13, 2018
@jhux2
Member

jhux2 commented Oct 9, 2018

@bartlettroscoe I followed the "Steps to reproduce", but cmake fails. It appears to be trying to use the Intel compiler. Since this is a gcc build, I'm not sure why.

Modules
(.../jhu/trilinos/build-github-issue-2311) module list
Currently Loaded Modulefiles:
  1) sems-env                   5) sems-git/2.10.1            9) atdm-zlib/1.2.8/atdm      13) atdm-scotch/6.0.3/atdm
  2) atdm-env                   6) atdm-gcc/7.2.0            10) atdm-hdf5/1.8.12/atdm     14) atdm-openblas/0.2.20
  3) sems-python/2.7.9          7) atdm-openmpi/1.6.5/atdm   11) atdm-netcdf/4.4.1/atdm    15) atdm-superlu/4.3/atdm
  4) atdm-cmake/3.11.1          8) atdm-boost/1.63.0/atdm    12) atdm-parmetis/4.0.3/atdm
cmake error message
-- Check for working C compiler: /projects/sems/install/rhel6-x86_64/atdm/compiler/gcc/7.2.0/openmpi/1.6.5/bin/mpicc
-- Check for working C compiler: /projects/sems/install/rhel6-x86_64/atdm/compiler/gcc/7.2.0/openmpi/1.6.5/bin/mpicc -- broken
CMake Error at /projects/sems/install/rhel6-x86_64/atdm/binary-install/cmake-3.11.1-Linux-x86_64/share/cmake-3.11/Modules/CMakeTestCCompiler.cmake:52 (message):
  The C compiler

    "/projects/sems/install/rhel6-x86_64/atdm/compiler/gcc/7.2.0/openmpi/1.6.5/bin/mpicc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /scratch/jhu/trilinos/build-github-issue-2311/CMakeFiles/CMakeTmp
    
    Run Build Command:"/ascldap/users/jhu/bin/ninja" "cmTC_ea94b"
    [1/2] Building C object CMakeFiles/cmTC_ea94b.dir/testCCompiler.c.o
    FAILED: CMakeFiles/cmTC_ea94b.dir/testCCompiler.c.o 
    /projects/sems/install/rhel6-x86_64/atdm/compiler/gcc/7.2.0/openmpi/1.6.5/bin/mpicc    -o CMakeFiles/cmTC_ea94b.dir/testCCompiler.c.o   -c testCCompiler.c
    
    Error: A license for Comp-CL could not be obtained (-1,359,2).
    
    Is your license file in the right location and readable?
    The location of your license file should be specified via
    the $INTEL_LICENSE_FILE environment variable.
    
    License file(s) used were (in this order):
        1.  Trusted Storage
    **  2.  /projects/sems/install/rhel6-x86_64/sems/compiler/intel/17.0.1/base/compilers_and_libraries_2017.1.132/linux/bin/intel64/../../Licenses
    **  3.  /ascldap/users/jhu/Licenses
    **  4.  /opt/intel/licenses
    **  5.  /Users/Shared/Library/Application Support/Intel/Licenses
    **  6.  /projects/sems/install/rhel6-x86_64/sems/compiler/intel/17.0.1/base/compilers_and_libraries_2017.1.132/linux/bin/intel64/*.lic
    
    Please refer http://software.intel.com/sites/support/ for more information..
    
    icc: error #10052: could not checkout FLEXlm license
    ninja: build stopped: subcommand failed.
    

  

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  cmake/tribits/core/package_arch/TribitsGlobalMacros.cmake:1815 (ENABLE_LANGUAGE)
  cmake/tribits/core/package_arch/TribitsProjectImpl.cmake:188 (TRIBITS_SETUP_ENV)
  cmake/tribits/core/package_arch/TribitsProject.cmake:93 (TRIBITS_PROJECT_IMPL)
  CMakeLists.txt:90 (TRIBITS_PROJECT)


-- Configuring incomplete, errors occurred!
See also "/scratch/jhu/trilinos/build-github-issue-2311/CMakeFiles/CMakeOutput.log".
See also "/scratch/jhu/trilinos/build-github-issue-2311/CMakeFiles/CMakeError.log".
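One way to check which underlying compiler an Open MPI wrapper will invoke is its `--showme` option; the `OMPI_CC` environment variable can also override the wrapped C compiler. The sketch below gathers both pieces of information. It assumes an Open MPI style wrapper (other MPI implementations use different flags), and `describe_mpi_wrapper` is a hypothetical helper, not part of Trilinos or the ATDM scripts.

```python
import os
import shutil
import subprocess


def describe_mpi_wrapper(wrapper="mpicc"):
    """Collect hints about which real compiler an MPI wrapper would call."""
    info = {
        "wrapper_path": shutil.which(wrapper),  # None if not on PATH
        "OMPI_CC": os.environ.get("OMPI_CC"),   # Open MPI override, if set
        "showme": None,
    }
    if info["wrapper_path"] is not None:
        try:
            # Open MPI wrappers print the full underlying command for --showme.
            result = subprocess.run(
                [info["wrapper_path"], "--showme"],
                capture_output=True, text=True, timeout=30)
            info["showme"] = result.stdout.strip() or result.stderr.strip()
        except (OSError, subprocess.TimeoutExpired):
            pass  # leave "showme" as None if the wrapper cannot be run
    return info


# Example: inspect whatever mpicc is first on PATH (may be absent).
report = describe_mpi_wrapper("mpicc")
```

If the reported command starts with icc rather than gcc, the wrapper or the environment is pointing at the Intel compiler, which would be consistent with the license error in the log above.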

@bartlettroscoe
Member Author

I followed the "Steps to reproduce", but cmake fails. It appears to be trying to use the Intel compiler. Since this is a gcc build, I'm not sure why.

@jhux2, we disabled that build because it was using OpenMPI 1.6.5 and STK no longer supports that version of MPI. See #3390.

Don't know why we never saw this test randomly fail in other builds.

Feel free to close this issue as "wontfix" at this point.

@jhux2 jhux2 added the resolved: wontfix The development team cannot or will not address this issue label Oct 9, 2018
@jhux2
Member

jhux2 commented Oct 9, 2018

Closing as "wontfix".

@jhux2 jhux2 closed this as completed Oct 9, 2018
@bartlettroscoe bartlettroscoe added the PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area label Nov 30, 2018
@jhux2 jhux2 added this to MueLu Aug 12, 2024
@jhux2 jhux2 moved this to Done in MueLu Aug 12, 2024