MueLu_ParameterListInterpreterXXX tests appear to be randomly failing with GCC 7.2.0 build #2311
Comments
@trilinos/muelu This looks harmless -- it seems like the test expects certain log output, and the output is only different because the sparse direct factorization has changed, not necessarily because the test actually failed. We would just need to change the MueLu "gold" log output file.
I agree that this is harmless. MueLu's diff tool should already be ignoring the direct solver (notice the ""); I'm curious why it did not in this case.
If this is harmless, should we just disable this test for this automated build, and only this automated build? If you do a local configure and run the tests, you would still see the failure, though (which is good or bad depending on your primary concern).
Did a MueLu developer fix this?
It looks like it got fixed: the only new changes pulled today in that build, shown at:, were
Could Dan's update of the YamlParser as part of #2306 and #2308 have accidentally fixed this test?
Not that I know of.
Looking at the query (which goes back to 1/30/2018), you can see that this test failed twice: once on 2/7/2018 and once on 2/28/2018. This fact, together with the error we saw in #2365 for the . If you look at this query: and then sort by "Build Name", then "Test Name", then "Build Time", you can see that these two tests . Therefore, I am going to change the title and scope of this story to match this.
I went back and tried to reproduce the test failure. If you follow the links and look at the configure output, you will see this is for Trilinos version d2683f3. Therefore, I tried to reproduce this on the machine
(see ). The test passed, so it looks like this was indeed a non-deterministic failure. Could there be some file read/write race conditions in these tests? That is a common cause of randomly failing tests like this.
@trilinos/muelu developers: Several of the . I believe these failures should be sending email to the muelu-regressions email list. Therefore, to avoid spamming the MueLu developers (and myself) with these emails, we either need to fix these tests or disable them. Please let me know which approach you would like to take. Simply letting them fail randomly is not an option.
Looks like this test randomly failed in another ATDM build shown at:, which triggered the CDash email shown below. @trilinos/muelu developers, please let me know how you want to address these seemingly randomly failing tests. I would just disable them in all of the ATDM builds, but that might mean not testing some important functionality being used by ATDM applications.

From: CDash
A submission to CDash for the project Trilinos has failing tests. Details on the submission can be found at
Project: Trilinos
Tests failing
-CDash on testing.sandia.gov
@jhux2 (sometimes github is weird & doesn't get through e-mail filters)
@bartlettroscoe I would love to fix these. However, it's unclear how we can do this ... since the failures are random. I am open to suggestions. I do not want these tests to be disabled.
Do these tests involve file I/O where different tests could be reading and writing the same files? If so, that can cause races that result in random failures like this.
These tests involve file I/O, but the files are different for each test run. In other words, no two tests should be attempting to access the same file.
This might help debug issue trilinos#2311.
This might help debug issue trilinos#2311.
@jhux2, okay, thanks for the clarification on that. I just saw the commits 38c73d2 and bb78b26 that might address this. If they do not, we will know in the following days. The issue is that we can't have any randomly failing tests, because that will defeat any automated processes that require 100% passing tests (like the update of Trilinos for SPARC and/or EMPIRE).
This might help debug issue #2311.
Build/Test Cases Summary
Enabled Packages: MueLu
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=503,notpassed=0 (79.33 min)
Other local commits for this build/test group: 9cf438d
Just pushed a small change that checks the return value of the
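For context, a minimal sketch of what checking the return value of an external command could look like, assuming the test shells out via std::system(); the helper name and error handling below are hypothetical illustrations, not the actual change that was pushed:

```cpp
#include <cstdlib>    // std::system
#include <stdexcept>
#include <string>

// Hypothetical helper (not the actual MueLu code): run an external command
// and fail loudly if it could not be launched or exited with nonzero status.
void runExternalCommand(const std::string& cmd) {
  const int ret = std::system(cmd.c_str());
  if (ret == -1) {
    // The shell/child process could not be created at all.
    throw std::runtime_error("Failed to launch: " + cmd);
  }
  if (ret != 0) {
    // The command ran but reported failure (nonzero termination status).
    throw std::runtime_error("Command failed with status " +
                             std::to_string(ret) + ": " + cmd);
  }
}
```

A thrown exception here would make the test fail immediately instead of silently comparing against a partially produced file.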
Looks like that change did not address the problem. We got another one of these failures in the test which showed the output:
Huh. This seems to suggest that in fact the external calls to
Do these tests write a file and then expect to immediately have that file read? If that is the case, I have seen cases where a file does not get completely written before it gets read in again. We were having random failures like this a few years ago with a CASL VERA test (TeuchosWrappersExt, I think). I will have to dig some to see how I fixed that test, but it had something to do with committing one of the files so that there was not a race to read the file right after it got written.
@bartlettroscoe Yes, we dump the output of a run to a file, and then immediately run
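One common mitigation for that write-then-immediately-read pattern is to make sure the file has been fully flushed, and on POSIX systems fsync'd, before the external tool reads it back. A minimal sketch under that assumption (illustrative names, not the actual MueLu code; whether this would help here depends on the filesystem involved):

```cpp
#include <stdio.h>   // fopen, fputs, fflush, fclose, fileno (POSIX)
#include <unistd.h>  // fsync (POSIX)

#include <string>

// Hypothetical helper (not the actual MueLu code): write 'contents' to 'path'
// and do not return until the data has left the stdio buffer and the kernel
// has been asked to commit it, so an immediate reader sees the whole file.
bool writeAndSync(const std::string& path, const std::string& contents) {
  FILE* f = fopen(path.c_str(), "w");
  if (f == NULL) return false;

  fputs(contents.c_str(), f);
  fflush(f);          // push the user-space stdio buffer to the kernel
  fsync(fileno(f));   // ask the kernel to commit the file data
  fclose(f);
  return true;
}
```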
Hmm, documentation says that
Sierra tests also have seen this filesystem sync issue. @cgcgcg Why not use a pipe? Dump output to stdout (for example), then pipe through
@mhoemmen The executable loops over a bunch of xml files, not just one. So I don't think we can use a pipe here.
@mhoemmen A pipe would be fine for
cat file2 | sed 's/<a>/<b>/' | diff file1 -
So, as long as we don't need to pipe the first file too (just need to make sure it has
@cgcgcg That's not a problem; we redirect output independently for each file.
@aprokop Would something like
@aprokop Dumping to a file independently is fine. Pipes are nice because they offer built-in synchronization guarantees. File systems more or less have no sync guarantees ;-).
@cgcgcg That's what I was thinking. The
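For completeness, here is a sketch of the pipe-based alternative being discussed, assuming the comparison is driven from C++; the file names and the sed expression are placeholders taken from the command earlier in this thread, not MueLu's actual filtering rules:

```cpp
#include <stdio.h>   // popen, pclose, fgets (POSIX)

#include <iostream>
#include <string>

// Hypothetical sketch: filter the log and diff it against the gold file in a
// single pipeline, so no intermediate filtered file is ever written to disk.
bool logMatchesGold(const std::string& logFile, const std::string& goldFile) {
  const std::string cmd =
      "cat " + logFile + " | sed 's/<a>/<b>/' | diff " + goldFile + " -";

  FILE* pipe = popen(cmd.c_str(), "r");
  if (pipe == NULL) return false;

  char buffer[512];
  while (fgets(buffer, sizeof(buffer), pipe) != NULL) {
    std::cout << buffer;  // echo any differences into the test output
  }

  // pclose() reports the pipeline's exit status, which here is diff's exit
  // status; diff exits 0 exactly when no differences were found.
  return pclose(pipe) == 0;
}
```

The pipe provides the synchronization that a freshly written temporary file on a networked filesystem may not.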
@cgcgcg Are you working on this? If not, I can take it.
@jhux2 No, I am not. Happy if you want to take this!
FYI: As shown at: the test . As shown at: one of these
FYI: As shown at: the test . And as shown at: one of these tests failed in the builds . This issue has been open for more than 2.5 months now. That is enough time to have fixed this issue, if it was going to be fixed. Therefore, it is time to disable these tests in the . As long as the auto PR builds run these tests and other tests cover the core functionality of MueLu, then disabling these tests should not be so bad. But if core functionality of MueLu is only covered by these tests, then they should be fixed.
Last update before I pull the trigger on disabling the randomly failing tests: As shown in this query, since 4/1/2018, the MueLu "ParameterListInterpreter" tests
So out of 139 days, these tests failed 24 times. That is not a huge fail/pass ratio, but if you combine several of these randomly failing tests, then the probability of having a failure (and blocking an auto promotion of a version of Trilinos to an ATDM app) goes up. We just can't tolerate randomly failing tests in these ATDM Trilinos builds. I am going to disable the tests
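To put a rough number on that last point: if one such test fails on about 24 of 139 days (roughly 17% of days), and you assume, purely for illustration, four tests each failing independently at a similar rate, then the chance that at least one of them fails on a given day is about 1 - (1 - 0.17)^4 ≈ 0.53, i.e., better-than-even odds of a red dashboard on any given day.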
…eavy]_MPI_1 on white/ride (trilinos#2311)
…eavy]_MPI_1 for SEMS GCC 7.2.0 build (trilinos#2311)
CC: @srajama1 (Trilinos Linear Solvers Product Lead). FYI: These tests were disabled in this build in commit e872708, merged to 'develop' on 6/25/2018 as part of PR #3011. You can see that the tests . I am adding the label "Disabled Tests". MueLu developers can now fix this offline on their own schedule if they desire to do so.
@bartlettroscoe I followed the "Steps to Reproduce", but cmake fails. It appears to be trying to use the Intel compiler; since this is a GCC build, I'm not sure why. Modules:
cmake error message
@jhux2, we disabled that build because it was using OpenMPI 1.6.5 and STK no longer supports that version of MPI. See #3390. I don't know why we never saw this test randomly fail in other builds. Feel free to close this issue as "wontfix" at this point.
Closing as "wontfix". |
CC: @trilinos/muelu
Next Action Status
These tests were disabled in this build in commit e872708 merged to 'develop' on 6/25/2018 as part of PR #3011. Next: MueLu developers fix offline and then re-enable in this build if they desire ...
Description
The tests MueLu_ParameterListInterpreterTpetra_MPI_1 and MueLu_ParameterListInterpreterTpetraHeavy_MPI_1 appear to be failing randomly in the GCC build Trilinos-atdm-sems-gcc-7-2-0, as shown at: (sort by "Build Name", then "Test Name", then "Build Time").

For example, the test MueLu_ParameterListInterpreterTpetra_MPI_1 failed as shown at: It shows the failures:
Steps to Reproduce
Using the do-configure script: Anyone should be able to reproduce these builds and run these tests on any SNL COE RHEL6 machine as shown below:
However, given that tests seem to be randomly failing, it may be hard to reproduce these failures.
Related Issues