Random failures due to jumbled output in TpetraCore_Bug7745_MPI_4 and TpetraCore_MultiVector_LocalViewTests_MPI_4 starting 2022-08-05? #10885

Closed
bartlettroscoe opened this issue Aug 15, 2022 · 10 comments
Labels
  • impacting: tests - The defect (bug) is primarily a test failure (vs. a build failure)
  • PA: Data Services - Issues that fall under the Trilinos Data Services Product Area
  • pkg: Tpetra
  • type: bug - The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

CC: @trilinos/tpetra

Description

As shown in this query (click "Show Matching Output" in upper right) the tests:

  • TpetraCore_Bug7745_MPI_4
  • TpetraCore_MultiVector_LocalViewTests_MPI_4

appear to be randomly failing in the builds:

  • PR-10801-test-rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables-331
  • PR-10802-test-rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables-423
  • PR-10808-test-rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables-325
  • PR-10808-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-794
  • PR-10834-test-rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables-409

starting testing day 2022-08-05.

The TpetraCore_Bug7745_MPI_4 test failures appear to be caused by other output jumbling the line End Result: TEST PASSED, such as here showing:

Total Time: 0.0481 sec

Summary: total = 24, run = 24, passed = 24, failed = 0

End Result: 3 SUPERSET MAP
3 SUPERSET TO DEFAULT 
3 DEFAULT MAP
3 NOSAMES MAP
3 NOSAMES TO DEFAULT 
1 SUPERSET MAP
1 SUPERSET TO DEFAULT 
1 DEFAULT MAP
1 NOSAMES MAP
1 NOSAMES TO DEFAULT 
2 SUPERSET MAP
2 SUPERSET TO DEFAULT 
2 DEFAULT MAP
2 NOSAMES MAP
2 NOSAMES TO DEFAULT 
TEST PASSED
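
The failure mode in the output above can be sketched as follows. This is a minimal simulation, not the actual test harness: it assumes pass/fail is decided by searching the captured stdout for the literal text End Result: TEST PASSED, and that the line is emitted in two writes so that output from other ranks can land in between. The names harness_passes and interleave are hypothetical.

```python
import re

# Assumption: the harness passes a test only if this literal text
# appears somewhere in the captured stdout.
PASS_PATTERN = re.compile(r"End Result: TEST PASSED")


def harness_passes(stdout_text: str) -> bool:
    """Return True if the captured output contains the pass line intact."""
    return PASS_PATTERN.search(stdout_text) is not None


def interleave(rank0_chunks, other_output, split_after):
    """Simulate other ranks' output landing after `split_after` rank-0 writes."""
    head = "".join(rank0_chunks[:split_after])
    tail = "".join(rank0_chunks[split_after:])
    return head + other_output + tail


# Rank 0 emits the result line in two pieces (an assumption for illustration).
rank0_chunks = ["End Result: ", "TEST PASSED\n"]
other_ranks = "3 SUPERSET MAP\n3 DEFAULT MAP\n"

clean = interleave(rank0_chunks, other_ranks, split_after=2)
jumbled = interleave(rank0_chunks, other_ranks, split_after=1)

print(harness_passes(clean))    # True
print(harness_passes(jumbled))  # False
```

In the jumbled case every unit test still passed; only the text match on the result line fails, which is consistent with the "passed = 24, failed = 0" summary shown in the log.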

The TpetraCore_MultiVector_LocalViewTests_MPI_4 failures are also technically due to jumbling of the line End Result: TEST PASSED, such as here showing:


Total Time: 0.401 sec

Summary: total = 24, run = 24, passed = 24, failed = 0

End 1 caught exception trying to get a device view while holding a local view 
1 caught exception trying to get a device view while holding a local view 
1 caught exception trying to get a device view while holding a local view 
1 caught exception trying to get a device view while holding a local view 
1 caught exception trying to get a device view while holding a local view
1 caught exception trying to get a local view while holding a device view
1 caught exception trying to get a device view while holding a local view 
1 caught exception trying to get a device view while holding a local view 
1 caught exception trying to get a device view while holding a local view 
1 caught exception trying to get a device view while holding a local view 
1 caught exception trying to get a device view while holding a local view
1 caught exception trying to get a local view while holding a device view
2 caught exception trying to get a device view while holding a local view 
2 caught exception trying to get a device view while holding a local view 
2 caught exception trying to get a device view while holding a local view 
2 caught exception trying to get a device view while holding a local view 
2 caught exception trying to get a device view while holding a local view
2 caught exception trying to get a local view while holding a device view
2 caught exception trying to get a device view while holding a local view 
2 caught exception trying to get a device view while holding a local view 
2 caught exception trying to get a device view while holding a local view 
2 caught exception trying to get a device view while holding a local view 
2 caught exception trying to get a device view while holding a local view
2 caught exception trying to get a local view while holding a device view
3 caught exception trying to get a device view while holding a local view
3 caught exception trying to get a local view while holding a device view
3 caught exception trying to get a device view while holding a local view 
3 caught exception trying to get a device view while holding a local view 
3 caught exception trying to get a device view while holding a local view 
3 caught exception trying to get a device view while holding a local view 
3 caught exception trying to get a device view while holding a local view
3 caught exception trying to get a local view while holding a device view
Result: TEST PASSED

In the above case, even though all of the unit tests claimed to pass, the fact that so many exceptions are being thrown might indicate that this test is actually defective, and that it is just dumb chance that these exception messages are jumbling the line End Result: TEST PASSED.

Current Status on CDash

Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

Steps to Reproduce

Follow instructions at:

However, because these are random failures, triggering the failing test may be hard to do.

@bartlettroscoe added the labels type: bug, pkg: Tpetra, impacting: tests, and PA: Data Services on Aug 15, 2022
@bartlettroscoe
Member Author

@jhux2, @csiefer2, more seemingly randomly failing Tpetra tests to bring down PR builds ...

@bartlettroscoe
Member Author

FYI: Note that the single randomly failing test TpetraCore_MultiVector_LocalViewTests_MPI_4 in the last PR build:

for PR #10802 shown here was the only failure in all of those PR builds that iteration as shown in this query.

So one impediment to getting PRs for Tpetra merged is Tpetra's own randomly failing tests.

@csiefer2
Member

csiefer2 commented Aug 16, 2022

TpetraCore_MultiVector_LocalViewTests_MPI_4 is supposed to throw exceptions. That's the point. It's testing error cases. As far as I can tell from the output, the test is running correctly.

The problem is, as you noted, an output ordering munge. Why CUDA 11 suddenly makes it cry on the output is somewhat beyond me. But there you go. The CUDA 11 builds are wonky as all get out.

Turning the output off should be an easy fix if I can actually reproduce the issue. I just wasted the whole afternoon trying to reproduce the build only to have the reproducer refuse to build the Tpetra tests. The documentation for PR reproduction still hasn't been updated, so I suspect I'm doing it wrong again.

I haven't looked at the other one yet.

@csiefer2
Member

And on the other machine the reproducer won't even configure correctly. Time to file another TRILINOSHD ticket...

@tasmith4
Contributor

@csiefer2 I had independently noticed this yesterday and took a brief look -- at least in the case of MultiVector_LocalViewTests, the output that is overlapping with "End Result: TEST PASSED" is printed directly from catch blocks (i.e. direct cout statements, not the exception message itself). Historically, we've had a number of Tpetra tests that print output for informational purposes only, even when the test is passing. I thought we had fixed a lot of that because it has broken the PR tester before (and is of limited utility since you mostly want output when the test is failing). I'm pretty sure if we delete any informational output statements from the passing branch of the code, everything will work fine.
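
The pattern described above (informational prints inside catch blocks on the passing path) and the proposed fix can be sketched roughly like this. It is a Python analog for illustration only, not the Tpetra test code: ViewConflictError, get_device_view, and check_conflict_is_rejected are hypothetical names, and the verbose flag is an assumption about how one might keep the output available for debugging.

```python
import io
import sys


class ViewConflictError(RuntimeError):
    """Hypothetical stand-in for the exception thrown on a conflicting view request."""


def get_device_view(holding_local_view: bool) -> str:
    """Hypothetical accessor that refuses a device view while a local view is held."""
    if holding_local_view:
        raise ViewConflictError("device view requested while a local view is held")
    return "device-view"


def check_conflict_is_rejected(rank: int, verbose: bool = False, out=sys.stdout) -> bool:
    """Return True if the conflicting request raised; print only when verbose."""
    try:
        get_device_view(holding_local_view=True)
    except ViewConflictError:
        if verbose:
            # Informational only: the test result does not depend on this line.
            print(f"{rank} caught exception trying to get a device view "
                  "while holding a local view", file=out)
        return True
    return False


# Default (quiet) run: the check succeeds and nothing is written.
quiet = io.StringIO()
assert check_conflict_is_rejected(rank=1, out=quiet)
assert quiet.getvalue() == ""
```

The pass/fail decision comes from the return value, so dropping the print on the passing path leaves nothing from the other ranks that could interleave with the final result line.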

@tasmith4
Contributor

@csiefer2 @bartlettroscoe I put up #10888 which removes the informational output and should resolve this issue.

@tasmith4
Contributor

I also confirmed that the text of the informational output in question matches the statements reported by @bartlettroscoe to be overlapping with the "TEST PASSED" output.

@bartlettroscoe
Member Author

> TpetraCore_MultiVector_LocalViewTests_MPI_4 is supposed to throw exceptions. That's the point. It's testing error cases. As far as I can tell from the output, the test is running correctly.

But that test does not seem to check that those lines are even printed. It is only checking for End Result: TEST PASSED. So the test would still pass even if the exception messages were not printed.

@csiefer2
Member

> But that test does not seem to check that those lines are even printed. It is only checking for End Result: TEST PASSED. So the test would still pass even if the exception messages were not printed.

It doesn't have to check the output to work. The output is for human debugging purposes only. That's why @tasmith4 turned it off in #10888. We do actually know what we're doing here :)

@tasmith4
Contributor

@bartlettroscoe #10888 has merged, removing the output reported in this issue, so this should now be resolved. Please reopen if you're still seeing problems with Tpetra test output.

Projects
Status: Done
3 participants