Clean up builds and tests for all Primary Tested packages for CUDA build on white/ride #2620
Comments
I've been looking into the stokhos failures. They appear to be seg faults associated with calls to the LAPACK function STEQR through the Teuchos::LAPACK interface. Were the issues with BLAS/LAPACK ever resolved on this machine?
As documented in #1208 last year, Tech-X contractors tracked this down to what looks like a compiler defect in passing some arguments between C++ and C or Fortran. But most of those failures appear to be occurring with compiler optimization turned on, as evidenced by the numerous values we see in the 'opt' builds on this system as documented in #2454. We did, however, see an LAPACK test failing even in the 'debug' build as described in #2410. Given all of those failing Belos and Anasazi tests (which use BLAS and LAPACK) for the 'opt' builds, I don't understand how anyone is using the solvers in Trilinos on this system (but somehow they are).

The solution, we were told, was to simply wait for the next version of ATS-2, which should be what exists on the new testbed machine waterman. That was the purpose of the placeholder issue #1675. Issues with that system can be discussed in the protected issue:

Given all of that, I don't know that we should try too hard to address failures related to mixed-language calls on this current Power8 system on 'ride' and 'white'.

CC: @nmhamster
So, you're saying I can ignore these failures?
I am saying that we should disable the tests that look like they are related to this compiler defect. The only reason this issue (#2620) even exists is that the machine 'white' looks to be the only machine with enough nodes and free time to be able to run an auto PR CUDA build. We really should find (or purchase) better hardware with a more sane compiler and other software, but it is what it is.
I verified the stokhos tests are in fact failing due to the call to the DSTEQR LAPACK function. I downloaded the reference implementation of that function from netlib, renamed the function, changed stokhos to call the renamed function, and compiled it with the rest of stokhos; all of the stokhos tests then passed. So it is definitely some issue with calling routines in the BLAS/LAPACK library on that machine. So I guess you can disable the stokhos tests that fail on this machine for this testing. Please don't disable stokhos overall, since there are a few tests that don't go through this code path and still pass.
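For anyone wanting to cross-check a DSTEQR result independently of the installed library, here is a minimal sketch in plain Python. It does not call LAPACK at all. It only builds a symmetric tridiagonal test matrix (2 on the diagonal, -1 off it) whose eigenpairs are known analytically, and checks the residual of T v - lambda v. The function names `tridiag_residual` and `laplacian_eigenpair` are made up for this illustration and are not part of stokhos or Teuchos. Eigenpairs returned by any DSTEQR build could be passed through the same residual check.

```python
import math

def tridiag_residual(d, e, lam, v):
    """Max-norm of T v - lam v for the symmetric tridiagonal T
    with diagonal d (length n) and off-diagonal e (length n - 1)."""
    n = len(d)
    worst = 0.0
    for i in range(n):
        r = (d[i] - lam) * v[i]
        if i > 0:
            r += e[i - 1] * v[i - 1]
        if i < n - 1:
            r += e[i] * v[i + 1]
        worst = max(worst, abs(r))
    return worst

def laplacian_eigenpair(n, k):
    """Analytic k-th eigenpair (k = 1..n) of the n x n matrix
    with 2 on the diagonal and -1 on the off-diagonals."""
    theta = k * math.pi / (n + 1)
    lam = 2.0 - 2.0 * math.cos(theta)
    v = [math.sin(j * theta) for j in range(1, n + 1)]
    return lam, v

# Hypothetical test problem; n is chosen arbitrarily for illustration.
n = 8
d = [2.0] * n
e = [-1.0] * (n - 1)

# Residuals of the analytic eigenpairs; eigenpairs computed by a
# library routine could be checked the same way.
residuals = [tridiag_residual(d, e, *laplacian_eigenpair(n, k))
             for k in range(1, n + 1)]
max_residual = max(residuals)
print("all residuals small:", max_residual < 1e-12)
```

A large residual for eigenpairs coming back from one LAPACK build, but not from another, would point at the library or the calling convention rather than at the calling code.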
That is interesting. It is possible there is something wrong with some of the LAPACK routines. The compiler bug identified in #1208 was for an optimized build. But that does not mean there is not also a problem with the BLAS/LAPACK installed there. Perhaps that is why the LAPACK unit tests failed as described in #2410? @nmhamster, can we have the Test Bed team install the netlib BLAS and LAPACK on 'ride' and 'white' to see if we can reproduce @etphipp's results described above? This can be in addition to the current optimized BLAS and LAPACK installations.
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
This issue was closed due to inactivity for 395 days.
CC: @trilinos/framework
Description
This issue is a placeholder for cleaning up the builds and tests for all of the Primary Tested Trilinos packages for the 'cuda-debug' build on 'white'/'ride' as described in #2464 (comment).

As of 4/23/2018, the build Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once on 'white' for all 53 of the current Primary Tested Trilinos packages passed the configure but has several build and runtime test failures as shown at:

This shows build failures for the packages:
And it shows test failures for the packages:
Steps to Reproduce
Anyone should be able to reproduce any of these package failures on 'white' (SON) or 'ride' (SRN) as:
For more details, see: