
Clean up builds and tests for all Primary Tested packages for CUDA build on white/ride #2620

Closed
bartlettroscoe opened this issue Apr 23, 2018 · 8 comments
Labels
CLOSED_DUE_TO_INACTIVITY, Framework tasks, MARKED_FOR_CLOSURE

Comments

@bartlettroscoe
Member

CC: @trilinos/framework

Description

This issue is a placeholder for cleaning up the builds and tests for all of the Primary Tested Trilinos packages for the cuda-debug build on white/ride as described in #2464 (comment).

As of 4/23/2018, the build Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once on 'white' for all 53 of the current Primary Tested Trilinos packages passed the configure but had several build failures and test failures, as shown at:

This shows build failures for the packages:

  • ROL (@trilinos/rol)
  • MiniTensor (@lxmota)
  • Zoltan (@trilinos/zoltan)
  • TrilinosCouplings (???)
  • ShyLU_DD (@trilinos/shylu)
  • Domi (@trilinos/domi)

And it shows test failures for the packages:

  • Stokhos (@trilinos/stokhos)
  • ROL
  • Zoltan
  • TrilinosCouplings
  • Piro (@trilinos/piro)
  • FEI (@trilinos/fei)

Steps to Reproduce

Anyone should be able to reproduce any of these package failures on 'white' (SON) or 'ride' (SRN) as follows:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvSettings.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_<PackageName>=ON \
  $TRILINOS_DIR

$ ninja -j16 -k 999999

$ bsub -x -Is -q rhel7F -n 16 ctest -j16

For more details, see:

@bartlettroscoe bartlettroscoe added the Framework tasks Framework tasks (used internally by Framework team) label Apr 23, 2018
@etphipp
Contributor

etphipp commented Apr 25, 2018

I've been looking into the stokhos failures. They appear to be seg faults associated with calls to the LAPACK function STEQR through the Teuchos::LAPACK interface. Were the issues with BLAS/LAPACK ever resolved on this machine?
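
A minimal sketch of rerunning just these tests in isolation from the build directory described in this issue, so the segfault output lands in the test log; the '^Stokhos_' test-name prefix is an assumption about how the Stokhos tests are named:

# Sketch: rerun only the Stokhos tests with verbose failure output.
$ bsub -x -Is -q rhel7F -n 16 \
  ctest -j16 -R '^Stokhos_' --output-on-failure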

@bartlettroscoe
Member Author

I've been looking into the stokhos failures. They appear to be seg faults associated with calls to the LAPACK function STEQR through the Teuchos::LAPACK interface. Were the issues with BLAS/LAPACK ever resolved on this machine?

@etphipp,

As documented in #1208 last year, Tech-X contractors tracked this down to what looks like a compiler defect in passing some arguments between C++ and C or Fortran. Most of those failures appear to occur with compiler optimization turned on, as evidenced by the numerous failures we see in the 'opt' builds on this system, documented in #2454. But we did see a LAPACK test failing even in the 'debug' build, as described in #2410. Given all of those failing Belos and Anasazi tests (which use BLAS and LAPACK) in the 'opt' builds, I don't understand how anyone is using the solvers in Trilinos on this system (but somehow they are).

The solution we were told was simply to wait for the next version of ATS-2, which should be what exists on the new testbed machine 'waterman'. That was the purpose of the placeholder issue #1675. Issues with that system can be discussed in the protected issue:

Given all of that, I don't know that we should try too hard to address failures related to mixed-language calls on the current Power8 systems 'ride' and 'white'.

CC: @nmhamster

@etphipp
Contributor

etphipp commented Apr 25, 2018

So, you're saying I can ignore these failures?

@bartlettroscoe
Member Author

So, you're saying I can ignore these failures?

I am saying that we should disable the tests that look like they are related to this compiler defect.

The only reason this issue (#2620) even exists is that the machine 'white' looks to be the only machine with enough nodes and free time to run an automated PR CUDA build. We really should find (or purchase) better hardware with a saner compiler and other software, but it is what it is.

@etphipp
Contributor

etphipp commented Apr 25, 2018

I verified the Stokhos tests are in fact failing due to the call to the DSTEQR LAPACK function. I downloaded the reference implementation of that function from netlib, renamed the function, changed Stokhos to call the renamed version, and compiled it with the rest of Stokhos; with that change, all of the Stokhos tests passed. So it is definitely some issue with calling routines in the BLAS/LAPACK library on that machine.

So I guess you can disable the Stokhos tests that fail on this machine for this testing. Please don't disable Stokhos overall, since there are a few tests that don't go through this code path and still pass.
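
As a sketch of what that could look like (the test name below is a placeholder, not an actual Stokhos test name), failing tests can either be excluded at run time with a ctest regex or disabled individually at configure time via the per-test _DISABLE cache option that TriBITS provides:

# Exclude at run time (placeholder test name):
$ bsub -x -Is -q rhel7F -n 16 ctest -j16 -E 'Stokhos_<failing_test_name>'

# Or disable at configure time (placeholder test name):
$ cmake -DStokhos_<failing_test_name>_DISABLE=ON <other configure options shown above> $TRILINOS_DIR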

@bartlettroscoe
Member Author

I verified the Stokhos tests are in fact failing due to the call to the DSTEQR LAPACK function. I downloaded the reference implementation of that function from netlib, renamed the function, changed Stokhos to call the renamed version, and compiled it with the rest of Stokhos; with that change, all of the Stokhos tests passed. So it is definitely some issue with calling routines in the BLAS/LAPACK library on that machine.

@etphipp

That is interesting. It is possible there is something wrong with some of the LAPACK routines. The compiler bug identified in #1208 was for an optimized build. But that does not mean there is not also a problem with the BLAS/LAPACK installed there. Perhaps that is why the LAPACK unit tests failed as described in #2410?

@nmhamster, can we have the Test Bed team install the netlib BLAS and LAPACK on 'ride' and 'white' to see if we can reproduce @etphipp's results described above? This can be in addition to the current optimized BLAS and LAPACK installations.
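
If such an installation becomes available, a sketch of pointing the configure from the 'Steps to Reproduce' above at it would be to override the BLAS/LAPACK TPL library variables (the /path/to/netlib install prefix is hypothetical):

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvSettings.cmake \
  -DTPL_BLAS_LIBRARIES=/path/to/netlib/lib/libblas.a \
  -DTPL_LAPACK_LIBRARIES=/path/to/netlib/lib/liblapack.a \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stokhos=ON \
  $TRILINOS_DIR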

@github-actions

github-actions bot commented Jun 5, 2021

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Jun 5, 2021
@github-actions

github-actions bot commented Jul 7, 2021

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Jul 7, 2021
@github-actions github-actions bot closed this as completed Jul 7, 2021