
Clean up builds and tests for all Primary Tested packages for CUDA build on white/ride #2620

Closed
bartlettroscoe opened this issue Apr 23, 2018 · 8 comments
Labels
CLOSED_DUE_TO_INACTIVITY, Framework tasks, MARKED_FOR_CLOSURE

Comments

@bartlettroscoe
Member

CC: @trilinos/framework

Description

This issue is a placeholder for cleaning up the builds and tests for all of the Primary Tested Trilinos packages for the cuda-debug build on white/ride as described in #2464 (comment).

As of 4/23/2018, the build Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once on 'white' for all 53 of the current Primary Tested Trilinos packages passed the configure but had several build failures and test failures, as shown at:

This shows build failures for the packages:

  • ROL (@trilinos/rol)
  • MiniTensor (@lxmota)
  • Zoltan (@trilinos/zoltan)
  • TrilinosCouplings (???)
  • ShyLU_DD (@trilinos/shylu)
  • Domi (@trilinos/domi)

And it shows test failures for the packages:

  • Stokhos (@trilinos/stokhos)
  • ROL
  • Zoltan
  • TrilinosCouplings
  • Piro (@trilinos/piro)
  • FEI (@trilinos/fei)

Steps to Reproduce

Anyone should be able to reproduce any of these package failures on 'white' (SON) or 'ride' (SRN) as follows:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvSettings.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_<PackageName>=ON \
  $TRILINOS_DIR

$ ninja -j16 -k 999999

$ bsub -x -Is -q rhel7F -n 16 ctest -j16

For more details, see:

@bartlettroscoe bartlettroscoe added the Framework tasks Framework tasks (used internally by Framework team) label Apr 23, 2018
@etphipp
Contributor

etphipp commented Apr 25, 2018

I've been looking into the stokhos failures. They appear to be seg faults associated with calls to the LAPACK function STEQR through the Teuchos::LAPACK interface. Were the issues with BLAS/LAPACK ever resolved on this machine?
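
A minimal sketch of rerunning just these tests in isolation from the build directory described in this issue, so the segfault output lands in the test log; the '^Stokhos_' test-name prefix is an assumption about how the Stokhos tests are named:

# Sketch: rerun only the Stokhos tests with verbose failure output.
$ bsub -x -Is -q rhel7F -n 16 \
  ctest -j16 -R '^Stokhos_' --output-on-failure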

@bartlettroscoe
Member Author

I've been looking into the stokhos failures. They appear to be seg faults associated with calls to the LAPACK function STEQR through the Teuchos::LAPACK interface. Were the issues with BLAS/LAPACK ever resolved on this machine?

@etphipp,

As documented in #1208 last year, Tech-X contractors tracked this down to what looks like a compiler defect in passing some arguments between C++ and C or Fortran. Most of those failures appear to occur with compiler optimization turned on, as evidenced by the numerous failures we see in the 'opt' builds on this system, documented in #2454. But we did see a LAPACK test failing even in the 'debug' build, as described in #2410. Given all of those failing Belos and Anasazi tests (which use BLAS and LAPACK) in the 'opt' builds, I don't understand how anyone is using the solvers in Trilinos on this system (but somehow they are).

The solution we were told was simply to wait for the next version of ATS-2, which should be what exists on the new testbed machine 'waterman'. That was the purpose of the placeholder issue #1675. Issues with that system can be discussed in the protected issue:

Given all of that, I don't know that we should try too hard to address failures related to mixed-language calls on the current Power8 systems 'ride' and 'white'.

CC: @nmhamster

@etphipp
Contributor

etphipp commented Apr 25, 2018

So, you're saying I can ignore these failures?

@bartlettroscoe
Member Author

So, you're saying I can ignore these failures?

I am saying that we should disable the tests that look like they are related to this compiler defect.

The only reason this issue (#2620) even exists is that the machine 'white' looks to be the only machine with enough nodes and free time to run an automated PR CUDA build. We really should find (or purchase) better hardware with a saner compiler and other software, but it is what it is.

@etphipp
Contributor

etphipp commented Apr 25, 2018

I verified the Stokhos tests are in fact failing due to the call to the DSTEQR LAPACK function. I downloaded the reference implementation of that function from netlib, renamed the function, changed Stokhos to call the renamed version, and compiled it with the rest of Stokhos; with that change, all of the Stokhos tests passed. So it is definitely some issue with calling routines in the BLAS/LAPACK library on that machine.

So I guess you can disable the Stokhos tests that fail on this machine for this testing. Please don't disable Stokhos overall, since there are a few tests that don't go through this code path and still pass.
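
As a sketch of what that could look like (the test name below is a placeholder, not an actual Stokhos test name), failing tests can either be excluded at run time with a ctest regex or disabled individually at configure time via the per-test _DISABLE cache option that TriBITS provides:

# Exclude at run time (placeholder test name):
$ bsub -x -Is -q rhel7F -n 16 ctest -j16 -E 'Stokhos_<failing_test_name>'

# Or disable at configure time (placeholder test name):
$ cmake -DStokhos_<failing_test_name>_DISABLE=ON <other configure options shown above> $TRILINOS_DIR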

@bartlettroscoe
Member Author

I verified the Stokhos tests are in fact failing due to the call to the DSTEQR LAPACK function. I downloaded the reference implementation of that function from netlib, renamed the function, changed Stokhos to call the renamed version, and compiled it with the rest of Stokhos; with that change, all of the Stokhos tests passed. So it is definitely some issue with calling routines in the BLAS/LAPACK library on that machine.

@etphipp

That is interesting. It is possible there is something wrong with some of the LAPACK routines. The compiler bug identified in #1208 was for an optimized build. But that does not mean there is not also a problem with the BLAS/LAPACK installed there. Perhaps that is why the LAPACK unit tests failed as described in #2410?

@nmhamster, can we have the Test Bed team install the netlib BLAS and LAPACK on 'ride' and 'white' to see if we can reproduce @etphipp's results described above? This can be in addition to the current optimized BLAS and LAPACK installations.
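
If such an installation becomes available, a sketch of pointing the configure from the 'Steps to Reproduce' above at it would be to override the BLAS/LAPACK TPL library variables (the /path/to/netlib install prefix is hypothetical):

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvSettings.cmake \
  -DTPL_BLAS_LIBRARIES=/path/to/netlib/lib/libblas.a \
  -DTPL_LAPACK_LIBRARIES=/path/to/netlib/lib/liblapack.a \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stokhos=ON \
  $TRILINOS_DIR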

@github-actions

github-actions bot commented Jun 5, 2021

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Jun 5, 2021
@github-actions

github-actions bot commented Jul 7, 2021

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Jul 7, 2021
@github-actions github-actions bot closed this as completed Jul 7, 2021