MueLu research and example code build failures for new CUDA ATDM build on hansen/shiller #2319

Closed
bartlettroscoe opened this issue Mar 2, 2018 · 30 comments
Assignees
Labels
client: ATDM Any issue primarily impacting the ATDM project PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area pkg: MueLu type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Mar 2, 2018

CC: @trilinos/muelu, @fryeguy52

Next Action Status:

Commit eee871d sets MueLu_ENABLE_Epetra=OFF, which fixes the build failures.

Description

The MueLu package shows build failures for the CUDA ATDM builds today on hansen shown at:

for the builds:

The build failures for example at:

all show undefined reference link failures like:

CMakeFiles/MueLu_ImportTest.dir/Import.cpp.o: In function `int main_<double, int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >(Teuchos::CommandLineProcessor&, Xpetra::UnderlyingLib, int, char**)':
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp:153: undefined reference to `Tpetra::Map<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Map(unsigned long, Teuchos::ArrayView<int const> const&, int, Teuchos::RCP<Teuchos::Comm<int> const> const&, Teuchos::RCP<Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&)'
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp:154: undefined reference to `Tpetra::Map<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Map(unsigned long, Teuchos::ArrayView<int const> const&, int, Teuchos::RCP<Teuchos::Comm<int> const> const&, Teuchos::RCP<Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&)'
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp:156: undefined reference to `Tpetra::MultiVector<double, int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::MultiVector(Teuchos::RCP<Tpetra::Map<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const> const&, unsigned long, bool)'
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp:157: undefined reference to `Tpetra::MultiVector<double, int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::getDataNonConst(unsigned long)'
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp:167: undefined reference to `Tpetra::MultiVector<double, int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::MultiVector(Teuchos::RCP<Tpetra::Map<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const> const&, unsigned long, bool)'
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp:169: undefined reference to `Tpetra::Export<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Export(Teuchos::RCP<Tpetra::Map<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const> const&, Teuchos::RCP<Tpetra::Map<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const> const&)'
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp:170: undefined reference to `Tpetra::DistObject<double, int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::doExport(Tpetra::SrcDistObject const&, Tpetra::Export<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Tpetra::CombineMode)'
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp:171: undefined reference to `Tpetra::MultiVector<double, int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::getData(unsigned long) const'
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp:169: undefined reference to `Tpetra::Export<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::~Export()'
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp:167: undefined reference to `Tpetra::MultiVector<double, int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::~MultiVector()'
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp:156: undefined reference to `Tpetra::MultiVector<double, int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::~MultiVector()'
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp:169: undefined reference to `Tpetra::Export<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::~Export()'
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp:167: undefined reference to `Tpetra::MultiVector<double, int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::~MultiVector()'
/home/jenkins/hansen/workspace/Trilinos-atdm-hansen-shiller-cuda-debug/SRC_AND_BUILD/Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp:156: undefined reference to `Tpetra::MultiVector<double, int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::~MultiVector()'
collect2: error: ld returned 1 exit status

but each executable has a slightly different set of link failures.

It looks like some explicit template instantiations are missing?

Steps to Reproduce:

The instructions to reproduce these build failures can be found starting at:

and clicking "Reproducing ATDM builds locally" which takes you to:

Basically, on hansen or shiller, you just clone the Trilinos repo (with location depicted as $TRILINOS_DIR below) and check out the develop branch. Then create a build directory and do the configure and build as:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-opt

$ cmake \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON \
  $TRILINOS_DIR

$ make -j16
@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: NOX client: ATDM Any issue primarily impacting the ATDM project labels Mar 2, 2018
@jhux2
Member

jhux2 commented Mar 2, 2018

MueLu's research code directory should not be enabled by default. This is likely caused by insufficient cmake guards.

@bartlettroscoe
Member Author

MueLu's research code directory should not be enabled by default. This is likely caused by insufficient cmake guards.

@jhux2,

Okay, just let me know what tweaks are needed to build just the tests and examples you want.

But note that EMPIRE is setting MueLu_ENABLE_Experimental=ON, so if there are tests in MueLu needed to test and support that functionality, then they should be built and run on the various ATDM platforms. If you think that ATDM apps should not be using MueLu_ENABLE_Experimental=ON, then that is a conversation that you need to have with them. But as long as they are using it, it needs to be supported and kept stable like any other piece of code in Trilinos that they are using.

@mayrmt
Member

mayrmt commented Mar 2, 2018

I checked in Trilinos/packages/muelu/research/luc/region_algorithms/Import.cpp accidentally. We can just revert the commit 506af3b since we don't need this anyway.

@mayrmt
Member

mayrmt commented Mar 2, 2018

@bartlettroscoe How do I best revert this commit?

  • Reverting it locally and then pushing to Trilinos/develop with the checkin-script?
  • Reverting it locally and issuing a pull request?

@bartlettroscoe
Member Author

How do I best revert this commit?

Did you push this to 'develop' yet? If you only committed it locally, then you can remove it in various ways locally as described:

Otherwise, send me email and we can converse there.

@mayrmt
Member

mayrmt commented Mar 3, 2018

Fix has been merged via PR #2326. Not closing yet to make sure that this actually cured the problem.

@bartlettroscoe Can you confirm that this is resolved and then close this issue? Thank you!

@bartlettroscoe
Member Author

Fix has been merged via PR #2326. Not closing yet to make sure that this actually cured the problem.

@mayrmt,

Given the six link failures that were shown on CDash, I don't think this one PR #2326 will fix all of them but I sure hope I am wrong :-)

@bartlettroscoe Can you confirm that this is resolved and then close this issue? Thank you!

We will take a look at the ATDM builds dashboard tomorrow and that will tell. I am looking over those builds every day as we get them cleaned up, so if the problem goes away, I will close this issue.

Thanks!

@bartlettroscoe
Member Author

@mayrmt,

It looks like your new commit 7a945e6 that was pushed on Saturday and pulled on Sunday as shown at:

removed the build error for the file packages/muelu/research/luc/region_algorithms/Import.cpp but the other link errors remain as shown in automated testing at:

Can someone take a look at these? I think anyone with access to SRN or SON machines shiller or hansen can reproduce these failures as described in the "Steps to Reproduce" above.

@bartlettroscoe
Member Author

Also note that even if you exclude the 8 "Not Run" tests for the missing executables that would not link, there are still 23 failing MueLu tests for these CUDA builds. I was going to wait until these build failures were fixed before posting new GitHub issues for those failures but some MueLu developer might want to look into those too. Those can be seen at:

@mhoemmen
Contributor

mhoemmen commented Mar 5, 2018

If folks are gone this week at SIAM PP, I can help, just not today so much.

@jhux2
Member

jhux2 commented Mar 5, 2018

@bartlettroscoe I'm not surprised at these failures, as MueLu has an experimental track CUDA build that hasn't been clean for quite a while. Could you temporarily disable these for the ATDM build until the MueLu team has time to look at these? (I am facing a couple March conference/milestone deadlines.)

@bartlettroscoe
Member Author

I'm not surprised at these failures, as MueLu has an experimental track CUDA build that hasn't been clean for quite a while. Could you temporarily disable these for the ATDM build until the MueLu team has time to look at these? (I am facing a couple March conference/milestone deadlines.)

@jhux2, okay, I will disable as little as I can to make everything pass. I will also disable the failing tests. Then someone on the MueLu team can log onto shiller or hansen and work out all of these issues when they have time.

@bartlettroscoe
Member Author

@bathmatt, @jhux2, @mhoemmen, and @srajama1,

I was wrong, I did create a GitHub issue for these MueLu build failures. Could this be related to the build failures that @bathmatt reported for EMPIRE? It looks like some explicit instantiations are missing.

@jhux2
Member

jhux2 commented Mar 13, 2018

I was wrong, I did create a GitHub issue for these MueLu build failures. Could this be related to the build failures that @bathmatt reported for EMPIRE? It looks like some explicit instantiations are missing.

@bartlettroscoe Yes, in fact the very error @bathmatt reported also appears on the dashboard.

By the way, I see this output during the configure process

NOTE: Kokkos::Serial is ON (the CMake option Kokkos_ENABLE_Serial is ON), but the corresponding Tpetra Node type is disabled.  If you want to enable instantiation and use of Kokkos::Serial in Tpetra, please also set the CMake option Tpetra_INST_SERIAL:BOOL=ON.  If you use the Kokkos::Serial Node type in Tpetra without doing this, you will get link errors!
-- Tpetra execution space availability (ON means available): 
--   - Serial:  OFF
--   - Threads: OFF
--   - OpenMP:  OFF
--   - Cuda:    ON

The link errors refer to symbols templated on Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, so the cmake message might be relevant.
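
To make the mechanism concrete, here is a minimal single-file C++ sketch of explicit template instantiation. It is an illustration under stated assumptions, not Tpetra code: `Map`, `CudaNode`, `SerialNode`, and the `DEMO_INST_*` macros are hypothetical stand-ins, with `DEMO_INST_SERIAL` playing roughly the role that Tpetra_INST_SERIAL plays at configure time. Symbols exist only for node types that are explicitly instantiated, so code that names any other node type compiles against the declarations but has nothing to link against.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for Kokkos node types (not the real classes).
struct CudaNode   { static std::string name() { return "Cuda"; } };
struct SerialNode { static std::string name() { return "Serial"; } };

// Analogue of the configure-time switches: only Cuda is instantiated.
#define DEMO_INST_CUDA 1
// #define DEMO_INST_SERIAL 1  // analogue of Tpetra_INST_SERIAL=ON

// --- What a header would contain: declarations only ---
template <class Node>
class Map {
public:
  explicit Map(int numLocal);
  std::string describe() const;
private:
  int numLocal_;
};

// Explicit instantiation declarations: users must not instantiate
// these themselves; they rely on the library providing the symbols.
extern template class Map<CudaNode>;
extern template class Map<SerialNode>;

// --- What a library source file would contain: definitions ---
template <class Node>
Map<Node>::Map(int numLocal) : numLocal_(numLocal) {
  assert(numLocal >= 0);
}

template <class Node>
std::string Map<Node>::describe() const {
  return "Map<" + Node::name() + ">(" + std::to_string(numLocal_) + ")";
}

#ifdef DEMO_INST_CUDA
template class Map<CudaNode>;    // symbols for the Cuda node are emitted
#endif
#ifdef DEMO_INST_SERIAL
template class Map<SerialNode>;  // emitted only when the switch is on
#endif

// --- What downstream code would contain ---
std::string describeOnCuda(int n) { return Map<CudaNode>(n).describe(); }

#ifdef DEMO_INST_SERIAL
// This guard is the point of the sketch: the Serial node type is used
// only when its instantiation was actually requested.
std::string describeOnSerial(int n) { return Map<SerialNode>(n).describe(); }
#endif
```

With `DEMO_INST_SERIAL` left undefined, `describeOnCuda(4)` returns `"Map<Cuda>(4)"` and `describeOnSerial` is not compiled at all. Removing the guard around `describeOnSerial` would be expected to compile and then typically fail at link time with undefined references to `Map<SerialNode>` members, which is the same shape of failure as the errors in this issue.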

@bartlettroscoe
Member Author

By the way, I see this output during the configure process

NOTE: Kokkos::Serial is ON (the CMake option Kokkos_ENABLE_Serial is ON), but the corresponding Tpetra Node type is disabled.  If you want to enable instantiation and use of Kokkos::Serial in Tpetra, please also set the CMake option Tpetra_INST_SERIAL:BOOL=ON.  If you use the Kokkos::Serial Node type in Tpetra without doing this, you will get link errors!
-- Tpetra execution space availability (ON means available): 
--   - Serial:  OFF
--   - Threads: OFF
--   - OpenMP:  OFF
--   - Cuda:    ON

The link errors refer to symbols templated on Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, so the cmake message might be relevant.

@trilinos/tpetra developers,

Can we set up Tpetra to automatically enable Tpetra_INST_SERIAL:BOOL=ON by default instead of just printing out this warning message? In general, we need to be looking for ways to make it easier for users to configure Trilinos correctly by just setting obvious options.

Otherwise, I will try enabling this option for the CUDA build on hansen/shiller and see if this makes the link errors go away.

@jhux2
Member

jhux2 commented Mar 13, 2018

@bartlettroscoe What bothers me is that MueLu should not require that node type. Enabling Tpetra_INST_SERIAL:BOOL=ON might fix this error, but doesn't address the underlying problem. (The build time and executable size will go up, too.)

@bartlettroscoe
Member Author

What bothers me is that MueLu should not require that node type. Enabling Tpetra_INST_SERIAL:BOOL=ON might fix this error, but doesn't address the underlying problem. (The build time and executable size will go up, too.)

@jhux2,

Somehow this is getting manifested in the MueLu test suite itself so one only needs to look at how MueLu is using upstream packages and what those packages are doing. If you look at the link failure, for example, today at:

you see undefined reference errors coming directly from MueLu source files as well as from Tpetra files and other packages.

Can some MueLu developer log into 'hansen' or 'shiller' and see if they can reproduce these link failures as described above? It should only take a few minutes of person-time to get the build going.

Otherwise, I will let you know what happens with setting Tpetra_INST_SERIAL:BOOL=ON for these CUDA builds.

@jhux2
Member

jhux2 commented Mar 13, 2018

I'm on shiller now, diagnosing the configure process. I believe that's where the error is.

@jhux2
Member

jhux2 commented Mar 14, 2018

@bartlettroscoe I tested two options:

  1. -DTpetra_ENABLE_Epetra:BOOL=OFF. This fixed the error. However, EMPIRE requires Panzer, which requires Epetra, so this is not a workable option.
  2. -DTpetra_INST_SERIAL:BOOL=ON. This fixed the error, at the cost of increasing the build time and probably the size of any executables.

For now, can you enable option 2? Longer term, we may be able to relax this requirement in MueLu.

@mhoemmen
Contributor

@bartlettroscoe wrote:

Can we set up Tpetra to automatically enable Tpetra_INST_SERIAL:BOOL=ON by default instead of just printing out this warning message? In general, we need to be looking for ways to make it easier for users to configure Trilinos correctly by just setting obvious options.

This used to be ON by default, but that changed as part of Kokkos' refactor of CMake / Makefile options. Changing it back would increase the build time quite a bit. Is the problem that downstream packages don't do ETI correctly? That's an issue very much like #74 in that what looks like an easy CMake option to set, actually increases the build time a lot and does not help generality.

@bartlettroscoe
Member Author

  1. -DTpetra_INST_SERIAL:BOOL=ON. This fixed the error, at the cost of increasing the build time and probably the size of any executables.

For now, can you enable option 2? Longer term, we may be able to relax this requirement in MueLu.

@jhux2,

I also confirmed that setting Tpetra_INST_SERIAL=ON fixed the build of the MueLu tests and examples. I will go ahead and push that updated ATDMDevEnv.cmake file.

This used to be ON by default, but that changed as part of Kokkos' refactor of CMake / Makefile options. Changing it back would increase the build time quite a bit.

@mhoemmen,

So the deal is that as long as Epetra is not enabled then you don't need these instantiations?

Note that we could add some specialized logic to the file Trilinos/cmake/ProjectCompilerPostConfigure.cmake that could turn on Tpetra_INST_SERIAL=ON when it detects that MueLu Tpetra and Epetra support are enabled. As shown at:

that file gets processed after the final set of enables and disables are determined and therefore, it would have all of the info needed to determine when this needed to be enabled. That would save users from having to figure this out on their own. Again, anything we can do to make Trilinos configure correctly for the requested user configuration will go a long way to reducing the reputation that Trilinos is hard to build (which I hear a lot).

So what is the right set of enable variables to look for in order to set Tpetra_INST_SERIAL=ON? This will result in this getting enabled only when it needs to be.

@mhoemmen
Contributor

@bartlettroscoe I don't want to hinder this process; I just want developers to be aware that enabling stuff that most users don't need or (shouldn't) use, might be easy to do, but increases build times and sizes. I don't want developers to get complacent about that. If we decide for now not to fix it, that's fine, but that needs to be a conscious choice.

@mhoemmen
Contributor

In summary: Please go ahead and do what you need to do, but be aware that we're building more than we need.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 14, 2018
The enable of Tpetra_INST_SERIAL=ON may fix many of these build errors.
@bartlettroscoe
Member Author

In summary: Please go ahead and do what you need to do, but be aware that we're building more than we need.

I would rather build less. But given the option of having a build fail with link failures or building more than the user really needs (but building what they are actually asking for), I think we should err on the side of having the build succeed.

Could problems like this be avoided if MueLu and other packages were better broken up into subpackages? For example, if Panzer only needs the Tpetra adapters from MueLu, and if MueLu was broken up into subpackages MueLuCore, MueLuEpetra and MueLuTpetra, then if Panzer only defined a dependency on MueLuCore and MueLuTpetra, the Epetra adapters would never get built and this problem would not exist.

@tawiesn
Contributor

tawiesn commented Mar 14, 2018

@bartlettroscoe @jhux2

Could problems like this be avoided if MueLu and other packages were better broken up into subpackages?

No, this would not help, at least not for MueLu, since the Xpetra package already provides a clean separation of the Epetra- and Tpetra-specific code. On the contrary, it would break one of the core philosophies of MueLu: to be independent of the underlying linear algebra (either Epetra or Tpetra (+ Kokkos as option)). The problem is that people always start using Tpetra in MueLu directly (instead of using Xpetra). Xpetra is meant to deal with the guards and correct instantiations. The problem with Xpetra is that it needs some more work to enable (or write stubs for) the fancy new features in Tpetra, and not all developers are willing or able to invest that additional time and effort. But it's crucial to understand that we do not want to break MueLu apart into two independent pieces (the Epetra and Tpetra parts). MueLu provides implementations of rather general multigrid algorithms independent of Epetra and Tpetra.
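
The role described for Xpetra here can be sketched in a few lines of C++. This is a hypothetical illustration of the wrapper idea, not the Xpetra API: the `Vector` interface, the `TpetraLikeVector` and `EpetraLikeVector` classes, the `UnderlyingLib` values, and the `DEMO_HAVE_*` macros are all made-up names. The point is that the preprocessor guards live in one place (the factory), so algorithm code never names a backend type that might not have been built.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical configure-time switches (stand-ins for backend enables).
#define DEMO_HAVE_TPETRA_LIKE 1
// #define DEMO_HAVE_EPETRA_LIKE 1

enum class UnderlyingLib { EpetraLike, TpetraLike };

// Backend-neutral interface: algorithm code is written only against this.
class Vector {
public:
  virtual ~Vector() {}
  virtual std::string backend() const = 0;
};

#ifdef DEMO_HAVE_TPETRA_LIKE
class TpetraLikeVector : public Vector {
public:
  std::string backend() const override { return "TpetraLike"; }
};
#endif

#ifdef DEMO_HAVE_EPETRA_LIKE
class EpetraLikeVector : public Vector {
public:
  std::string backend() const override { return "EpetraLike"; }
};
#endif

// The only function that knows which backends were compiled in.
std::unique_ptr<Vector> makeVector(UnderlyingLib lib) {
#ifdef DEMO_HAVE_TPETRA_LIKE
  if (lib == UnderlyingLib::TpetraLike)
    return std::unique_ptr<Vector>(new TpetraLikeVector());
#endif
#ifdef DEMO_HAVE_EPETRA_LIKE
  if (lib == UnderlyingLib::EpetraLike)
    return std::unique_ptr<Vector>(new EpetraLikeVector());
#endif
  throw std::runtime_error("requested backend was not enabled in this build");
}

// Backend-neutral "algorithm": it never names a concrete backend type,
// so it builds unchanged whichever backends are enabled.
std::string describe(UnderlyingLib lib) {
  std::unique_ptr<Vector> v = makeVector(lib);
  assert(v);
  return "vector on " + v->backend();
}
```

In this sketch, `describe(UnderlyingLib::TpetraLike)` returns `"vector on TpetraLike"`, while requesting the disabled backend throws a runtime error instead of producing a link failure. Calling a backend class directly from algorithm code would bypass the factory and bring the guard problem back.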

@tawiesn
Contributor

tawiesn commented Mar 14, 2018

@bartlettroscoe @jhux2
For example: I just found this comment in the source code of the ProjectorSmootherFactory:

// TAW: Oct 16 2015: subCopy is not part of Xpetra. One should either add it to Xpetra
// or replace this call by a local loop. I'm not motivated to do this now...

We only misuse Tpetra directly (instead of Xpetra) since in Xpetra we have no routine "subCopy", yet (which is available in Tpetra but not Epetra and therefore also not in Xpetra). The right solution would be to either avoid that function call (by doing it locally by hand) or add that functionality to Xpetra. Then we would not have such linker problems. It seems that this code has not been touched for more than two years. Obviously nobody was interested in that feature too much. Maybe one should just delete these algorithms or move them into a separate optional subpackage of all non-maintained code.
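
The "local loop" alternative mentioned in that source comment can be sketched as follows. This is a minimal illustration that assumes subCopy means "deep-copy a chosen set of columns into a new multivector"; `DenseMultiVector` and `copyColumns` are hypothetical stand-ins, not Xpetra or Tpetra classes. A real replacement would loop over the local data of an Xpetra multivector in the same way.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a multivector: numRows x numCols values,
// stored column by column.
struct DenseMultiVector {
  std::size_t numRows;
  std::size_t numCols;
  std::vector<double> values;  // entry (row, col) is values[row + col * numRows]

  DenseMultiVector(std::size_t rows, std::size_t cols)
      : numRows(rows), numCols(cols), values(rows * cols, 0.0) {}

  double& operator()(std::size_t row, std::size_t col) {
    return values[row + col * numRows];
  }
  double operator()(std::size_t row, std::size_t col) const {
    return values[row + col * numRows];
  }
};

// Deep-copies the selected columns of src into a new multivector,
// using nothing but element access (no backend-specific call).
DenseMultiVector copyColumns(const DenseMultiVector& src,
                             const std::vector<std::size_t>& cols) {
  DenseMultiVector dst(src.numRows, cols.size());
  for (std::size_t j = 0; j < cols.size(); ++j) {
    assert(cols[j] < src.numCols);  // each requested column must exist
    for (std::size_t i = 0; i < src.numRows; ++i) {
      dst(i, j) = src(i, cols[j]);
    }
  }
  return dst;
}
```

For a 2 x 3 input, `copyColumns(x, {2, 0})` yields a 2 x 2 result whose first column is a copy of column 2 of `x` and whose second column is a copy of column 0. Because only element access is used, the same loop works for any backend that exposes its local entries.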

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 15, 2018
…s#2319, TRIL-171)

This should fix the MueLu build failures with CUDA reported in trilinos#2319.

I also removed setting Tpetra_INST_SERIAL=ON.  This should cut down on the
build times over setting Tpetra_INST_SERIAL=ON.
@bartlettroscoe
Member Author

bartlettroscoe commented Mar 15, 2018

I pushed the commit eee871d which sets MueLu_ENABLE_Epetra=OFF instead of Tpetra_INST_SERIAL=ON and it passed all of the builds, including all of the CUDA builds. This also cleared up all of the failing MueLu tests (except for one build and that looks to be a separate issue).

Also, using MueLu_ENABLE_Epetra=OFF instead of Tpetra_INST_SERIAL=ON cut the cumulative package-by-package build time from 3h9m43s to 2h53m65s, so the savings are not insignificant.

@bathmatt, is it acceptable for EMPIRE if MueLu_ENABLE_Epetra=OFF is set? Does EMPIRE need Epetra support under MueLu?

DETAILS

Last night I pushed the commit eee871d:

commit eee871d803e2d0a60c60710071f920365687fdcb
Author: Roscoe A. Bartlett <[email protected]>
Date:   Wed Mar 14 17:34:01 2018 -0600

    Set MueLu_ENABLE_Epertra=OFF to fix MueLu CUDA link failures (#2319, TRIL-171)
    
    This should fix the MueLu build failures with CUDA reported in #2319.
    
    I also removed setting Tpetra_INST_SERIAL=ON.  This should cut down on the
    build times over setting Tpetra_INST_SERIAL=ON.

diff --git a/cmake/std/atdm/ATDMDevEnv.cmake b/cmake/std/atdm/ATDMDevEnv.cmake
index 39a8390..53ac0d8 100644
--- a/cmake/std/atdm/ATDMDevEnv.cmake
+++ b/cmake/std/atdm/ATDMDevEnv.cmake
@@ -123,12 +123,10 @@ ATDM_SET_CACHE(Kokkos_ENABLE_Debug_Bounds_Check "${ATDM_BOUNDS_CHECK}" CACHE BOO
 ATDM_SET_CACHE(KOKKOS_ARCH "$ENV{ATDM_CONFIG_KOKKOS_ARCH}" CACHE STRING)
 ATDM_SET_CACHE(EpetraExt_ENABLE_HDF5 OFF CACHE BOOL)
 ATDM_SET_CACHE(MueLu_ENABLE_Experimental ON CACHE BOOL)
+ATDM_SET_CACHE(MueLu_ENABLE_Epetra OFF CACHE BOOL)
 ATDM_SET_CACHE(Panzer_ENABLE_FADTYPE "Sacado::Fad::DFad<RealType>" CACHE STRING)
 ATDM_SET_CACHE(Phalanx_KOKKOS_DEVICE_TYPE "${ATDM_NODE_TYPE}" CACHE STRING)
 ATDM_SET_CACHE(Phalanx_SHOW_DEPRECATED_WARNINGS OFF CACHE BOOL)
-IF (ATDM_USE_CUDA)
-  ATDM_SET_CACHE(Tpetra_INST_SERIAL "${ATDM_USE_CUDA}" CACHE BOOL)
-ENDIF()
 ATDM_SET_CACHE(Tpetra_INST_CUDA "${ATDM_USE_CUDA}" CACHE BOOL)
 ATDM_SET_CACHE(Xpetra_ENABLE_Experimental ON CACHE BOOL)

This resulted in all passing builds for MueLu on all of the platforms, including all of the CUDA builds as shown at:

The only test failures were for the build Trilinos-atdm-white-ride-cuda-opt on white and that looks to be associated with how the tests are being run and not a MueLu problem at this point. (I will create another GitHub issue to look into that problem.)

Note that all of the Panzer tests on CUDA passed as well as shown at:

Therefore, disabling Epetra support in MueLu does not seem to impact Panzer tests at all (there have been 116 Panzer tests run for these CUDA builds for the last several days). Therefore, hopefully this would be okay for EMPIRE?

It is also interesting to see the impact this has on the build times for the MueLu build with CUDA. Looking at the build Trilinos-atdm-hansen-shiller-cuda-opt on shiller over the last week at:

The build two days ago on 2018-03-13 was failing and it took 2h50m17s. Then yesterday, the build passed using Tpetra_INST_SERIAL=ON and it took 3h9m43s. And then today using MueLu_ENABLE_Epetra=OFF (and Tpetra_INST_SERIAL=OFF implicitly set), it took 2h53m65s. Therefore, using MueLu_ENABLE_Epetra=OFF instead of Tpetra_INST_SERIAL=ON looks to have shaved about 16m off a build that takes about 3 hours (using a package-by-package build, so build times are a bit inflated). But that is not too bad. And you can't really compare the build time to the case where the build failed, because a bunch of executables aborted their link because they were missing link symbols.
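
The 16m figure can be checked directly from the times quoted above. A small sketch, in which the function and constant names are made up for illustration and the 65 in the seconds field is taken exactly as reported:

```cpp
#include <cassert>

// Converts hours/minutes/seconds to seconds. A seconds field above 59
// (as in the reported 2h53m65s) is simply added as-is.
constexpr long toSeconds(long h, long m, long s) {
  return h * 3600 + m * 60 + s;
}

// Times quoted above for the build Trilinos-atdm-hansen-shiller-cuda-opt.
constexpr long kInstSerialOn   = toSeconds(3, 9, 43);   // Tpetra_INST_SERIAL=ON
constexpr long kMueLuEpetraOff = toSeconds(2, 53, 65);  // MueLu_ENABLE_Epetra=OFF
constexpr long kSavedSeconds   = kInstSerialOn - kMueLuEpetraOff;

// Rounds a non-negative duration in seconds to the nearest whole minute.
long roundToMinutes(long seconds) {
  assert(seconds >= 0);
  return (seconds + 30) / 60;
}
```

Here `kSavedSeconds` is 938 (15m38s) and `roundToMinutes(kSavedSeconds)` gives 16, which is roughly 8% of the 3h9m43s build.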

@bartlettroscoe bartlettroscoe added stage: in review Primary work is completed and now is just waiting for human review and/or test feedback and removed stage: in progress Work on the issue has started labels Mar 15, 2018
kyungjoo-kim pushed a commit to kyungjoo-kim/Trilinos that referenced this issue Mar 16, 2018
This addresses all of the MueLu test and example build failures reported in
kyungjoo-kim pushed a commit to kyungjoo-kim/Trilinos that referenced this issue Mar 16, 2018
…s#2319, TRIL-171)

This should fix the MueLu build failures with CUDA reported in trilinos#2319.

I also removed setting Tpetra_INST_SERIAL=ON.  This should cut down on the
build times over setting Tpetra_INST_SERIAL=ON.
bartlettroscoe added a commit that referenced this issue Mar 20, 2018
Turns out that some of these Panzer examples test behavior that EMPIRE needs.
Therefore, we need to get them working.  Therefore, to help get these fixed,
and since the rest of the cuda build is not clean yet, we need to turn these
back on.

Note that this should not enable the Panzer examples for the special "-panzer"
builds since the CTest -S driver script will explicitly disable the Panzer
examples for those builds.

Build and test results on 'shiller' show below (the build passes but there are
still some failing test).  These builds are not promoted to the "ATDM"
Group/Track yet so this will not spam any one with CDash error emails.

Enabled Packages: Panzer

Build test results:
-------------------
1) cuda-opt => FAILED: passed=150,notpassed=3 => Not ready to push! (55.40 min)
2) cuda-debug => FAILED: passed=151,notpassed=2 => Not ready to push! (66.06 min)
@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests and removed type: bug The primary issue is a bug in Trilinos code or tests labels Mar 20, 2018
@bartlettroscoe
Member Author

I talked with @bathmatt yesterday and he confirmed that EMPIRE does not need Epetra support under MueLu. Therefore, this is resolved and I am closing this as completed.

@bartlettroscoe bartlettroscoe removed the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Mar 26, 2018
ndellingwood added a commit to ndellingwood/Trilinos that referenced this issue May 8, 2018
Set MueLu_ENABLE_Epetra=OFF
See issue trilinos#2319 for discussion regarding this setting.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Oct 24, 2018
…nos#2674, trilinos#2319)

The CUDA bulid for MueLu was fixed so this disable should not be needed
anymore.
trilinos-autotester added a commit that referenced this issue Oct 24, 2018
…lu-disable-epetra

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Remove MueLu_ENABLE_Epetra=OFF for EMPIRE ATDM Trilinos config (#2674, #2319)
PR Author: bartlettroscoe
@bartlettroscoe bartlettroscoe added the PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area label Nov 30, 2018
tjfulle pushed a commit to tjfulle/Trilinos that referenced this issue Dec 6, 2018
…nos#2674, trilinos#2319)

The CUDA bulid for MueLu was fixed so this disable should not be needed
anymore.

5 participants