-
Notifications
You must be signed in to change notification settings - Fork 578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MueLu research and example code build failures for new CUDA ATDM build on hansen/shiller #2319
Comments
MueLu's research code directory should not be enabled by default. This is likely caused by insufficient cmake guards. |
Okay, just let me know what tweaks are needed to build just the tests and examples you want. But note that EMPIRE is setting |
I checked |
@bartlettroscoe How do I best revert this commit?
|
Did you push this to 'develop' yet? If you only committed it locally, then you can remove it in various ways locally as described: Otherwise, send me email and we can converse there. |
Fix has been merged via PR #2326. Not closing yet to make sure that this actually cured the problem. @bartlettroscoe Can you confirm that this is resolved and then close this issue? Thank you! |
Given the six link failures that were shown on CDash, I don't think this one PR #2326 will fix all of them but I sure hope I am wrong :-)
We will take a look at the ATDM builds dashboard tomorrow and that will be the telling. I am looking over the those builds every day as we get them cleaned up so if the problem goes away, I will close this issue. Thanks! |
It looks like your new commit 7a945e6 that was pushed on Saturday and pulled on Sunday as shown at: removed the build error for the file
Can someone take a look at these? I think anyone with access to SRN or SON machines |
Also note that even if you exclude the 8 "Not Run" tests for the missing executables that would not link, there are still 23 failing MueLu tests for these CUDA builds. I was going to wait until these build failures were fixed before posting new GitHub issues for those failures but some MueLu developer might want to look into those too. Those can be see at: |
If folks are gone this week at SIAM PP, I can help, just not today so much. |
@bartlettroscoe I'm not surprised at these failures, as MueLu has an experimental track CUDA build that hasn't been clean for quite a while. Could you temporarily disable these for the ATDM build until the MueLu team has time to look at these? (I am facing a couple March conference/milestone deadlines.) |
@jhux2, okay, I will disable is little as I can to make everything pass. I will also disable the failing tests as well. Then someone on the MueLu team can log onto the |
@bartlettroscoe Yes, in fact the very error @bathmatt reported also appears on the dashboard. By the way, I see this output during the configure process
The link errors refer to symbols templated on |
@trilinos/tpetra developers, Can we set up Tpetra to automatically enable Otherwise, I will try enabling this option for the CUDA build on hansen/shiller and see if this makes the link errors go away. |
@bartlettroscoe What bothers me is that MueLu should not require that node type. Enabling |
Somehow this is getting manifested in the MueLu test suite itself so one only needs to look at how MueLu is using upstream packages and what those packages are doing. If you look at the link failure, for example, today at: you see undefined reference errors coming directly from MueLu source files as well as from Tpetra files and other packages. Can some MueLu developer log into to 'hanen' or 'shiller' and see if they can reproduce these link failures as described above? It should only take a few minutes of person-time to get the build going. Otherwise, I will let you know what happens with setting |
I'm on shiller now, diagnosing the configure process. I believe that's where the error is. |
@bartlettroscoe I tested two options:
For now, can you enable option 2? Longer term, we may be able to relax this requirement in MueLu. |
@bartlettroscoe wrote:
This used to be ON by default, but that changed as part of Kokkos' refactor of CMake / Makefile options. Changing it back would increase the build time quite a bit. Is the problem that downstream packages don't do ETI correctly? That's an issue very much like #74 in that what looks like an easy CMake option to set, actually increases the build time a lot and does not help generality. |
I also confirmed that setting
So the deal is that as long as Epetra is not enabled then you don't need these instantiations? Note that we could add some specialized logic to the file that file gets processed after the final set of enables and disables are determined and therefore, it would have all of the info needed to determine when this needed to be enabled. That would save users from having to figure this out on their own. Again, anything we can do to make Trilinos configure correctly for the requested user configuration will go a long way to reducing the reputation that Trilinos is hard to build (which I hear a lot). So what are the right set of enable variables to look for in order to set |
@bartlettroscoe I don't want to hinder this process; I just want developers to be aware that enabling stuff that most users don't need or (shouldn't) use, might be easy to do, but increases build times and sizes. I don't want developers to get complacent about that. If we decide for now not to fix it, that's fine, but that needs to be a conscious choice. |
In summary: Please go ahead and do what you need to do, but be aware that we're building more than we need. |
The enable of Tpetra_INST_SERIAL=ON may fix many of these build errors.
I would rather build less. But given the option of having a build fail with link failures or building more than the user really needs (but building what they are actually asking for), I think we should error on the side of having the build succeeded. Could problems like this be avoided if MueLu and other packages were better broken up into subpackages? For example, if Panzer only needs the Tpetra adapters from MueLu, and if MeuLu was broken up into subpackages MueLuCore, MueLuEpetra and MueLuTpetra, then if Panzer only defined a dependency on MueLuCore and MueLuTpetra, then the Epetra adapters would never get built and this problem would not exist. |
No, this would not help. At least not for MueLu, since we already have a place for a clean distinction of the Epetra and Tpetra specific code with the Xpetra package. In contrast, it would break one of the core philosophies of MueLu: to be independent of the underlying linear algebra (either Epetra or Tpetra (+ Kokkos as option)). The problem is that people always start using Tpetra in MueLu directly (instead of using Xpetra). Xpetra is meant to deal with the guards and correct instantiations. The problem with Xpetra is, that it needs some more work to enable (or write stubs for) the fancy new features in Tpetra and not all developers are willing or able to invest that additional time and effort doing so. But it's crucial to understand that we do not want to break MueLu apart into two independent pieces (the Epetra and Tpetra part). MueLu provides implementations for rather general multigrid algorithms independent of Epetra and Tpetra. |
@bartlettroscoe @jhux2
We only misuse Tpetra directly (instead of Xpetra) since in Xpetra we have no routine "subCopy", yet (which is available in Tpetra but not Epetra and therefore also not in Xpetra). The right solution would be to either avoid that function call (by doing it locally by hand) or add that functionality to Xpetra. Then we would not have such linker problems. It seems that this code has not been touched for more than two years. Obviously nobody was interested in that feature too much. Maybe one should just delete these algorithms or move them into a separate optional subpackage of all non-maintained code. |
…s#2319, TRIL-171) This should fix the MueLu build failures with CUDA reported in trilinos#2319. I also removed setting Tpetra_INST_SERIAL=ON. This should cut down on the build times over setting Tpetra_INST_SERIAL=ON.
I pushed the commit eee871d which sets Also, using @bathmatt, it is acceptable for EMPIRE if DETAILS (click to expand)Last night I pushed the commit eee871d:
This resulted in all passing builds for MueLu on all of the platforms, including all of the CUDA builds as shown at: The only test failures were for the build Note that all of the Panzer tests on CUDA passed as well as shown at: Therefore, disabling Epetra support in MueLu does not seem to impact Panzer tests at all (there have been 116 Panzer tests run for these CUDA builds for the last several days). Therefore, hopefully this would be okay for EMPIRE? It is also interesting to see the impact this has on the build times for the MueLu build wtih CUDA. Looking at the build The build two days ago on 2018-03-13 was failing and it took 2h50m17s. Then yesterday, the build passed using |
This addresses all of the MueLu test and example build failures reported in
…s#2319, TRIL-171) This should fix the MueLu build failures with CUDA reported in trilinos#2319. I also removed setting Tpetra_INST_SERIAL=ON. This should cut down on the build times over setting Tpetra_INST_SERIAL=ON.
Turns out that some of these Panzer examples test behavior that EMPIRE needs. Therefore, we need to get them working. Therefore, to help get these fixed, and since the rest of the cuda build is not clean yet, we need to turn these back on. Note that this should not enable the Panzer examples for the special "-panzer" builds since the CTest -S driver script will explicitly disable the Panzer examples for those builds. Build and test results on 'shiller' show below (the build passes but there are still some failing test). These builds are not promoted to the "ATDM" Group/Track yet so this will not spam any one with CDash error emails. Enabled Packages: Panzer Build test results: ------------------- 1) cuda-opt => FAILED: passed=150,notpassed=3 => Not ready to push! (55.40 min) 2) cuda-debug => FAILED: passed=151,notpassed=2 => Not ready to push! (66.06 min)
Turns out that some of these Panzer examples test behavior that EMPIRE needs. Therefore, we need to get them working. Therefore, to help get these fixed, and since the rest of the cuda build is not clean yet, we need to turn these back on. Note that this should not enable the Panzer examples for the special "-panzer" builds since the CTest -S driver script will explicitly disable the Panzer examples for those builds. Build and test results on 'shiller' show below (the build passes but there are still some failing test). These builds are not promoted to the "ATDM" Group/Track yet so this will not spam any one with CDash error emails.
I talked with @bathmatt yesterday and he confirmed that EMPIRE does not need Epetra support under MueLu. Therefore, this is resolved and I am closing this as completed. |
Set MueLu_ENABLE_Epetra=OFF See issue trilinos#2319 for discussion regarding this setting.
…nos#2674, trilinos#2319) The CUDA bulid for MueLu was fixed so this disable should not be needed anymore.
…nos#2674, trilinos#2319) The CUDA bulid for MueLu was fixed so this disable should not be needed anymore.
CC: @trilinos/muelu, @fryeguy52
Next Action Status:
Commit eee871d which sets
MueLu_ENABLE_Epertra=OFF
and fixes the build failures.Description
The MueLu package shows build falures for the CUDA ATDM builds today on hansen shown at:
for the builds:
Trilinos-atdm-hansen-shiller-cuda-debug
: https://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=3412693Trilinos-atdm-hansen-shiller-cuda-opt
: https://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=3412702The build failures for example at:
all show undefined reference link failues like:
but each executable has a slightly different set of link failures.
It looks like some explicit template instantiations are missing?
Steps to Reproduce:
The instructions to reproduce these build failures can be found starting at:
and clicking "Reproducing ATDM builds locally" which takes you to:
Basically, on
hansen
orshiller
, you just clone the Trilinos repo (with location depicted as$TRILINOS_DIR
below), get on thedevelop
branch. Then create a build directory and do the configure and build as:The text was updated successfully, but these errors were encountered: