
Undefined reference error coming from stk/boost in nightly Trilinos clang build on CEE for Albany #4676

Closed
ikalash opened this issue Mar 20, 2019 · 49 comments
Labels
client: Albany (Issue impacting the Albany project)
CLOSED_DUE_TO_INACTIVITY (Issue or PR has been closed by the GitHub Actions bot due to inactivity.)
MARKED_FOR_CLOSURE (Issue or PR is marked for auto-closure by the GitHub Actions bot.)
PA: Data Services (Issues that fall under the Trilinos Data Services Product Area)
pkg: STK
type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@ikalash
Contributor

ikalash commented Mar 20, 2019

The Albany Trilinos clang build on the CEE is broken. There is an undefined reference error stemming from stk/boost:

[ 76%] Linking CXX executable STKBalance_stk_balance_m2n.exe
libstk_balance_lib.a(balanceCommandLine.cpp.o): In function `std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > stk::CommandLineParser::get_option_value<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const':
/projects/albany/clang/include/boost/program_options/variables_map.hpp:155: undefined reference to `boost::program_options::abstract_variables_map::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const'
libstk_balance_lib.a(balanceCommandLine.cpp.o): In function `stk::CommandLineParser::CommandLineParser(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
/projects/albany/nightlyAlbanyCDash/repos/Trilinos/packages/stk/stk_util/stk_util/command_line/CommandLineParser.hpp:54: undefined reference to `boost::program_options::options_description::options_description(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned int)'

http://cdash.sandia.gov/CDash-2-3-0/viewBuildError.php?buildid=82664

Can someone from the @trilinos/stk team please have a look?

@alanw0
Contributor

alanw0 commented Mar 20, 2019

Hi Irina, Sorry for the trouble!
stk_balance does depend on boost, so I'm puzzled why it's not finding it.
We only recently enabled stk_balance for the cmake build, and I'm surprised that it is enabled in your build. Did it enable automatically, or did you purposefully enable it?

Anyway, I think I see an issue in the cmake file. It looks like stk_balance lists boost as an optional dependency, instead of a required dependency. I'll get a change in to fix that, maybe that will help.

@ikalash
Contributor Author

ikalash commented Mar 20, 2019

@alanw0 : I think it enabled automatically. We have not changed our nightly build scripts on the CEE in a while, so I don't believe it is intentional. It is interesting that the issue only happens in the clang build. We do use different builds of boost for different compilers. If you could fix it, that would be great.

I forgot to mention that we are building against master Trilinos now in our nightlies instead of develop. I guess I'll have to wait a day or two longer for the change to get into master, unless I switch the CEE nightlies to develop temporarily.

@jhux2 jhux2 added the pkg: STK label Mar 21, 2019
@ikalash ikalash added the client: Albany Issue impacting the Albany project label Mar 22, 2019
@ikalash
Contributor Author

ikalash commented Mar 22, 2019

@alanw0 : any updates on this issue? Our nightlies are still failing with the clang compiler, but it could be due to the fact that we're pulling master Trilinos now (if your fix hasn't made it into master yet).

@alanw0
Contributor

alanw0 commented Mar 22, 2019

Let me check and see if my pull request went in. I'll get back to you.

@alanw0
Contributor

alanw0 commented Mar 22, 2019

@ikalash it was pull request 4682, and it appears to have merged into develop 2 days ago. Perhaps it hasn't made it into master yet.

@ikalash
Contributor Author

ikalash commented Mar 22, 2019

@alanw0 I just checked and you are right - the change hasn't made it into master yet. I'll keep an eye out for it in the next few days and close the issue once the nightlies have shown that it has been resolved.

@ikalash
Contributor Author

ikalash commented Mar 27, 2019

@alanw0 : unfortunately we are still getting similar compilation errors with the clang compiler on CEE:

http://cdash.sandia.gov/CDash-2-3-0/viewBuildError.php?buildid=82926

I actually switched our nightlies to use develop now instead of master, so your fix should definitely be in.

P.S. Sorry about the delay - we had a broken build due to other issues in Trilinos until today - finally the PR to fix those got merged in yesterday.

@alanw0
Contributor

alanw0 commented Mar 27, 2019

@ikalash Irina you're killing me. Just kidding, sorry about the continued errors. I'm looking into the stk_balance cmake files further. I have a pull-request in progress now (4732) which fixes some issues in stk_balance cmake files, but I suspect this is something else. I'll get back to you. I will also work on setting stk_balance to off (not enabled) by default.

@ikalash
Contributor Author

ikalash commented Mar 27, 2019

@alanw0 I'm sorry :(. Hopefully this is the last of the issues! I have been pushing the Trilinos team to set up a clang build - hopefully once that's in place issues like this will get caught before they affect Albany.

@jhux2
Member

jhux2 commented Mar 27, 2019

@ikalash wrote:

I have been pushing the Trilinos team to set up a clang build - hopefully once that's in place issues like this will get caught before they affect Albany.

@ikalash Fyi, there are already some clang builds, see here.

@alanw0
Contributor

alanw0 commented Mar 27, 2019

@ikalash, I have a new PR in progress (#4745), fingers crossed that it will fix this build error.

@ikalash
Contributor Author

ikalash commented Mar 27, 2019

@alanw0 : thanks for the update, I'll keep an eye out for it.

@ikalash
Contributor Author

ikalash commented Mar 29, 2019

@alanw0 : it looks like your PR went in 17 hrs ago but we are still having failures in our Clang build

http://cdash.sandia.gov/CDash-2-3-0/viewBuildError.php?buildid=83003

@alanw0
Contributor

alanw0 commented Mar 29, 2019

Sorry @ikalash, I'm getting low on ideas about this...
For the short term, can you turn off stk_balance? -DTrilinos_ENABLE_STKBalance:BOOL=OFF
I'll try to figure out what's going on...

@ikalash
Contributor Author

ikalash commented Mar 29, 2019

@alanw0 : that's fine with me, I will go ahead and make the change.

@spdomin
Contributor

spdomin commented Apr 1, 2019

@alanw0 @ikalash, sorry to be late to the game, however, I am also seeing this error now on Nalu clang builds...

with errors in: Linking CXX executable stk_balance_m2n.exe

Undefined symbols for architecture x86_64:
"Tpetra::DistObject<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::doImport(Tpetra::SrcDistObject const&, Tpetra::Import<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, Tpetra::CombineMode, bool)", referenced from:

My Clang builds were down for a week. However, other builds seem to be fine. The latest push did not seem to help:

commit 9ed3ec3
Merge: b44c2b2 8063176
Author: trilinos-autotester [email protected]
Date: Thu Mar 28 18:36:02 2019 -0600

Merge Pull Request #4745 from alanw0/Trilinos/fix_cmake_stkbal

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Fix some more cmake issues in stk_balance and stk_util.
PR Author: alanw0

Best,

@mhoemmen
Contributor

mhoemmen commented Apr 1, 2019

@spdomin What's it "referenced from"?

@alanw0
Contributor

alanw0 commented Apr 1, 2019

@spdomin Sorry about the link issues. I've been trying to figure them out but so far without success. I'm puzzled that the issues only show up for clang... As a short term work-around I suggested to Irina that Albany can disable stk-balance since they don't need it. I believe you could also disable it currently, although I believe it will be of interest to you in the future when we improve the MxN capability etc. In the meantime I will continue tracking down the link issues...

@ikalash
Contributor Author

ikalash commented Apr 1, 2019

@spdomin @alanw0 : just to follow up: the short-term fix worked for Albany.

We've seen in our Albany testing that a lot of issues show up only with clang - that is why having a clang build is important. It would be great if Trilinos tested with the same clang compiler as codes like Albany so that these issues get caught / resolved before they impact applications.

I am happy to provide the modules / configure script where the issue shows up @alanw0 if you'd like. It's on the CEE so it should be pretty easy for you to reproduce the problem.

@spdomin
Contributor

spdomin commented Apr 1, 2019

Alan, yes, I have disabled re-balance as we currently do not have this active in the main code base. Let me know when you feel that this option can be turned back on. The "broken window" comes to mind, however, I know that you are on it :) Let me know if you would like a second pair of eyes on this, and I can sit in on an STK team room session with you.

I have the same opinion of clang as Irina shared. It's a good compiler that picks up things that other compilers miss - not to mention that it is the zero-work compiler on MacOS.

@alanw0
Contributor

alanw0 commented Apr 1, 2019

@spdomin I just noticed that the error you are seeing is different than what Albany is seeing. Yours can be fixed (I'm pretty sure) by adding this to your trilinos cmake-configure step:

-DTpetra_INST_INT_LONG_LONG:BOOL=ON 

Albany is seeing an undefined reference to boost::program_options. Perhaps you will see that too, once you add this tpetra long-long flag...

@mhoemmen
Contributor

mhoemmen commented Apr 1, 2019

Tpetra enables GlobalOrdinal=long long by default, unless GlobalOrdinal=long is enabled. @alanw0 Why do we need both long and long long? That means you're building the whole solver stack twice.

@alanw0
Contributor

alanw0 commented Apr 1, 2019

@mhoemmen I don't think we need both, and I confess I don't know enough about the configuration process to know why both are being enabled. All I know is we need long-long because stk-balance gives 64-bit stuff to Zoltan2. If tpetra enables long-long by default, then I'm even more puzzled. Because in the nalu-wind project we also hit this error, and it was fixed by adding the Tpetra_INST_INT_LONG_LONG flag.

@mhoemmen
Contributor

mhoemmen commented Apr 1, 2019

@alanw0 wrote:

All I know is we need long-long because stk-balance gives 64-bit stuff to Zoltan2

We need to fix STK Balance so that it uses GlobalOrdinal=long with Zoltan2. That should work perfectly fine. If it doesn't, that's a Zoltan2 bug.

@alanw0
Contributor

alanw0 commented Apr 1, 2019

@mhoemmen Mark, stk has these two declarations for the type of global identifiers:
In stk_balance: typedef long long BalanceGlobalNumber;
In stk_mesh: typedef uint64_t EntityId;
Perhaps these two lines are not entirely consistent with each other. (I think we were under the impression that Zoltan2 needed the type to be signed.) But I don't think 'GlobalOrdinal=long' will be correct because long isn't 64-bit on every platform.
But your help will be greatly appreciated in getting this straightened out!
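
The width and signedness question raised here can be checked at compile time. The following is a minimal sketch, not STK code: the two typedefs are copied from the comment above, while `is_64_bit` and `long_is_64_bit` are hypothetical helper names introduced only for illustration.

```cpp
#include <climits>
#include <cstdint>
#include <type_traits>

// Hypothetical helper (not part of STK): true when T is exactly 64 bits wide.
template <typename T>
constexpr bool is_64_bit() {
  return sizeof(T) * CHAR_BIT == 64;
}

// The two STK typedefs quoted in the comment above.
typedef long long BalanceGlobalNumber;  // stk_balance
typedef std::uint64_t EntityId;         // stk_mesh

// Both types are 64 bits wide on mainstream platforms...
static_assert(is_64_bit<BalanceGlobalNumber>(), "long long is expected to be 64-bit");
static_assert(is_64_bit<EntityId>(), "uint64_t is 64-bit by definition");

// ...but they differ in signedness, which is the inconsistency noted above.
static_assert(std::is_signed<BalanceGlobalNumber>::value, "BalanceGlobalNumber is signed");
static_assert(std::is_unsigned<EntityId>::value, "EntityId is unsigned");

// 'long' is 64-bit under the LP64 data model (Linux, macOS) but only 32-bit
// under LLP64 (64-bit Windows), so this result is platform dependent.
constexpr bool long_is_64_bit() {
  return is_64_bit<long>();
}
```

On an LP64 system `long_is_64_bit()` evaluates to true, which is why `GlobalOrdinal=long` appears to work there; the function deliberately has no `static_assert` because the answer differs between data models.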

@mhoemmen
Contributor

mhoemmen commented Apr 1, 2019

@alanw0 Thanks for checking! If we're worried about GlobalOrdinal being 64 bits, then we should always use GlobalOrdinal=long long.

STK should be able to use whatever global index type it wants, as long as it always gives Zoltan2 the index type that Zoltan2 wants. That may imply type conversion, but we need to be OK with that. Zoltan2 will do something much more expensive (global load balancing) than just copying an array of indices locally on every process.
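
The type conversion described above might look roughly like the sketch below. It assumes the unsigned `EntityId` and signed `BalanceGlobalNumber` typedefs quoted earlier in the thread; `fits_in_signed` and `to_signed_ids` are hypothetical names, and this is not how STK or Zoltan2 actually perform the hand-off.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

typedef std::uint64_t EntityId;         // as quoted from stk_mesh
typedef long long BalanceGlobalNumber;  // as quoted from stk_balance

// Hypothetical helper: true if an unsigned id is representable in the signed type.
constexpr bool fits_in_signed(EntityId id) {
  return id <= static_cast<EntityId>(std::numeric_limits<BalanceGlobalNumber>::max());
}

// Hypothetical helper: copy ids into the signed index type, refusing any value
// that would wrap to a negative number.
inline std::vector<BalanceGlobalNumber> to_signed_ids(const std::vector<EntityId>& ids) {
  std::vector<BalanceGlobalNumber> out;
  out.reserve(ids.size());
  for (EntityId id : ids) {
    if (!fits_in_signed(id)) {
      throw std::range_error("entity id does not fit in a signed 64-bit index");
    }
    out.push_back(static_cast<BalanceGlobalNumber>(id));
  }
  return out;
}
```

The copy is a single local pass over the ids on each process, which is the cost being compared against global load balancing in the comment above.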

@spdomin
Contributor

spdomin commented Apr 2, 2019

I would prefer not having to build the solver stack twice - especially if we are not using re-balance at present. Also, every config file that I have seen Nalu use specifies:

-DTpetra_INST_INT_LONG:BOOL=ON \

I think that, unless Windows support is desired, this is safe :)

https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models

At any rate, let me know what the long term solution is here. What happens if I turn off LONG and keep the default to LONG LONG? Does STK now get confused?

@alanw0
Contributor

alanw0 commented Apr 2, 2019

@spdomin Mark and I were talking about this more yesterday, and I'm trying a couple of sierra builds to figure out exactly what we need to instantiate. I don't think you'll need both types. I'll let you know, hopefully later today.

@alanw0
Contributor

alanw0 commented Apr 2, 2019

@mhoemmen I sent you an email about how aria tests fail when I turn off Tpetra_INST_INT_LONG, even though the build/link was successful. Still some investigation to do...

@spdomin
Contributor

spdomin commented Apr 3, 2019

Short term, I removed STK_Rebalance from our config file. However, STK_Unit has tests that exercise STK_Rebalance. As such, I turned off STK_Unit. However, now, our unit tests are failing to build (as expected) as our unit tests pull in STK unit test mesh fixtures:

/Users/naluIt/gitHubWork/nightlyBuildAndTest/Nalu/unit_tests/UnitTestHexElementPromotion.C:15:10: fatal error:
'stk_unit_tests/stk_mesh_fixtures/HexFixture.hpp' file not found
#include <stk_unit_tests/stk_mesh_fixtures/HexFixture.hpp>

Perhaps we should create a new ticket that deals with this long long / long issue? When I joined the discussion, I was under the impression that Albany and Nalu shared the same rebalance build issue.

@bartlettroscoe
Member

Looks like we just hit this with the new ATDM Trilinos intel-18.0.5 build. See #5335 (comment).

Adding the ATDM labels to this issue as well.

@alanw0, do we just need to add a BoostLibs reference to these STK subpackages to fix this for now?

@bartlettroscoe bartlettroscoe added client: ATDM Any issue primarily impacting the ATDM project client: EMPIRE All issues that most directly target the ATDM EMPIRE code type: bug The primary issue is a bug in Trilinos code or tests PA: Data Services Issues that fall under the Trilinos Data Services Product Area labels Jun 10, 2019
@bartlettroscoe
Member

@kddevin (Trilinos Data Services Product Area Lead)

FYI: This is breaking the new intel-18.0.5 builds that EMPIRE is relying on (see #5335). It looks like they have a workaround in place for now but it is bringing down the ATDM Trilinos builds protecting this build for EMPIRE (and also we would assume GEMMA).

@bartlettroscoe bartlettroscoe added the ATDM Sev: Blocker Problems that make Trilinos unfit to be adopted by one or more ATDM APPs label Jun 10, 2019
@alanw0
Contributor

alanw0 commented Jun 10, 2019

This looks like it's due to the stk change to make boostlib optional rather than required. i.e., if boostlib is not enabled, then some stk sub-packages get disabled. I think you can get the builds working again by adding '-DTPL_ENABLE_BoostLib:BOOL=ON'.

@bartlettroscoe
Member

@alanw0 said:

I think you can get the builds working again by adding '-DTPL_ENABLE_BoostLib:BOOL=ON'.

That appears not to be the issue, at least not with the ATDM Trilinos configuration. As shown here, the BoostLib TPL is enabled, with the configure output showing:

Final set of enabled TPLs:  MPI BLAS LAPACK Boost HDF5 Netcdf BoostLib DLlib 8

...

Processing enabled TPL: BoostLib (enabled explicitly, disable with -DTPL_ENABLE_BoostLib=OFF)
-- BoostLib_LIBRARY_NAMES='boost_program_options;boost_system'
-- TPL_BoostLib_LIBRARIES='/projects/sems/install/rhel7-x86_64/sems/tpl/boost/1.59.0/intel/18.0.5/base/lib/libboost_program_options.so;/projects/sems/install/rhel7-x86_64/sems/tpl/boost/1.59.0/intel/18.0.5/base/lib/libboost_system.so'
-- Searching for headers in BoostLib_INCLUDE_DIRS='/projects/sems/install/rhel7-x86_64/sems/tpl/boost/1.59.0/intel/18.0.5/base/include'
-- Searching for a header file in the set "boost/version.hpp":
--   Searching for header 'boost/version.hpp' ...
--     Found header '/projects/sems/install/rhel7-x86_64/sems/tpl/boost/1.59.0/intel/18.0.5/base/include/boost/version.hpp'
-- Searching for a header file in the set "boost/mpl/at.hpp":
--   Searching for header 'boost/mpl/at.hpp' ...
--     Found header '/projects/sems/install/rhel7-x86_64/sems/tpl/boost/1.59.0/intel/18.0.5/base/include/boost/mpl/at.hpp'
-- Found TPL 'BoostLib' include dirs '/projects/sems/install/rhel7-x86_64/sems/tpl/boost/1.59.0/intel/18.0.5/base/include'
-- TPL_BoostLib_INCLUDE_DIRS='/projects/sems/install/rhel7-x86_64/sems/tpl/boost/1.59.0/intel/18.0.5/base/include'

My guess is that some STK package that requires BoostLib is not properly declaring a dependence on BoostLib but only Boost (therefore finding the header files but not the boost libs).

We really need to refactor this to just have a single Boost TPL and then have it have optional components and get rid of the hacked BoostLib TPL. We might be able to make that backward compatible if we are careful.

@nmhamster
Contributor

@roscoebartlett / @alanw0 - I enabled BoostLib for NALU but it didn't completely fix the issue. In addition, I had to manually add the library paths to my link line. I was under the impression that this was NALU related but perhaps not. As background, the reason this wasn't completely crashing was that the system had a Boost install in the /usr area, but it was much older than the TPL provided by the developer pack on the machine. Hence, all sorts of nasties were occurring. Could that be happening?

@bartlettroscoe
Member

Looking at the detailed link lines and poking around some should determine what the problem is and how to fix this. I can almost guarantee the problem is a missing BoostLib line in an STK Dependencies.cmake file.

@alanw0
Contributor

alanw0 commented Jun 10, 2019

@bartlettroscoe you're probably right, but it's not obvious to me where that erroneous dependency or missing dependency is. We'll try to look into it...

@nmhamster
Contributor

@bartlettroscoe - I'm sorry I got your nickname incorrect above. Blame it on my morning coffee.

@bartlettroscoe
Member

FYI: This might be caused by an ABI problem and not with TriBITS or STK CMake files. See #5335 (comment).

@bartlettroscoe
Member

FYI: We confirmed this is in fact an ABI issue. The workaround for our case in #5335 was to set -D_GLIBCXX_USE_CXX11_ABI=0 (see #5365).
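
For readers hitting a similar link error: with GCC's libstdc++, the `_GLIBCXX_USE_CXX11_ABI` macro selects between two layouts of `std::string`, and the newer one shows up in symbol names as `std::__cxx11::basic_string`, as in the error output at the top of this issue. An object compiled with one setting will not link against a library whose symbols were built with the other. The sketch below assumes libstdc++; `libstdcxx_string_abi`, `has_prefix`, and `mentions_cxx11_abi` are hypothetical helper names for illustration, and whether the original Albany clang failure had the same cause was not confirmed in this thread.

```cpp
#include <string>  // including a libstdc++ header makes _GLIBCXX_USE_CXX11_ABI visible

// Hypothetical helper: which libstdc++ string ABI this translation unit uses.
// Returns 1 for the new ABI (std::__cxx11::basic_string), 0 for the old ABI,
// and -1 when the standard library is not libstdc++ (macro not defined).
constexpr int libstdcxx_string_abi() {
#if defined(_GLIBCXX_USE_CXX11_ABI)
  return _GLIBCXX_USE_CXX11_ABI;
#else
  return -1;
#endif
}

// Hypothetical helper: true if 's' begins with 'prefix'.
constexpr bool has_prefix(const char* s, const char* prefix) {
  return *prefix == '\0' ? true : (*s == *prefix && has_prefix(s + 1, prefix + 1));
}

// Hypothetical helper: true if a symbol name from a linker error carries the
// "__cxx11" tag, i.e. the referencing object was built with the new ABI.
constexpr bool mentions_cxx11_abi(const char* symbol) {
  return *symbol == '\0'
             ? false
             : (has_prefix(symbol, "__cxx11") || mentions_cxx11_abi(symbol + 1));
}
```

Compiling a small program that prints `libstdcxx_string_abi()` with the same flags as Trilinos, and comparing the result with the symbols exported by the Boost library being linked, is one way to check for this kind of mismatch.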

@bartlettroscoe
Member

This has been resolved for ATDM so removing the "ATDM" label.

@bartlettroscoe bartlettroscoe removed ATDM Sev: Blocker Problems that make Trilinos unfit to be adopted by one or more ATDM APPs client: ATDM Any issue primarily impacting the ATDM project client: EMPIRE All issues that most directly target the ATDM EMPIRE code labels Jun 20, 2019
@github-actions

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Aug 18, 2021
@github-actions

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Sep 26, 2021

8 participants