Kokkos: Not rebuilding libkokkoscore correctly on rebuilds? #6855

Closed
bartlettroscoe opened this issue Feb 18, 2020 · 14 comments
Labels
ATDM Sev: Critical (Problems that critically damage ability to run ATDM Trilinos builds much less allow APP updates)
client: ATDM (Any issue primarily impacting the ATDM project)
CLOSED_DUE_TO_INACTIVITY (Issue or PR has been closed by the GitHub Actions bot due to inactivity.)
MARKED_FOR_CLOSURE (Issue or PR is marked for auto-closure by the GitHub Actions bot.)
PA: Data Services (Issues that fall under the Trilinos Data Services Product Area)
pkg: Kokkos
type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@bartlettroscoe
Member

bartlettroscoe commented Feb 18, 2020

As part of debugging a SPARC Trilinos Integration Build error (see SPAR-767), it was discovered that libkokkoscore was not being rebuilt correctly after the Kokkos 2.99 promotion that came with the merge of PR #6671.

This issue is to try to characterize the problem and see if it can be reproduced so it can be fixed. We really need to maintain the ability to rebuild Trilinos reliably, since it saves a massive amount of CPU cycles in the ATDM Trilinos builds and saves a lot of developer time if they can just rebuild and not have to blow away their build directories all the time.

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests client: ATDM Any issue primarily impacting the ATDM project ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams labels Feb 18, 2020
@ndellingwood
Contributor

Adding @jjwilke

@bartlettroscoe
Member Author

I will post back what I find and try to characterize this problem. I have build directories that have been rebuilding for months that show this problem. When I blew away the build directory and let it build from scratch, the correct libkokkoscore.a got built (see SPAR-767). Hopefully I can simulate what happened in a new source and build directory so we can reproduce this.

Stay tuned ...

@ndellingwood
Contributor

Linking a couple of Kokkos issues that may help with the investigation:

kokkos/kokkos#1902
PR (cmake overhaul) kokkos/kokkos#2104
PR (remove KOKKOS_SEPARATE_LIBS) kokkos/kokkos#2447

@bartlettroscoe
Member Author

kokkos/kokkos#1902

That is not the problem. When you build Trilinos from scratch you get 'libkokkoscore.a'.

kokkos/kokkos#2104

Of course, that big refactoring is what I suspect is the cause here. I will see.

@bartlettroscoe
Member Author

FYI: I have been able to create a local reproducer. I will post a detailed comment with instructions on how to run the simulation and prove that the Kokkos_Serial.cpp.o file is not getting rebuilt correctly (but when you build from scratch it does). Using ninja debugging aids it should hopefully be pretty easy to figure out what the problem is and how to fix this so this does not happen in the future with rebuilds of Trilinos.
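
One way to check for a stale object file by hand is sketched below. It is not taken from this thread: it assumes a GCC-style depfile (the kind written by -MD -MF) is still on disk for the object in question, and the function names are made up for illustration. With Ninja the depfile may already have been absorbed into Ninja's own deps log, in which case `ninja -t deps <object>` lists the same prerequisites and `ninja -d explain` reports why a target was considered out of date.

```python
import os


def parse_depfile(text):
    """Return the prerequisite paths listed in a Makefile-style depfile.

    Simplification (assumption): paths containing spaces or colons
    are not handled.
    """
    # Join backslash-newline continuations into one logical line.
    joined = text.replace("\\\n", " ")
    deps = []
    for line in joined.splitlines():
        if ":" not in line:
            continue
        # Everything after the first colon is the prerequisite list.
        _target, _sep, rhs = line.partition(":")
        deps.extend(rhs.split())
    return deps


def stale_prerequisites(obj_path, depfile_path):
    """Return prerequisites that are missing or newer than the object file."""
    obj_mtime = os.path.getmtime(obj_path)
    with open(depfile_path) as f:
        deps = parse_depfile(f.read())
    stale = []
    for dep in deps:
        if not os.path.exists(dep) or os.path.getmtime(dep) > obj_mtime:
            stale.append(dep)
    return stale
```

If stale_prerequisites() returns a non-empty list for an object that the build tool still treats as up to date, the build tool is probably not tracking those prerequisites. An empty list would instead point at a dependency that is missing from the depfile altogether.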

bartlettroscoe added a commit that referenced this issue Feb 20, 2020

This is to get all of the ATDM Trilinos builds to build from scratch tomorrow
so that the SPARC Trilinos Integration builds will start working again on
2020-02-21.  I will then revert this commit the next day and get back to work
debugging the core Kokkos rebuild problem in #6855.
@bartlettroscoe bartlettroscoe added PA: Data Services Issues that fall under the Trilinos Data Services Product Area pkg: Kokkos ATDM Sev: Nonblocker Problems with Trilinos that should not block ATDM APPs from getting updates ATDM Sev: Critical Problems that critically damage ability to run ATDM Trilinos builds much less allow APP updates and removed ATDM DevOps Issues that will be worked by the Coordinated ATDM DevOps teams ATDM Sev: Nonblocker Problems with Trilinos that should not block ATDM APPs from getting updates labels Feb 20, 2020
@bartlettroscoe
Member Author

@trilinos/kokkos,

FYI: More evidence in #7195 that Kokkos is not rebuilding correctly after the big CMake refactor.

I will add detailed instructions on how to reproduce the problem.

@jjwilke

jjwilke commented Apr 19, 2020

Do we have reproducer instructions yet?

@bartlettroscoe
Member Author

Do we have reproducer instructions yet?

I will get this updated when I can. Just have a bunch of other stuff right now.

@bartlettroscoe
Member Author

Another example of rebuild problems in #7341.

I have to deal with #6840 first, then I will put in the reproducibility instructions for the case that I know is repeatable on every platform.

@bartlettroscoe
Member Author

We have not seen this problem in a long time. Perhaps it has been fixed? I could still provide reproducibility instructions but not sure what the point would be at this point.

@bartlettroscoe
Member Author

CC: @crtrott

I think another example of this problem is documented in #8638 (comment) (expand TEST DETAILS and look at section ride). Trying to rebuild the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug on 'ride' showed the build error:

/ascldap/users/rabartl/Trilinos.base/BUILDS/RIDE/CTEST_S/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/impl/Kokkos_Core.cpp(330): error: identifier "KOKKOS_VERSION" is undefined

1 error detected in the compilation of "/tmp/tmpxft_0000912b_00000000-6_Kokkos_Core.cpp1.ii".

I had to delete the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug and configure and build from scratch, and the Kokkos-related errors went away. That is another pretty strong sign that there may be a problem with the Kokkos CMakeLists.txt files: things don't always rebuild correctly, even when you run rm -r CMake* and configure from scratch. There is likely some bad logic there.
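
A quick sanity check for this kind of failure is to list which headers under the build tree actually define the missing macro. The sketch below is an illustration, not something that was run in this thread. It assumes the macro is supposed to come from a header written at configure time, which the thread does not state, and the function name and default suffixes are hypothetical.

```python
import os
import re


def headers_defining(root, macro, suffixes=(".h", ".hpp")):
    """Return sorted paths of headers under root that #define the given macro."""
    # Match "#define MACRO" at the start of a line, allowing whitespace
    # around '#', and require a word boundary so that longer names such
    # as MACRO_MAJOR do not count as a match.
    pattern = re.compile(
        r"^[ \t]*#[ \t]*define[ \t]+" + re.escape(macro) + r"\b",
        re.MULTILINE,
    )
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as f:
                if pattern.search(f.read()):
                    found.append(path)
    return sorted(found)
```

Running headers_defining(build_dir, "KOKKOS_VERSION") on the broken build directory and on a fresh one, then comparing the two lists, would show whether a generated header is stale or missing in the broken tree.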

@bartlettroscoe
Member Author

FYI: I just hit this error after a local rebuild attempt of Trilinos:

[32/27570] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Core.cpp.o
FAILED: packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Core.cpp.o 
/projects/sems/install/rhel7-x86_64/sems/compiler/gcc/5.3.0/openmpi/1.10.1/bin/mpicxx  -Dkokkoscore_EXPORTS -I. -Ipackages/kokkos/core/src -I/home/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/src -Ipackages/kokkos -isystem /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/5.3.0/openmpi/1.10.1/include -pedantic -Wall -Wno-long-long -Wwrite-strings  -fopenmp  -O3 -DNDEBUG -fPIC   -std=c++14 -MD -MT packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Core.cpp.o -MF packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Core.cpp.o.d -o packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Core.cpp.o -c /home/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/src/impl/Kokkos_Core.cpp
/home/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/src/impl/Kokkos_Core.cpp: In function ‘void Kokkos::Impl::{anonymous}::pre_initialize_internal(const Kokkos::InitArguments&)’:
/home/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/src/impl/Kokkos_Core.cpp:330:58: error: ‘KOKKOS_VERSION’ was not declared in this scope
                                  version_string_from_int(KOKKOS_VERSION));

I had to blow away the entire build dir and configure from scratch to make this error go away. (Just deleting the packages/ directory was not enough, which is interesting.)

After the big Kokkos CMake build system refactoring 1.5 years ago, something went wrong that has broken rebuilds in some cases. Some dependency rule must not have been set correctly as part of that refactoring.

At some point, someone needs to take some time to debug what is happening and fix this. Before this, I can't remember the last time I had a build error with Trilinos doing a rebuild after reconfiguring from scratch (i.e. rm -r CMake* ; cmake ...). So much time has passed that it may be hard to come up with a rock solid reproducer now.

@github-actions

github-actions bot commented Sep 4, 2022

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Sep 4, 2022
@github-actions

github-actions bot commented Oct 5, 2022

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Oct 5, 2022
@github-actions github-actions bot closed this as completed Oct 5, 2022