Get Trilinos working with CUDA-9.0 #1976

Closed
4 tasks
bartlettroscoe opened this issue Nov 9, 2017 · 40 comments
Labels
Framework tasks Framework tasks (used internally by Framework team) pkg: Kokkos

Comments

@bartlettroscoe
Member

bartlettroscoe commented Nov 9, 2017

CC: @trilinos/framework, @crtrott, @hcedwar, @nmhamster, @bathmatt, @micahahoward

Next Action Status

Done as part of #2706.

Description

This is a high-level issue to coordinate the efforts to get Trilinos working with CUDA 9.0 (CUDA9.0, CUDA-9.0). There is a driver for this from ATDM application developers (i.e. @bathmatt and @micahahoward).

The Coordinated DevOps for ATDM Story tracking this is:

but as many details as possible will be tracked in this issue.

NOTE: Other versions of CUDA 9.x should not be discussed.

Tasks:

  1. Verify that Kokkos supports CUDA 9.0
  2. Identify one or more computer systems at SNL where a CUDA 9.0 env can be accessed by Trilinos developers and used to set up automated testing of Trilinos
  3. Produce a build of Trilinos with the target CUDA 9.0 env and post it as an experimental build to CDash
  4. Verify that there exist (or create) Trilinos GitHub issues for the current set of problems with this CUDA 9.0 build (along with clear and easy reproducibility instructions).

Related Issues

@bartlettroscoe bartlettroscoe added Framework tasks Framework tasks (used internally by Framework team) pkg: Kokkos labels Nov 9, 2017
@bartlettroscoe
Member Author

@bathmatt and @nmhamster, please fill in the motivations above and other details about wanting Trilinos (mostly Kokkos) to work with CUDA 9.

@micahahoward

Is there any progress on this?

@bartlettroscoe
Member Author

@crtrott, does Kokkos now support CUDA 9? Is there automated testing for Kokkos with CUDA 9 on the Kokkos side to help support CUDA 9 builds of Trilinos using Kokkos? I think that is the foundation that we need to get Trilinos to start supporting CUDA 9.

@bartlettroscoe
Member Author

bartlettroscoe commented Dec 11, 2017

I spoke with @micahahoward about the current status of this. This is what I learned:

  • @micahahoward heard from @crtrott that Kokkos developers are now regularly testing against some CUDA 9 build env (using GCC?)
  • Shiller/Hansen have GCC 4.9.3 + CUDA 9 envs where @bathmatt and @micahahoward are trying to build Trilinos.
  • The real target will be CUDA 9 with the XL compiler on ride/white

@micahahoward

@mhoemmen will help with building the sparc/Trilinos version on Shiller with GCC 4.9.3 + CUDA 9.

@mhoemmen
Contributor

LET ME AT IT :D

@mhoemmen
Contributor

I'm more than halfway through the CUDA 9 Trilinos build on shiller. Lots of warnings (CUDA 9 apparently deprecated the legacy warp shuffle and "ballot" functions in favor of new "_sync" variants such as __shfl_sync) but it looks OK thus far.

@mhoemmen
Contributor

Confirmed: Intrepid2 breaks the CUDA 9 build. I will turn it off and try again.

@mhoemmen
Contributor

FYI here's the module I loaded on shiller:

$ module load sparc-dev/gcc-4.9.3_cuda-9.0.176_openmpi-2.1.1

Just to verify that this is CUDA 9:

$ which nvcc
.../haswell-nvidia/cuda/9.0.176/bin/nvcc
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
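
The same check can be scripted so a configure or test driver fails early when the wrong toolkit is on the PATH. This is only a minimal sketch: the helper names `cuda_release` and `check_cuda_release` are made up for illustration, and it assumes the `release X.Y` wording of the `nvcc --version` output shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `nvcc --version` output on stdin and print
# the "release X.Y" number (for example 9.0).
cuda_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: succeed only if the nvcc on PATH reports the wanted
# release. The default of 9.0 is the version this issue targets.
# The body runs in a subshell so its variables do not leak.
check_cuda_release() (
  want="${1:-9.0}"
  have="$(nvcc --version 2>/dev/null | cuda_release)"
  [ "$have" = "$want" ]
)

# Example using the last line of the output pasted above; prints 9.0
printf '%s\n' 'Cuda compilation tools, release 9.0, V9.0.176' | cuda_release
```

A build script could then run something like `check_cuda_release 9.0 || echo "wrong CUDA version"` right after the `module load` line.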

@mhoemmen
Contributor

FYI this is being tracked in Jira via CDOFA-22 (project "Coordinated DevOps for ATDM").

@mhoemmen
Contributor

Huh, this is a fun warning:

[ 96%] Building CXX object packages/muelu/src/CMakeFiles/muelu.dir/Utils/ExplicitInstantiation/MueLu_CoarseningVisualizationFactory.cpp.o
...
.../Trilinos/packages/kokkos/core/src/Kokkos_NumericTraits.hpp(209): warning: 'long double' is treated as 'double' in device code

.../Trilinos/packages/kokkos/core/src/Kokkos_NumericTraits.hpp(210): warning: 'long double' is treated as 'double' in device code

@trilinos/muelu @jhux2
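
With this many warnings in a CUDA 9 build log, it can help to collapse them into distinct messages with counts before deciding which ones matter. A rough sketch follows; the function name `summarize_warnings` is made up, and it assumes the `file(line): warning: message` layout seen in the two lines above.

```shell
#!/bin/sh
# Hypothetical helper: read a build log on stdin and print each distinct
# nvcc-style warning message with its count, most frequent first.
summarize_warnings() {
  # In a basic regular expression "(" is a literal parenthesis and "\(" opens
  # a capture group, so this keeps only the text after "(NNN): warning: ".
  sed -n 's/^.*([0-9][0-9]*): warning: \(.*\)$/\1/p' | sort | uniq -c | sort -rn
}

# Example with the lines from the log above: both warnings collapse into a
# single message with a count of 2.
summarize_warnings <<'EOF'
[ 96%] Building CXX object packages/muelu/src/CMakeFiles/muelu.dir/Utils/ExplicitInstantiation/MueLu_CoarseningVisualizationFactory.cpp.o
.../Trilinos/packages/kokkos/core/src/Kokkos_NumericTraits.hpp(209): warning: 'long double' is treated as 'double' in device code
.../Trilinos/packages/kokkos/core/src/Kokkos_NumericTraits.hpp(210): warning: 'long double' is treated as 'double' in device code
EOF
```

For a real build, the log would be piped in instead, for example `make 2>&1 | summarize_warnings`.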

@mhoemmen
Contributor

mhoemmen commented Dec 14, 2017

@micahahoward I got through Trilinos' CUDA 9 build, using your shiller script just with the line Trilinos_ENABLE_Intrepid2:BOOL=OFF added. I'll test the app build now.

@mhoemmen
Contributor

@micahahoward The app builds, yay! I'll have to figure out how to run tests on shiller, but mainly I'd like to run Trilinos tests on shiller. Note that I had to turn off Intrepid2 to get Trilinos to build.

@micahahoward

@mhoemmen that's good news.

Since this turned out to be an Intrepid2 problem and we don't currently use Intrepid2, we aren't impeded in using CUDA 9. However, getting Trilinos built and testing with CUDA 9 is important to us (we're just indifferent to whether or not that includes Intrepid2). In the interest of having a common configuration that covers multiple apps, someone should keep the pressure on to get Intrepid2 fixed. @bathmatt or @bartlettroscoe ?

@rppawlo
Contributor

rppawlo commented Dec 14, 2017

In the interest of having a common configuration that covers multiple apps, someone should keep the pressure on to get Intrepid2 fixed.

@mperego @kyungjoo-kim - not sure if you two saw this. Could one of you take a look at this? CUDA 9 is not working for Intrepid2.

@kyungjoo-kim
Contributor

kyungjoo-kim commented Dec 14, 2017

Could you post the error message you get with CUDA 9? Which machine has CUDA 9 along with the other modules?

@ndellingwood
Contributor

@kyungjoo-kim I think it will be similar to #1928 (using the kokkos develop branch pre-CMake changes); I believe using the kokkos-promotion branch of Trilinos with the current kokkos develop branch should reproduce errors. To do so, you can create a symbolic link to kokkos in the base directory of Trilinos then add the following line to your configure script:
-D Kokkos_SOURCE_DIR_OVERRIDE:STRING=kokkos \

In addition, the CMake/Makefile changes occurring with the next kokkos-promotion will require you to remove the previously needed CMAKE_CXX_FLAGS for cuda and instead specify the architecture to compile for, for example:
-D KOKKOS_ARCH="HSW;Kepler35" \
I previously reproduced on Hansen and can help if you like.
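
For convenience, the two configure-script lines above can be generated for whatever architecture pair is needed. This is only a sketch: the function name `kokkos_cuda9_configure_lines` is made up, nothing is executed, and the symbolic link to kokkos in the Trilinos base directory still has to be created by hand.

```shell
#!/bin/sh
# Hypothetical helper: print the two configure-script lines quoted above.
# The optional first argument is the KOKKOS_ARCH value; the default is the
# example pair from this comment. The body runs in a subshell so the
# variable does not leak into the caller.
kokkos_cuda9_configure_lines() (
  arch="${1:-HSW;Kepler35}"
  printf '%s\n' \
    '-D Kokkos_SOURCE_DIR_OVERRIDE:STRING=kokkos \' \
    "-D KOKKOS_ARCH=\"$arch\" \\"
)

# Print the lines for pasting into your own configure script.
kokkos_cuda9_configure_lines
```

Keeping the quotes around the KOKKOS_ARCH value matters because the shell would otherwise treat the semicolon as a command separator.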

@kyungjoo-kim
Contributor

@ndellingwood This error is not related to the new kokkos promotion. The first thing I want to check is whether the current Kokkos in Trilinos develop supports CUDA 9. If not, I would like to hold off on this problem until kokkos officially supports CUDA 9.

The next question is whether the new kokkos promotion supports CUDA 9. If so, I can still check the problem with the new kokkos branch.

@ndellingwood
Contributor

@kyungjoo-kim I wasn't suggesting the error was due to the kokkos-promotion; it preexisted. I was just suggesting steps for testing with the version of Kokkos that supports Cuda 9. Once the kokkos promotion completes, Trilinos will have the version of Kokkos that supports Cuda 9.

@kyungjoo-kim
Contributor

@ndellingwood Can you help me some time this afternoon? I cannot compile with cuda 9 on hansen.

@ndellingwood
Contributor

@kyungjoo-kim sure :)

@mperego
Contributor

mperego commented Dec 14, 2017

@kyungjoo-kim @ndellingwood, thanks for looking into this. I'm on travel but I'll be back on Monday.

@kyungjoo-kim
Contributor

The error is very weird, and @ndellingwood, @hcedwar and I suspect it is a compiler error. We tried to reproduce the error in a simple code but could not. @hcedwar mentioned that @nmhamster has a machine with cuda 9.1 and suggested checking whether this also happens with cuda 9.1. We will update this after testing on that machine.

@Rombur

Rombur commented Jan 2, 2018

@kyungjoo-kim Did you get a chance to try again with cuda 9.1? We are waiting on Intrepid2 to update our tester to cuda 9.

@kyungjoo-kim
Contributor

No, we have not gotten a machine with cuda 9.1 yet.

@nmhamster Could you give us a timeline for the cuda 9.1 machine?

@kyungjoo-kim
Contributor

@Rombur

Before getting it tested with cuda 9.1, I did some trial and error to see what triggers the problem. What I think I found is a compiler bug. When I put const on the input arguments, I get a template parameter pack error.

I don't know why only intrepid2 encounters this error, but I can reproduce it with the following code.

@nmhamster @crtrott @ndellingwood could you help me figure out this problem more specifically?

#include <cstdio>  // for printf
#include "Kokkos_Core.hpp"
#include "Kokkos_DynRankView.hpp"

template<typename SpT, typename VT>
struct my_fake_basis {
  typedef Kokkos::DynRankView<VT,SpT> output_view_type;
  typedef Kokkos::DynRankView<VT,SpT> input_view_type;

  my_fake_basis() = default;
  void get_values(output_view_type A) {
    printf("this is okay\n\n");
  }

// uncommenting the following overload triggers the compiler error
//  void get_values(output_view_type A,
//                  const input_view_type B) {
//    printf("this is NOT okay with cuda 9\n\n");
//  }
};

template<typename SpT, typename VT>
void my_test() {
  my_fake_basis<SpT,VT> basis;
}

int main(int argc, char *argv[]) {
  my_test<Kokkos::Serial,double>();
  return 0;
}

// error message
// error: expansion pattern ‘SpT’ contains no argument packs

@ndellingwood
Contributor

Verified the code above (with Kokkos::initialize() and finalize() calls added in the main function) fails to compile with Kokkos' master branch but does compile with the develop branch.

@ndellingwood
Contributor

ndellingwood commented Jan 8, 2018

Should be encouraging news: Intrepid2 compiles and all unit tests pass with Cuda 9.0 and gcc/4.9.3 when using the current Kokkos develop branch (sha f27d189) along with some removal of DynRankView from Experimental namespace in Sacado to match cleanup in kokkos/kokkos#1293.

#1928 contains additional details and Sacado patch.

@Rombur

Rombur commented Jan 9, 2018

Thanks for looking into it @kyungjoo-kim @ndellingwood

@bartlettroscoe
Member Author

bartlettroscoe commented Mar 6, 2018

So it appears there are no automated builds of CUDA-9 that submit to CDash. The new ATDM builds of Trilinos matching the EMPIRE builds of Trilinos are using CUDA-8.

Any reason we can't set up some CUDA-9.0 builds, at least one or two to start with? Can the Trilinos development community support this? I am asking because @micahahoward specifically requested that we set up and support CUDA-9.0 builds and there are currently build failures with that.

@jwillenbring
Member

@william76 This might be a good target for setting up a CUDA build. Let’s discuss this today.

@bartlettroscoe Are there cycles available on waterman or another appropriate machine?

@bmpersc Is waterman available through Jenkins yet? I have not heard that it is, but I figured I would ask.

@bmpersc
Contributor

bmpersc commented Mar 6, 2018

@jwillenbring, yes waterman is available on jenkins.

@bartlettroscoe
Member Author

bartlettroscoe commented Mar 7, 2018

@bartlettroscoe Are there cycles available on waterman or another appropriate machine?

@nmhamster, is waterman a machine we might consider for setting up CUDA 9.0 builds as requested by SPARC?

@nmhamster
Contributor

@bartlettroscoe - there are no CUDA-9.0 environments built for Waterman. We would not recommend this platform for testing this combination.

@bartlettroscoe
Member Author

@bartlettroscoe - there are no CUDA-9.0 environments built for Waterman. We would not recommend this platform for testing this combination.

@micahahoward, where is the machine where you are trying to get Trilinos working with CUDA 9.0?

@nmhamster
Contributor

@bartlettroscoe / @micahahoward - we might want to take this conversation away from public Github and discuss over email.

For CUDA 9.0, the code teams are expected to use either ride or shiller at this stage.

@bartlettroscoe bartlettroscoe changed the title Get Trilinos working with CUDA-9 Get Trilinos working with CUDA-9.0 Mar 7, 2018
@bartlettroscoe
Member Author

Sounds like the CUDA 8 envs on the test bed machines are going to go away soon so there is some urgency to this. It seems there has been a CUDA 9.0 env on shiller for some time (just realized that tonight after @nmhamster pointed that out more explicitly). Since the new ATDM Trilinos build is based on the EMPIRE builds of Trilinos and all of those builds seem to be using CUDA 8 so far, we have not seen any CUDA 9.0 builds yet.

Given this new info and urgency, I will try to set up the new ATDM Trilinos build with this CUDA 9.0 env tomorrow and post to CDash. But it will be up to the Trilinos development team to work through, fairly rapidly, all of the new problems that might be exposed. If we don't, the ATDM APP codes are going to be stuck with a broken Trilinos on CUDA 9.0 envs and will be dead in the water until we can fix Trilinos to work with CUDA 9.0.

@jwillenbring
Member

@william76

Information above concerning CUDA 8 vs 9 is important for your effort to establish a CUDA build for PR testing.

@bartlettroscoe
Member Author

FYI: This is being worked on in #2706. See details there.

@bartlettroscoe
Member Author

This was already done as part of #2706. We have a completely cleaned up ATDM-focused build of Trilinos for CUDA 9.0 running on 'hansen'/'shiller'.

Closing as complete.

The next step will be getting a CUDA 9.2 build of Trilinos going. For that, I created:

If there are specific Trilinos issues (and not just system issues on 'waterman'), then we will create new Trilinos GitHub issues for those.
