Get Trilinos working with CUDA-9.0 #1976
Comments
@bathmatt and @nmhamster, please fill in the motivations above and other details about wanting Trilinos (mostly Kokkos) to work with CUDA 9. |
Is there any progress on this? |
@crtrott, does Kokkos now support CUDA 9? Is there automated testing for Kokkos with CUDA 9 on the Kokkos side to help support CUDA 9 builds of Trilinos using Kokkos? I think that is the foundation that we need to get Trilinos to start supporting CUDA 9. |
I spoke with @micahahoward about the current status of this. This is what I learned:
|
@mhoemmen will help with building the sparc/Trilinos version on Shiller with GCC 4.9.3 + CUDA 9. |
LET ME AT IT :D |
I'm more than halfway through the CUDA 9 Trilinos build on shiller. Lots of warnings (CUDA 9 apparently deprecated the shuffle functions in favor of "ballot" functions) but it looks OK thus far. |
Confirmed: Intrepid2 breaks the CUDA 9 build. I will turn it off and try again. |
FYI here's the module I loaded on shiller:
Just to verify that this is CUDA 9:
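For readers repeating this check, here is a minimal sketch of one way to confirm the toolkit version from a script. It is not the output originally posted in this comment. It assumes `nvcc` is on `PATH` after loading the module and that `nvcc --version` prints its usual "release X.Y" line; `cuda_release` and `cuda_major_is` are hypothetical helper names, not part of CUDA or Trilinos.

```shell
# Print the CUDA release (e.g. "9.0") parsed from `nvcc --version` output on stdin.
# cuda_release is a hypothetical helper name used only for this sketch.
cuda_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Succeed only if the release read from stdin has the requested major version.
# cuda_major_is is likewise a hypothetical helper.
cuda_major_is() {
  want="$1"
  rel="$(cuda_release)"
  # Strip everything from the first dot onward to get the major version.
  [ "${rel%%.*}" = "$want" ]
}
```

With the module loaded, something like `nvcc --version | cuda_major_is 9 && echo "CUDA 9 toolchain"` could guard a configure script against accidentally picking up a CUDA 8 environment.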
|
FYI this is being tracked in Jira via CDOFA-22 (project "Coordinated DevOps for ATDM"). |
Huh, this is a fun warning:
@trilinos/muelu @jhux2 |
@micahahoward I got through Trilinos' CUDA 9 build, using your shiller script just with the line |
@micahahoward The app builds, yay! I'll have to figure out how to run tests on shiller, but mainly I'd like to run Trilinos tests on shiller. Note that I had to turn off Intrepid2 to get Trilinos to build. |
@mhoemmen that's good news. Since this turned out to be an Intrepid2 problem and we don't currently use Intrepid2, we aren't impeded in using CUDA 9. However, getting Trilinos built and testing with CUDA 9 is important to us (we're just indifferent to whether or not that includes Intrepid2). In the interest of having a common configuration that covers multiple apps, someone should keep the pressure on to get Intrepid2 fixed. @bathmatt or @bartlettroscoe ? |
@mperego @kyungjoo-kim - not sure if you two saw this. Could one of you take a look at this? CUDA 9 is not working for Intrepid2. |
Could you post the error message with cuda 9? Which machine has cuda 9 along with the other modules? |
@kyungjoo-kim I think it will be similar to #1928 (using the kokkos develop branch pre-CMake changes); I believe using the kokkos-promotion branch of Trilinos with the current kokkos develop branch should reproduce the errors. To do so, you can create a symbolic link to kokkos in the base directory of Trilinos, then add the following line to your configure script: In addition, the CMake/Makefile changes coming with the next kokkos-promotion will require you to remove the previously needed CMAKE_CXX_FLAGS for cuda and specify the architecture to compile for, for example: |
@ndellingwood This error is not related to new kokkos promotion. First thing that I want to check is if current Kokkos in Trilinos develop supports CUDA 9 or not. If not I would like to hold this problem until kokkos officially supports CUDA 9. Next question is if kokkos new promotion supports CUDA 9. In that case, I can still check the problem with the new kokkos branch. |
@kyungjoo-kim I wasn't suggesting the error was due to the kokkos-promotion, it preexisted, I was just suggesting steps for testing with the version of Kokkos that supports Cuda 9. Once the kokkos promotion completes Trilinos will have the version of Kokkos that supports Cuda 9. |
@ndellingwood Can you help me some time this afternoon? I cannot compile with cuda 9 on hansen. |
@kyungjoo-kim sure :) |
@kyungjoo-kim @ndellingwood, thanks for looking into this. I'm on travel but I'll be back on Monday. |
The error is very weird, and @ndellingwood, @hcedwar and I suspect it is a compiler error. We tried to reproduce it in a simple code, but could not. @hcedwar mentioned that @nmhamster has a machine with cuda 9.1 and suggested checking whether this also happens with cuda 9.1. We will update this after testing on that machine. |
@kyungjoo-kim Did you get a chance to try again with cuda 9.1? We are waiting on Intrepid2 to update our tester to cuda 9. |
No, we have not gotten a machine with cuda 9.1 yet. @nmhamster Could you give us a timeline for the cuda 9.1 machine? |
Before getting it tested with cuda 9.1, I did some trial and error to see what triggers the problem. I think what I found is a compiler bug. When I put I don't know why only Intrepid2 encounters this error, but I can reproduce it with the following code. @nmhamster @crtrott @ndellingwood could you help me figure out this problem more specifically?
|
Verified the code above (with |
Should be encouraging news: Intrepid2 compiles and all unit tests pass with Cuda 9.0 and gcc/4.9.3 when using the current Kokkos develop branch (sha f27d189) along with some removal of #1928 contains additional details and Sacado patch. |
Thanks for looking into it @kyungjoo-kim @ndellingwood |
So it appears there are no automated builds of CUDA-9 that submit to CDash. The new ATDM builds of Trilinos matching the EMPIRE builds of Trilinos are using CUDA-8. Any reason we can't set up some CUDA-9.0 builds, at least one or two to start with? Can the Trilinos development community support this? I am asking because @micahahoward specifically requested that we set up and support CUDA-9.0 builds and there are currently build failures with that. |
@william76 This might be a good target for setting up a CUDA build. Let’s discuss this today. @bartlettroscoe Are there cycles available on waterman or another appropriate machine? @bmpersc Is waterman available through Jenkins yet? I have not heard that it is, but I figured I would ask. |
@jwillenbring, yes waterman is available on jenkins. |
@nmhamster, is waterman a machine we might consider for setting up CUDA 9.0 builds as requested by SPARC? |
@bartlettroscoe - there are no CUDA-9.0 environments built for Waterman. We would not recommend this platform for testing this combination. |
@micahahoward, where is the machine where you are trying to get Trilinos working with CUDA 9.0? |
@bartlettroscoe / @micahahoward - we might want to take this conversation away from public Github and discuss over email. For CUDA 9.0, the code teams are expected to use either ride or shiller at this stage. |
Sounds like the CUDA 8 envs on the test bed machines are going to go away soon so there is some urgency to this. It seems there has been a CUDA 9.0 env on shiller for some time (just realized that tonight after @nmhamster pointed that out more explicitly). Since the new ATDM Trilinos build is based on the EMPIRE builds of Trilinos and all of those builds seem to be using CUDA 8 so far, we have not seen any CUDA 9.0 builds yet. Given this new info and urgency, I will try to set up the new ATDM Trilinos build with this CUDA 9.0 env tomorrow and post to CDash. But it will be up to the Trilinos develop team to work through all of the new problems that might be exposed fairly rapidly. If we don't, the ATDM APP codes are going to be stuck with a broken Trilinos on CUDA 9.0 envs and will be dead in the water until we can fix Trilinos to work with CUDA 9.0. |
Information above concerning CUDA 8 vs 9 is important for your effort to establish a CUDA build for PR testing. |
FYI: This is getting worked in #2706. See details there. |
This was already done as part of #2706. We have a completely cleaned up ATDM-focused build of Trilinos for CUDA 9.0 running on 'hansen'/'shiller'. Closing as complete. The next step will be getting a CUDA 9.2 build of Trilinos going. For that, I created: If there are specific Trilinos issues (and not just system issues on 'waterman'), then we will create new Trilinos GitHub issues for those. |
CC: @trilinos/framework, @crtrott, @hcedwar, @nmhamster, @bathmatt, @micahahoward
Next Action Status
Done as part of #2706.
Description
This is a high-level issue to coordinate the efforts to get Trilinos working with CUDA 9.0 (CUDA9.0, CUDA-9.0). There is a driver for this from ATDM application developers (i.e. @bathmatt and @micahahoward).
The Coordinated DevOps for ATDM Story tracking this is:
but as many details as possible will be tracked in this issue.
NOTE: Other versions of CUDA 9.x should not be discussed.
Tasks:
Related Issues