HIP: CMake work on tests and ETI to truely enable the backend #820

Merged
merged 5 commits into from
Nov 3, 2020

Conversation

lucbv
Contributor

@lucbv lucbv commented Sep 30, 2020

The goal is to make sure the HIP backend is properly instantiated
by ETI and tested in the unit-tests using the Kokkos::Experimental::HIP
and Kokkos::Experimental::HIPSpace execution and memory spaces.

@lucbv lucbv requested a review from jjwilke September 30, 2020 16:13
@lucbv lucbv self-assigned this Sep 30, 2020
@lucbv lucbv mentioned this pull request Sep 30, 2020
Contributor

@ndellingwood ndellingwood left a comment

@lucbv Several of the BLAS tests in unit_test/hip have Cuda in the filename and should be renamed with Hip (e.g. unit_test/hip/Test_Cuda_Blas2_gemv.cpp)

@lucbv lucbv force-pushed the HIP_CMake_work branch 2 times, most recently from 569fdd0 to 1f0e85a Compare October 1, 2020 23:51
@lucbv
Contributor Author

lucbv commented Oct 2, 2020

@ndellingwood this was meant to be WIP so I can show @jjwilke where I'm at and fix some CMake stuff.
Not all the unit-tests are building, so I am keeping some commented out. I am thinking of adding some CMake flags to selectively enable/disable tests at configure time to allow for incremental progress.
But since Kokkos is ahead of me with rocm-3.8.0, I think I won't try to push this further until it is installed on caraway.

Once this is done we should be about finished with HIP backend though : )

@brian-kelley
Contributor

@lucbv @jjwilke I'm working on a branch to add HIP code paths to algorithms, perf tests, examples, etc. For example, limiting vector length to 64 instead of 32. I won't touch anything related to ETI, but when that's done we should be close to having tests pass on caraway.
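
A minimal sketch of the kind of backend-dependent cap described above, assuming the limit follows the hardware's SIMT width (32-lane warps on CUDA, 64-lane wavefronts on the AMD GPUs targeted here). `max_vector_length` and `clamp_vector_length` are hypothetical names, not KokkosKernels functions; `KOKKOS_ENABLE_HIP` is the macro Kokkos defines when the HIP backend is enabled.

```cpp
// Hypothetical helpers for illustration only; not part of KokkosKernels.
constexpr int max_vector_length() {
#if defined(KOKKOS_ENABLE_HIP)
  return 64;  // assumption: 64-lane AMD wavefront
#else
  return 32;  // CUDA warp width, also used here as the fallback
#endif
}

// Clamp a requested vector length to the cap of the enabled backend.
constexpr int clamp_vector_length(int requested) {
  return requested > max_vector_length() ? max_vector_length() : requested;
}
```

Keeping the cap in one function means algorithms only ask for a clamped length and never hard-code 32 or 64 themselves.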

@lucbv
Contributor Author

lucbv commented Oct 6, 2020

@brian-kelley just double checking: are you working on top of this PR or on top of develop?

@brian-kelley
Contributor

brian-kelley commented Oct 6, 2020

@lucbv I started out today working from develop, but I can rebase on your branch. Is the plan to only merge HIP work into develop when it all works? If we're ok with having different pieces go in without 100% working, I think this PR could just go in.

(by that, I mean that adding files that only compile with HIP enabled shouldn't break anyone's build)

@lucbv
Contributor Author

lucbv commented Oct 6, 2020

@brian-kelley
I am debating the idea of merging this but I think it would make sense to see if we can get the PR by @jjwilke first and then rebase this PR on top of it so we can selectively enable tests for HIP builds.
Btw I think that develop builds and runs just fine on caraway right now, but if you grab the HEAD of this branch it will be a different story...
Do you want to meet and talk about this work?

@brian-kelley
Contributor

@lucbv I will just finish the branch I have now, then when Jeremy's ETI stuff goes in, we can combine our stuff and rebase on that. We could talk about this tomorrow. Today I got through batched, blas, graph, common (sparse was going to be the trickiest).

@lucbv
Contributor Author

lucbv commented Oct 20, 2020

@ndellingwood this version should be much closer to the final one.
I would like to try to build it against the current develop branch of kokkos and see how much farther it gets, some improvements were made since they upgraded to rocm/3.8.0

@lucbv
Contributor Author

lucbv commented Oct 21, 2020

@ndellingwood thanks for the look, it's actually a good thing. I just used a script to automatically copy all the Cuda unit-tests into the HIP folder and then changed the header inclusion. Maybe I should actually do a grep for Cuda in the HIP unit-test folder just to see what comes out of it?
I'll address your comments : )
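
The check suggested above can be sketched as a small filter over file names. This is only an illustration: `names_mentioning_cuda` is a hypothetical helper, it looks at names rather than file contents (a real grep would cover both), and the caller is assumed to supply the list of files found in the HIP unit-test folder.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: return the file names that still contain "Cuda",
// e.g. leftovers from copying the Cuda unit-tests into the HIP folder.
std::vector<std::string> names_mentioning_cuda(
    const std::vector<std::string>& names) {
  std::vector<std::string> hits;
  for (const std::string& name : names) {
    if (name.find("Cuda") != std::string::npos) hits.push_back(name);
  }
  return hits;
}
```

Run against the folder listing, a name such as `Test_Cuda_Blas2_gemv.cpp` (the example flagged in the review) would be reported as still needing a rename.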

@lucbv lucbv force-pushed the HIP_CMake_work branch 2 times, most recently from d8229bd to 7bdfe10 Compare October 29, 2020 03:57
@lucbv
Contributor Author

lucbv commented Oct 29, 2020

@srajama1 @ndellingwood @brian-kelley @jjwilke
This work is ready for review, I have started a spot_check on Kokkos-dev2 that could reveal something new after changes I made to the unit-tests this morning (spot-check I ran last night was clean).
The main thing left to make this completely clean will be to apply the changes that @jjwilke has worked on in PR #835 that will allow me to uncomment some tests that are currently not building.

Regarding what works and what does not after this PR:

  1. The library builds on caraway (node 04 with rocm/3.8.0 at least)
  2. The ETI is in place and instantiates the library correctly on Kokkos::Experimental::HIP and Kokkos::Experimental::HIPSpace
  3. All the perf_tests are building correctly
  4. The Blas and Sparse tests are building correctly, other tests will be enabled at a later time
  5. Overall, both the Blas and Sparse unit-tests fail, but a good chunk of their gtest sub-tests pass; some work will be needed to clean up the rest
  6. The wiki tests are passing but the default types used there might not be appropriate, need to double check. -> actually checked and @brian-kelley modified that correctly so it runs on HIP
  7. Running the perf_tests was not attempted yet.

One thing to do after this PR will be to modify the spot-check so that it can run on caraway. While it cannot check for passing tests, just checking for compilation would already be helpful. Especially so since rocm/3.8.0 is using clang-11 as compiler and we do not test with it (currently kokkos-dev2 tests with clang 8.0 and clang 9.0.0).

Please let me know if this makes sense or if you have a strong opinion like: Blas or Sparse should pass before merging?

@brian-kelley
Contributor

I would say, as long as it doesn't break serial/cuda/openmp we should just merge anything that gets closer to HIP working. It's easier than having several diverging branches of develop with the updates. We just shouldn't advertise to users that they should try hip yet :)

For question 6 you had, I just double checked in all the wiki examples that they are calling HIP, which is the DefaultExecutionSpace in a HIP+Serial build.

For perf tests, all the graph ones seem to work completely. I just tried several of the D1 and D2 coloring algorithms on some pretty big graphs, and mis2 also works. Coloring is producing a plausible color array and mis2 can verify the output. I did miss the D1 coloring when adding HIP support but I have a PR #841 for that.

@lucbv
Contributor Author

lucbv commented Oct 29, 2020

Yeah that's my philosophy too, let's see what others say.
Also here are the results from kokkos-dev2 spot-check:

#######################################################
PASSED TESTS
#######################################################
clang-8.0-Cuda_OpenMP-release build_time=692 run_time=147
clang-8.0-Pthread_Serial-release build_time=235 run_time=131
clang-9.0.0-Pthread-release build_time=150 run_time=62
clang-9.0.0-Serial-release build_time=160 run_time=48
cuda-10.1-Cuda_OpenMP-release build_time=838 run_time=146
cuda-11.0-Cuda_OpenMP-release build_time=895 run_time=308
cuda-9.2-Cuda_Serial-release build_time=813 run_time=215
gcc-7.3.0-OpenMP-release build_time=166 run_time=47
gcc-7.3.0-Pthread-release build_time=130 run_time=62
gcc-8.3.0-Serial-release build_time=159 run_time=50
gcc-9.1-OpenMP-release build_time=206 run_time=47
gcc-9.1-Serial-release build_time=175 run_time=55
intel-17.0.1-Serial-release build_time=404 run_time=52
intel-18.0.5-OpenMP-release build_time=681 run_time=49
intel-19.0.5-Pthread-release build_time=357 run_time=64

Contributor

@brian-kelley brian-kelley left a comment

The HIP device should be HIP, not HIPSpace (see 2 comments)

@lucbv
Contributor Author

lucbv commented Oct 29, 2020

@brian-kelley I found the exec space issue and enabled the common and graph tests!
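
The distinction behind this fix can be sketched with stand-in types. The code below does not use Kokkos: `HIP`, `HIPSpace`, and `is_execution_space` are simplified stand-ins written for this illustration, assuming only that an execution space (where kernels run) and a memory space (where data lives) are distinct types that refer to each other.

```cpp
#include <type_traits>

// Stand-in types for illustration; these are NOT the Kokkos classes.
struct HIPSpace;  // memory space: where data lives

struct HIP {      // execution space: where kernels run
  using execution_space = HIP;
  using memory_space    = HIPSpace;
};

struct HIPSpace {
  using execution_space = HIP;
  using memory_space    = HIPSpace;
};

// Simplified trait: a type is an execution space if it names itself as one.
template <class T>
struct is_execution_space
    : std::is_same<T, typename T::execution_space> {};

// The fix described above: the test alias must name the execution space.
using TestExecSpace = HIP;  // not HIPSpace

static_assert(is_execution_space<TestExecSpace>::value,
              "TestExecSpace must be an execution space");
```

With `TestExecSpace` set to `HIPSpace`, the `static_assert` in this sketch would fail at compile time, which is one way to catch the mix-up before it shows up as build or runtime errors in the unit tests.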

@brian-kelley brian-kelley mentioned this pull request Oct 29, 2020
by ETI and tested in the unit-test using the Kokkos::Experimental::HIP
and Kokkos::Experimental::HIPSpace execution and memory spaces.

Status of the unit-tests:
Batched -> does not build
Blas    -> builds and Blas1 test pass, others fail
Common  -> does not build
Graph   -> does not build
Sparse  -> builds, but only a few tests run before it crashes

Fix blas and blas3 perf_test:
these perf_test use parallel_for to run multiple tests at once
while on host and call a host only function to do so.
These are now being skipped when KOKKOS_ENABLE_HIP is on.
All performance tests are building.
lucbv added 3 commits October 29, 2020 23:35
TestExecSpace was set to HIPSpace instead of HIP creating
problems at build and execution time in unit_test.
We can now enable common and graph unit-tests for build.
common_hip is passing without failures.
The min launch bound was set to 2 and is now set to 0.
This new setting allows the BLAS unit-test to successfully run
to completion.
With the new modifications the spot-check can be run on caraway04
where BLAS and Common unit-tests will be built and tested as well
as the wiki tests.
@lucbv
Contributor Author

lucbv commented Oct 30, 2020

With a bit more work done in the last couple of commits, we now have the spot-check building and testing the BLAS and Common unit-tests and the wiki examples successfully on caraway04.

caraway04 spot-check

#######################################################
PASSED TESTS
#######################################################
rocm-3.8.0-Hip_Serial-release build_time=667 run_time=121

kokkos-dev2 spot-check

#######################################################
PASSED TESTS
#######################################################
clang-8.0-Cuda_OpenMP-release build_time=740 run_time=733
clang-8.0-Pthread_Serial-release build_time=270 run_time=144
clang-9.0.0-Pthread-release build_time=165 run_time=71
clang-9.0.0-Serial-release build_time=160 run_time=48
cuda-10.1-Cuda_OpenMP-release build_time=876 run_time=149
cuda-11.0-Cuda_OpenMP-release build_time=906 run_time=153
cuda-9.2-Cuda_Serial-release build_time=856 run_time=228
gcc-7.3.0-OpenMP-release build_time=164 run_time=48
gcc-7.3.0-Pthread-release build_time=132 run_time=64
gcc-8.3.0-Serial-release build_time=163 run_time=50
gcc-9.1-OpenMP-release build_time=207 run_time=49
gcc-9.1-Serial-release build_time=199 run_time=51
intel-17.0.1-Serial-release build_time=393 run_time=54
intel-18.0.5-OpenMP-release build_time=701 run_time=51
intel-19.0.5-Pthread-release build_time=391 run_time=72

@srajama1
Contributor

All, @lucbv @brian-kelley @jjwilke: First, thanks for all the progress on the HIP backend. Let me review the actual changes next. In principle I prefer incremental changes over one massive change. As long as we can clearly identify the failures, turn them off, and create an issue listing all of them so we can make progress on those, we can push this to develop. I do not want to advertise HIP support until we have clean tests (except where external issues from TPLs or compilers force us to disable tests). On to the code now.

Contributor

@srajama1 srajama1 left a comment

Code wise looks ok. My major comments :

  • please file issues or one master issue for all the problems even if there is a workaround now.
  • can we avoid 100+ files for every backend ?

@@ -292,7 +292,7 @@ void __do_trmm_serial_batched(options_t options, trmm_args_t trmm_args) {
return;
}

-#if !defined(KOKKOS_ENABLE_CUDA)
+#if !defined(KOKKOS_ENABLE_CUDA) && !defined(KOKKOS_ENABLE_HIP)
Contributor

Are these basically "host only" checks ? Then we can add one check as host only and not worry about this for Intel SYCL and OpenMP target backends.

Contributor Author

Good point, we can definitely create a KOKKOSKERNELS_HOST_ONLY macro that stores that information and is used in the code.

Contributor Author

see issue #843
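
A minimal sketch of the host-only macro idea from this thread. The macro name `KOKKOSKERNELS_HOST_ONLY` comes from the comment above, but its definition here is an assumption, as is the hypothetical `host_only_build` helper; the actual resolution is tracked in the issue referenced above.

```cpp
// Sketch only: one central place that decides whether the build is host-only.
// KOKKOS_ENABLE_CUDA and KOKKOS_ENABLE_HIP are defined by Kokkos when those
// backends are enabled; further device backends would be added to this one
// condition instead of to every call site.
#if !defined(KOKKOS_ENABLE_CUDA) && !defined(KOKKOS_ENABLE_HIP)
#define KOKKOSKERNELS_HOST_ONLY
#endif

// Hypothetical query for code that prefers a constant over a preprocessor test.
constexpr bool host_only_build() {
#if defined(KOKKOSKERNELS_HOST_ONLY)
  return true;
#else
  return false;
#endif
}
```

Call sites such as the guard in the diff above would then test `#if defined(KOKKOSKERNELS_HOST_ONLY)` rather than listing each device backend.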

// yield an optimal_num_blocks=0 which means no resources
// are allocated... Switching to LaunchBounds<384,2> fixes
// that problem but I'm not sure if that is a good perf
// parameter or why it is set to 2 for Cuda?
Contributor

File a Kokkos issue ?

Contributor Author

Yeah, I'll talk about it with Damien and Daniel. Some parameters are inherently different on AMD, like the warp size, so this could be something that needs to differ depending on the architecture?

# COMPONENTS graph
# )

#currently float 128 test is not working. So common tests are explicitly added.
Contributor

This needs to be in the list of issues to fix.

Contributor

@srajama1 Note that float128 tests are also disabled for CUDA, so it's generally just not supported on GPU right now. Because of that, I think this should be lower priority than the other tests.

Ok, got it.

@@ -0,0 +1,3 @@
#include "Test_HIP.hpp"
Contributor

Are all these test files needed? I am worried we are going in the direction of the ETI again. This is going to blow up with OpenMP target and Intel SYCL. Can we do the same trick as we did for ETI, where we generate these files if unit tests are ON?

Contributor Author

Yes, something like that, or having all the headers for the different backends in a single file and using guards to enable them selectively at compile time with different executable names.

Contributor Author

see issue #845
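
The single-file alternative mentioned in this thread could look roughly like the sketch below. It is an assumption-laden illustration: the `TEST_BACKEND_*` macros and `test_category` are hypothetical names, imagined as being passed by the build system (for example with `-DTEST_BACKEND_HIP`) once per test executable.

```cpp
#include <string>

// Hypothetical shared test header: one file selects the backend under test
// from a macro supplied at compile time, instead of one file per backend.
#if defined(TEST_BACKEND_HIP)
#define TEST_CATEGORY_NAME "hip"
#elif defined(TEST_BACKEND_CUDA)
#define TEST_CATEGORY_NAME "cuda"
#elif defined(TEST_BACKEND_OPENMP)
#define TEST_CATEGORY_NAME "openmp"
#else
#define TEST_CATEGORY_NAME "serial"  // fallback when no backend macro is set
#endif

// Name of the backend this translation unit was compiled for.
inline std::string test_category() { return TEST_CATEGORY_NAME; }
```

Each test source would then be compiled several times with a different macro, giving one executable name per backend without adding a new set of files whenever a backend appears.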

@rsiva-web

Can I merge this with the assumption above comments will be filed as an issue ? It would be nice for this to go in the next release.

@lucbv
Contributor Author

lucbv commented Nov 2, 2020

Yes, feel free to merge. I am filing new issues for the points raised here.

@srajama1 srajama1 merged commit 9e338ef into kokkos:develop Nov 3, 2020
@lucbv lucbv deleted the HIP_CMake_work branch November 3, 2020 04:22