
Add HIP support to src/ and perf_test/ #828

Merged
merged 18 commits into kokkos:develop from HIP_Algorithms on Oct 27, 2020

Conversation


@brian-kelley brian-kelley commented Oct 13, 2020

Where possible, make execution space handling generic across all GPU-like devices.

@lucbv Since these changes don't break existing supported devices, I say we just merge it now instead of waiting for ETI and caraway to update.

#######################################################
PASSED TESTS
#######################################################
clang-8.0-Cuda_OpenMP-release build_time=781 run_time=181
clang-8.0-Pthread_Serial-release build_time=260 run_time=130
clang-9.0.0-Pthread-release build_time=161 run_time=61
clang-9.0.0-Serial-release build_time=156 run_time=49
cuda-10.1-Cuda_OpenMP-release build_time=932 run_time=155
cuda-11.0-Cuda_OpenMP-release build_time=1077 run_time=1083
cuda-9.2-Cuda_Serial-release build_time=864 run_time=197
gcc-7.3.0-OpenMP-release build_time=164 run_time=253
gcc-7.3.0-Pthread-release build_time=142 run_time=55
gcc-8.3.0-Serial-release build_time=161 run_time=48
gcc-9.1-OpenMP-release build_time=206 run_time=97
gcc-9.1-Serial-release build_time=181 run_time=46
intel-17.0.1-Serial-release build_time=359 run_time=55
intel-18.0.5-OpenMP-release build_time=652 run_time=48
intel-19.0.5-Pthread-release build_time=332 run_time=67

@brian-kelley brian-kelley added WIP Work In Progress enhancement and removed WIP Work In Progress labels Oct 13, 2020
@brian-kelley brian-kelley self-assigned this Oct 14, 2020

@lucbv lucbv left a comment


So as a general comment, from the point of view of the person reviewing this PR, I would have liked it if this could have been a few smaller PRs; it was real work reading through it x)

I noted at least one opportunity to factor out a little helper function for code which appeared three or four times.

Overall a lot of changes revolve around the use of:

  1. kk_is_gpu_exec_space<exec_space>()
  2. kk_get_suggested_vector_size()
  3. kk_get_free_total_memory<mem_space>(free_mem, total_mem)

I think these functions should be advertised a bit, maybe during the next stand-up, so they can be adopted more widely by others on the team.

Finally, the changes in SpMV could have been a PR of their own and need more investigation in terms of performance impact, although since the architectures we care about have changed, this might be a good time to make these changes.

iEntry ++)
#endif
{
Kokkos::parallel_for(Kokkos::ThreadVectorRange(dev, row.length),

So I tried to make these changes in the past, but I think on some architecture, maybe Intel KNL with the Intel compiler, it resulted in a loss of performance and we canned it.
Let's see what @srajama1 says about it now: do we still care about KNL performance, or can we make my dream come true and get rid of the ugly pragmas that Kokkos should help us avoid?

The same comment pretty much goes for the rest of the changes in the file.


@lucbv I'm happy to rerun some benchmarks before and after this change. I saw comments like "// This should be a thread loop as soon as we can use C++11", so I figured this would be converted to 3-level parallelism at some point, and I didn't know performance was a reason to keep it as plain for loops.

I'm pretty sure Kokkos's parallel_for over ThreadVectorRange looks like this for OpenMP:

template <typename iType, class Closure, class Member>
KOKKOS_INLINE_FUNCTION void parallel_for(
    Impl::ThreadVectorRangeBoundariesStruct<iType, Member> const&
        loop_boundaries,
    Closure const& closure,
    typename std::enable_if<Impl::is_host_thread_team_member<Member>::value>::
        type const** = nullptr) {
#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP
#pragma ivdep
#endif
  for (iType i = loop_boundaries.start; i < loop_boundaries.end;
       i += loop_boundaries.increment) {
    closure(i);
  }
}

so after inlining, I would hope that it basically turns into a regular for loop with ivdep. Then it's basically the original code but without unroll and loop count pragmas.


brian-kelley commented Oct 14, 2020

@lucbv Thanks a lot for the thorough review. I resolved everything here apart from the performance changes in SpMV, and I'm re-running the tests.

Update:
#######################################################
PASSED TESTS
#######################################################
clang-8.0-Cuda_OpenMP-release build_time=657 run_time=124
clang-8.0-Pthread_Serial-release build_time=258 run_time=124
clang-9.0.0-Pthread-release build_time=145 run_time=54
clang-9.0.0-Serial-release build_time=152 run_time=45
cuda-10.1-Cuda_OpenMP-release build_time=880 run_time=125
cuda-11.0-Cuda_OpenMP-release build_time=846 run_time=123
cuda-9.2-Cuda_Serial-release build_time=864 run_time=173
gcc-7.3.0-OpenMP-release build_time=156 run_time=44
gcc-7.3.0-Pthread-release build_time=141 run_time=56
gcc-8.3.0-Serial-release build_time=153 run_time=47
gcc-9.1-OpenMP-release build_time=190 run_time=44
gcc-9.1-Serial-release build_time=176 run_time=46
intel-17.0.1-Serial-release build_time=341 run_time=50
intel-18.0.5-OpenMP-release build_time=597 run_time=47
intel-19.0.5-Pthread-release build_time=290 run_time=55

@ndellingwood

@lucbv checking in to see if this is ready for merge following review comments?


lucbv commented Oct 21, 2020

@ndellingwood sorry, I have not had time to look at @brian-kelley's responses; I will get it done tonight or tomorrow. Thanks for reminding me!

@brian-kelley

@lucbv Sorry I haven't finished this yet; the only thing still to do is to see whether using TeamThread/ThreadVector loops in SpMV caused performance regressions. I made changes based on all your other comments, though.


@lucbv lucbv left a comment


@brian-kelley I just have one question remaining here.
Thanks for addressing my previous questions.

@@ -392,7 +399,7 @@ KOKKOS_INLINE_FUNCTION
void impl_team_gemm_block(const TeamHandle& team, const ViewTypeC& C, const ViewTypeA& A, const ViewTypeB& B) {
typedef typename ViewTypeC::non_const_value_type ScalarC;
// GNU COMPILER BUG WORKAROUND
#if defined(KOKKOS_COMPILER_GNU) || !defined(__CUDA_ARCH__)
#if defined(KOKKOS_COMPILER_GNU) && (!defined(__CUDA_ARCH__) || !defined(__HIP_DEVICE_COMPILE__))

Is this correct? It seems that the && might need to be replaced with ||.


I am totally fine if the new behavior is what we want to implement, I just wanted to point out that the logic probably changed here


@lucbv I noticed this was different from most of the other GCC bug workarounds involving const, which used defined(KOKKOS_COMPILER_GNU) && !defined(__CUDA_ARCH__). || would mean "anything that's not CUDA device code", which I don't think is what was intended.


Yeah, that makes sense to me; I just wanted to confirm, since this was clearly changing the logic.
I'm good with this PR.


@lucbv lucbv left a comment


I am fine with the changes in this PR, thanks @brian-kelley for all this work!


@lucbv lucbv left a comment


This is fine with me.


lucbv commented Oct 27, 2020

@brian-kelley I see that you are adding more things to this PR; let me know when you want a new round of review.

@brian-kelley

@lucbv Right now I'm just re-running the spot checks for the SpMV changes (RangePolicy for non-GPU) and fixing the minor errors that come up. Basically, with your last review you've seen everything.

(For some more detail on that: I did actually see a ~10% slowdown for some combinations of mode, layout, and rank on CTS-1, but switching to RangePolicy made all of them at least as fast as before.)

@brian-kelley

Last round of testing:
#######################################################
PASSED TESTS
#######################################################
clang-8.0-Cuda_OpenMP-release build_time=714 run_time=135
clang-8.0-Pthread_Serial-release build_time=312 run_time=155
clang-9.0.0-Pthread-release build_time=160 run_time=62
clang-9.0.0-Serial-release build_time=162 run_time=47
cuda-10.1-Cuda_OpenMP-release build_time=884 run_time=135
cuda-11.0-Cuda_OpenMP-release build_time=902 run_time=138
cuda-9.2-Cuda_Serial-release build_time=853 run_time=205
gcc-7.3.0-OpenMP-release build_time=165 run_time=47
gcc-7.3.0-Pthread-release build_time=130 run_time=63
gcc-8.3.0-Serial-release build_time=171 run_time=50
gcc-9.1-OpenMP-release build_time=210 run_time=47
gcc-9.1-Serial-release build_time=192 run_time=48
intel-17.0.1-Serial-release build_time=394 run_time=51
intel-18.0.5-OpenMP-release build_time=697 run_time=50
intel-19.0.5-Pthread-release build_time=355 run_time=63

@brian-kelley brian-kelley merged commit 513667d into kokkos:develop Oct 27, 2020
@brian-kelley brian-kelley deleted the HIP_Algorithms branch October 27, 2020 18:55
@lucbv lucbv mentioned this pull request Oct 28, 2020
@ndellingwood

@brian-kelley removal of the RCM struct and capabilities in this PR will break MueLu (noticed in some testing with VOTD kokkos-kernels) in the CuthillMcKee and ReverseCuthillMcKee routines in the MueLu utilities:
https://github.com/trilinos/Trilinos/blob/master/packages/muelu/src/Utils/MueLu_Utilities_kokkos_def.hpp

Is there WIP for corresponding updates in MueLu?


lucbv commented Oct 29, 2020

@ndellingwood
@brian-kelley and I discussed the removal of the reordering algorithms, and we have not yet propagated this change to MueLu (thanks for reminding us!), but that's the plan : )

@ndellingwood

@lucbv thanks for the update. If you don't mind, mention me in PRs with the changes to keep me in the loop; those changes will be a blocker for 3.3 integration testing.

brian-kelley added a commit to brian-kelley/kokkos-kernels that referenced this pull request Dec 5, 2020
In PR kokkos#828, I changed the default D1 coloring algorithm for CUDA
from EB to VBBIT. Although VBBIT is faster for fairly balanced/regular
problems, it is causing an increase in MTGS + GMRES iterations
in some Ifpack2 tests compared to EB. This causes random failures
since the tests expect convergence within a certain number of iters.