Add HIP support to src/ and perf_test/ #828
Conversation
So as a general comment, from the point of view of the person reviewing this PR, I would have liked it if it could have been a few smaller PRs; this was real work reading through it x)
I noted at least one opportunity to factor out a little helper function, which appeared three or four times in the code.
Overall a lot of changes revolve around the use of:
kk_is_gpu_exec_space<exec_space>()
kk_get_suggested_vector_size()
kk_get_free_total_memory<mem_space>(free_mem, total_mem)
I think these functions should be advertised a bit, maybe during the next stand-up, so they can be adopted more widely by others on the team.
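For reference, here is a minimal sketch of how those helpers might fit together. The header locations, the argument list of `kk_get_suggested_vector_size`, and the use of `kk_get_exec_space_type` are written from memory, not taken from this PR.

```cpp
#include <cstddef>
#include <KokkosKernels_ExecSpaceUtils.hpp>  // kk_is_gpu_exec_space, kk_get_free_total_memory (assumed location)
#include <KokkosKernels_Utils.hpp>           // kk_get_suggested_vector_size (assumed location)

template <typename exec_space, typename mem_space>
void pick_launch_parameters(size_t num_rows, size_t num_entries) {
  using namespace KokkosKernels::Impl;

  // Branch on "is this a GPU-like backend" instead of #ifdef KOKKOS_ENABLE_CUDA / KOKKOS_ENABLE_HIP.
  if (kk_is_gpu_exec_space<exec_space>()) {
    // Derive a vector length from the sparsity pattern (argument list assumed).
    int vector_size = kk_get_suggested_vector_size(
        num_rows, num_entries, kk_get_exec_space_type<exec_space>());
    (void)vector_size;  // would feed into a TeamPolicy
  }

  // Query free/total memory of the target memory space to size scratch allocations.
  size_t free_mem = 0, total_mem = 0;
  kk_get_free_total_memory<mem_space>(free_mem, total_mem);
  (void)free_mem;
  (void)total_mem;
}
```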
Finally, the changes in SpMV could have been a PR of their own and need more investigation in terms of performance impact, although since the architectures we care about have changed, this might be a good time to make these changes.
src/sparse/impl/KokkosSparse_spgemm_impl_triangle_no_compression.hpp
iEntry++)
#endif
{
  Kokkos::parallel_for(Kokkos::ThreadVectorRange(dev, row.length),
So I tried to make these changes in the past, but I think on some architecture, maybe Intel KNL with the Intel compiler, it resulted in a loss of performance and we canned it.
Let's see what @srajama1 says about it now: do we still care about KNL performance, or can we make my dream come true and get rid of the ugly pragmas that Kokkos should help us avoid?
The same comment pretty much goes for the rest of the changes in the file.
@lucbv I'm happy to rerun some benchmarks before & after this change. I saw some comments like "// This should be a thread loop as soon as we can use C++11" so I figured this would be converted to 3-level parallelism at some point and I didn't know performance was a reason to keep it as normal for loops.
I'm pretty sure Kokkos's parallel_for over ThreadVectorRange looks like this for OpenMP:
template <typename iType, class Closure, class Member>
KOKKOS_INLINE_FUNCTION void parallel_for(
    Impl::ThreadVectorRangeBoundariesStruct<iType, Member> const&
        loop_boundaries,
    Closure const& closure,
    typename std::enable_if<Impl::is_host_thread_team_member<Member>::value>::
        type const** = nullptr) {
#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP
#pragma ivdep
#endif
  for (iType i = loop_boundaries.start; i < loop_boundaries.end;
       i += loop_boundaries.increment) {
    closure(i);
  }
}
so after inlining, I would hope that it basically turns into a regular for loop with ivdep. Then it's basically the original code but without unroll and loop count pragmas.
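To make the before/after concrete, here is a schematic of the kind of change being discussed; `row`, `dev`, and the per-entry work `f` are placeholders standing in for the diff context above, not the actual KokkosSparse code.

```cpp
#include <Kokkos_Core.hpp>

// Before: plain loop over a row's entries, with vendor pragmas guarded by Kokkos macros.
template <class Row, class Functor>
KOKKOS_INLINE_FUNCTION void process_row_before(const Row& row, const Functor& f) {
#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP
#pragma ivdep
#endif
  for (int iEntry = 0; iEntry < row.length; iEntry++) {
    f(iEntry);
  }
}

// After: the same loop expressed as a vector-level Kokkos range. On host
// backends this should inline to roughly the loop above (plus ivdep); on
// CUDA/HIP it maps onto the team's vector lanes.
template <class TeamMember, class Row, class Functor>
KOKKOS_INLINE_FUNCTION void process_row_after(const TeamMember& dev,
                                              const Row& row, const Functor& f) {
  Kokkos::parallel_for(Kokkos::ThreadVectorRange(dev, row.length),
                       [&](const int iEntry) { f(iEntry); });
}
```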
@lucbv Thanks a lot for the thorough review. I resolved the things here apart from the performance changes in SpMV, and I'm re-running the tests. Update:
@lucbv Checking in to see if this is ready for merge following the review comments?
@ndellingwood Sorry, I have not had time to look at @brian-kelley's responses; I will get it done tonight or tomorrow. Thanks for reminding me!
@lucbv Sorry I haven't finished this yet; the only thing still to do is see if using TeamThread/ThreadVector loops in SpMV caused performance regressions. I made changes based on all your other comments, though.
@brian-kelley I just have one question remaining here.
Thanks for addressing my previous questions.
@@ -392,7 +399,7 @@ KOKKOS_INLINE_FUNCTION
 void impl_team_gemm_block(const TeamHandle& team, const ViewTypeC& C, const ViewTypeA& A, const ViewTypeB& B) {
   typedef typename ViewTypeC::non_const_value_type ScalarC;
   // GNU COMPILER BUG WORKAROUND
-#if defined(KOKKOS_COMPILER_GNU) || !defined(__CUDA_ARCH__)
+#if defined(KOKKOS_COMPILER_GNU) && (!defined(__CUDA_ARCH__) || !defined(__HIP_DEVICE_COMPILE__))
Is this correct? It seems that you might want the `&&` to be replaced with `||`.
I am totally fine if the new behavior is what we want to implement; I just wanted to point out that the logic probably changed here.
@lucbv I noticed this was different than most of the other GCC bug workarounds involving const, which used `defined(KOKKOS_COMPILER_GNU) && !defined(__CUDA_ARCH__)`. `||` would mean "anything that's not CUDA device code", which I don't think is what was intended.
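For orientation, here are the three guard variants being compared, with the readings from this thread as comments; this is an illustrative snippet, not the actual file.

```cpp
// Original guard: true for any GCC compilation, and also for any compilation
// where __CUDA_ARCH__ is not defined ("anything that's not CUDA device code").
#if defined(KOKKOS_COMPILER_GNU) || !defined(__CUDA_ARCH__)
#define GUARD_ORIGINAL 1
#endif

// The pattern used by most of the other const-related GCC workarounds:
// GCC, and not CUDA device code.
#if defined(KOKKOS_COMPILER_GNU) && !defined(__CUDA_ARCH__)
#define GUARD_USUAL_PATTERN 1
#endif

// The guard as changed in this PR: top-level operator switched to &&, with
// __HIP_DEVICE_COMPILE__ added alongside __CUDA_ARCH__.
#if defined(KOKKOS_COMPILER_GNU) && \
    (!defined(__CUDA_ARCH__) || !defined(__HIP_DEVICE_COMPILE__))
#define GUARD_THIS_PR 1
#endif
```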
Yeah that makes sense to me, I just wanted to confirm as this was clearly changing the logic.
I'm good with this PR.
I am fine with the changes in this PR, thanks @brian-kelley for all this work!
Used to be a normal for loop, now it's a ThreadVectorRange
It's __HIP_DEVICE_COMPILE__, not __CUDA_ARCH__.
(same code used in 7 places)
664e5ea to 792e48f
This is fine with me
@brian-kelley I see that you are adding more things to this PR; let me know when you want a new round of review.
@lucbv Right now I'm just re-running the spot checks for the SpMV changes (RangePolicy for non-GPU) and fixing the minor errors that come up. Basically, I think with your last review you've seen everything. (For a bit more detail on that: I did actually see a ~10% slowdown for some combinations of mode, layout, and rank on CTS-1, but switching to RangePolicy made all of them at least as fast as before.)
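As a sketch of the dispatch described here (a flat RangePolicy over rows on host backends, a hierarchical TeamPolicy on GPU backends), assuming double scalars and a CRS layout; the real SpMV code in this PR is more general than this.

```cpp
#include <Kokkos_Core.hpp>
#include <KokkosKernels_ExecSpaceUtils.hpp>

template <class exec_space, class Rowmap, class Entries, class Values,
          class XVector, class YVector>
void spmv_like(const Rowmap& rowmap, const Entries& entries,
               const Values& values, const XVector& x, const YVector& y) {
  const int num_rows = static_cast<int>(rowmap.extent(0)) - 1;
  if (KokkosKernels::Impl::kk_is_gpu_exec_space<exec_space>()) {
    // GPU: one team per row, vector lanes spread across the row's entries.
    using policy = Kokkos::TeamPolicy<exec_space>;
    Kokkos::parallel_for(
        "spmv_gpu", policy(num_rows, Kokkos::AUTO),
        KOKKOS_LAMBDA(const typename policy::member_type& team) {
          const int row = team.league_rank();
          double sum = 0.0;
          Kokkos::parallel_reduce(
              Kokkos::TeamVectorRange(team, rowmap(row), rowmap(row + 1)),
              [&](const int j, double& lsum) {
                lsum += values(j) * x(entries(j));
              },
              sum);
          Kokkos::single(Kokkos::PerTeam(team), [&]() { y(row) = sum; });
        });
  } else {
    // Host: a flat loop over rows; each iteration handles one full row.
    Kokkos::parallel_for(
        "spmv_host", Kokkos::RangePolicy<exec_space>(0, num_rows),
        KOKKOS_LAMBDA(const int row) {
          double sum = 0.0;
          for (int j = rowmap(row); j < rowmap(row + 1); j++)
            sum += values(j) * x(entries(j));
          y(row) = sum;
        });
  }
}
```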
Last round of testing:
@brian-kelley removal of the … Is there WIP for corresponding updates in MueLu?
@ndellingwood
@lucbv Thanks for the update. If you don't mind, mention me in the PRs with those changes to keep me in the loop; those changes will be a blocker on 3.3 integration testing.
In PR kokkos#828, I changed the default D1 coloring algorithm for CUDA from EB to VBBIT. Although VBBIT is faster for fairly balanced/regular problems, it is causing an increase in MTGS + GMRES iterations in some Ifpack2 tests compared to EB. This causes random failures since the tests expect convergence within a certain number of iters.
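For context, the distance-1 coloring algorithm can be selected explicitly through the KokkosKernels handle rather than relying on the default; the EB and VBBIT algorithms mentioned above correspond to `KokkosGraph::COLORING_EB` and `KokkosGraph::COLORING_VBBIT`. The sketch below is written from memory, and the `graph_color` entry point and header names may differ between versions.

```cpp
#include <KokkosKernels_Handle.hpp>
#include <KokkosGraph_Distance1Color.hpp>

template <class KernelHandle, class Rowmap, class Entries>
void color_with_eb(KernelHandle& handle,
                   typename KernelHandle::nnz_lno_t num_verts,
                   const Rowmap& rowmap, const Entries& entries) {
  // Request edge-based (EB) coloring explicitly instead of the default algorithm.
  handle.create_graph_coloring_handle(KokkosGraph::COLORING_EB);
  // Square symmetric graph: num_verts rows and columns.
  KokkosGraph::Experimental::graph_color(&handle, num_verts, num_verts,
                                         rowmap, entries);
  handle.destroy_graph_coloring_handle();
}
```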
Where possible, make execution space handling generic across all GPU-like devices.
@lucbv Since these changes don't break existing supported devices, I say we just merge it now instead of waiting for ETI and caraway to update.
#######################################################
PASSED TESTS
#######################################################
clang-8.0-Cuda_OpenMP-release build_time=781 run_time=181
clang-8.0-Pthread_Serial-release build_time=260 run_time=130
clang-9.0.0-Pthread-release build_time=161 run_time=61
clang-9.0.0-Serial-release build_time=156 run_time=49
cuda-10.1-Cuda_OpenMP-release build_time=932 run_time=155
cuda-11.0-Cuda_OpenMP-release build_time=1077 run_time=1083
cuda-9.2-Cuda_Serial-release build_time=864 run_time=197
gcc-7.3.0-OpenMP-release build_time=164 run_time=253
gcc-7.3.0-Pthread-release build_time=142 run_time=55
gcc-8.3.0-Serial-release build_time=161 run_time=48
gcc-9.1-OpenMP-release build_time=206 run_time=97
gcc-9.1-Serial-release build_time=181 run_time=46
intel-17.0.1-Serial-release build_time=359 run_time=55
intel-18.0.5-OpenMP-release build_time=652 run_time=48
intel-19.0.5-Pthread-release build_time=332 run_time=67