
Add HIP support to src/ and perf_test/ #828

Merged
merged 18 commits into kokkos:develop from HIP_Algorithms on Oct 27, 2020

Conversation


@brian-kelley brian-kelley commented Oct 13, 2020

Where possible, make execution space handling generic across all GPU-like devices.

@lucbv Since these changes don't break existing supported devices, I say we just merge it now instead of waiting for ETI and caraway to update.

#######################################################
PASSED TESTS
#######################################################
clang-8.0-Cuda_OpenMP-release build_time=781 run_time=181
clang-8.0-Pthread_Serial-release build_time=260 run_time=130
clang-9.0.0-Pthread-release build_time=161 run_time=61
clang-9.0.0-Serial-release build_time=156 run_time=49
cuda-10.1-Cuda_OpenMP-release build_time=932 run_time=155
cuda-11.0-Cuda_OpenMP-release build_time=1077 run_time=1083
cuda-9.2-Cuda_Serial-release build_time=864 run_time=197
gcc-7.3.0-OpenMP-release build_time=164 run_time=253
gcc-7.3.0-Pthread-release build_time=142 run_time=55
gcc-8.3.0-Serial-release build_time=161 run_time=48
gcc-9.1-OpenMP-release build_time=206 run_time=97
gcc-9.1-Serial-release build_time=181 run_time=46
intel-17.0.1-Serial-release build_time=359 run_time=55
intel-18.0.5-OpenMP-release build_time=652 run_time=48
intel-19.0.5-Pthread-release build_time=332 run_time=67

@brian-kelley brian-kelley added WIP Work In Progress enhancement and removed WIP Work In Progress labels Oct 13, 2020
@brian-kelley brian-kelley self-assigned this Oct 14, 2020

@lucbv lucbv left a comment


So as a general comment, from the point of view of the person reviewing this PR, I would have liked it if this could have been a few smaller PRs; it was real work reading through it x)

I noted at least one opportunity to factor out a little helper function for code which appeared three or four times.

Overall a lot of changes revolve around the use of:

  1. kk_is_gpu_exec_space<exec_space>()
  2. kk_get_suggested_vector_size()
  3. kk_get_free_total_memory<mem_space>(free_mem, total_mem)

I think these functions should be advertised a bit, maybe during the next stand-up, so they can be adopted more widely by others on the team.

Finally, the changes in SpMV could have been a PR of their own and need more investigation in terms of performance impact, although since the architectures we care about have changed, this might be a good time to make these changes.

iEntry ++)
#endif
{
Kokkos::parallel_for(Kokkos::ThreadVectorRange(dev, row.length),

So I tried to make these changes in the past, but I think on some architecture, maybe Intel KNL with the Intel compiler, it resulted in a loss of performance and we canned it.
Let's see what @srajama1 says about it now: do we still care about KNL performance, or can we make my dream come true and get rid of the ugly pragmas that Kokkos should help us avoid?

The same comment pretty much goes for the rest of the changes in the file.


@lucbv I'm happy to rerun some benchmarks before and after this change. I saw comments like "// This should be a thread loop as soon as we can use C++11", so I figured this would be converted to 3-level parallelism at some point, and I didn't know performance was a reason to keep it as plain for loops.

I'm pretty sure Kokkos's parallel_for over ThreadVectorRange looks like this for OpenMP:

template <typename iType, class Closure, class Member>
KOKKOS_INLINE_FUNCTION void parallel_for(
    Impl::ThreadVectorRangeBoundariesStruct<iType, Member> const&
        loop_boundaries,
    Closure const& closure,
    typename std::enable_if<Impl::is_host_thread_team_member<Member>::value>::
        type const** = nullptr) {
#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP
#pragma ivdep
#endif
  for (iType i = loop_boundaries.start; i < loop_boundaries.end;
       i += loop_boundaries.increment) {
    closure(i);
  }
}

so after inlining, I would hope that it basically turns into a regular for loop with ivdep. Then it's basically the original code but without unroll and loop count pragmas.


brian-kelley commented Oct 14, 2020

@lucbv Thanks a lot for the thorough review. I resolved everything here apart from the performance changes in SpMV, and I'm re-running the tests.

Update:
#######################################################
PASSED TESTS
#######################################################
clang-8.0-Cuda_OpenMP-release build_time=657 run_time=124
clang-8.0-Pthread_Serial-release build_time=258 run_time=124
clang-9.0.0-Pthread-release build_time=145 run_time=54
clang-9.0.0-Serial-release build_time=152 run_time=45
cuda-10.1-Cuda_OpenMP-release build_time=880 run_time=125
cuda-11.0-Cuda_OpenMP-release build_time=846 run_time=123
cuda-9.2-Cuda_Serial-release build_time=864 run_time=173
gcc-7.3.0-OpenMP-release build_time=156 run_time=44
gcc-7.3.0-Pthread-release build_time=141 run_time=56
gcc-8.3.0-Serial-release build_time=153 run_time=47
gcc-9.1-OpenMP-release build_time=190 run_time=44
gcc-9.1-Serial-release build_time=176 run_time=46
intel-17.0.1-Serial-release build_time=341 run_time=50
intel-18.0.5-OpenMP-release build_time=597 run_time=47
intel-19.0.5-Pthread-release build_time=290 run_time=55

@ndellingwood

@lucbv checking in to see if this is ready for merge following review comments?


lucbv commented Oct 21, 2020

@ndellingwood sorry, I have not had time to look at @brian-kelley's responses; I will get it done tonight or tomorrow. Thanks for reminding me!

@brian-kelley

@lucbv Sorry I haven't finished this yet; the only thing still to do is to see whether using TeamThread/ThreadVector loops in SpMV caused performance regressions. I made changes based on all your other comments, though.


@lucbv lucbv left a comment


@brian-kelley I just have one question remaining here.
Thanks for addressing my previous questions.

@@ -392,7 +399,7 @@ KOKKOS_INLINE_FUNCTION
void impl_team_gemm_block(const TeamHandle& team, const ViewTypeC& C, const ViewTypeA& A, const ViewTypeB& B) {
typedef typename ViewTypeC::non_const_value_type ScalarC;
// GNU COMPILER BUG WORKAROUND
#if defined(KOKKOS_COMPILER_GNU) || !defined(__CUDA_ARCH__)
#if defined(KOKKOS_COMPILER_GNU) && (!defined(__CUDA_ARCH__) || !defined(__HIP_DEVICE_COMPILE__))

Is this correct? It seems that the && might need to be replaced with ||.


I am totally fine if the new behavior is what we want to implement, I just wanted to point out that the logic probably changed here


@lucbv I noticed this was different from most of the other GCC bug workarounds involving const, which used defined(KOKKOS_COMPILER_GNU) && !defined(__CUDA_ARCH__). || would mean "anything that's not CUDA device code", which I don't think is what was intended.


Yeah, that makes sense to me; I just wanted to confirm, since this was clearly changing the logic.
I'm good with this PR.


@lucbv lucbv left a comment


I am fine with the changes in this PR, thanks @brian-kelley for all this work!


@lucbv lucbv left a comment


This is fine with me.


lucbv commented Oct 27, 2020

@brian-kelley I see that you are adding more things to this PR; let me know when you want a new round of review.

@brian-kelley

@lucbv Right now I'm just re-running the spot checks for the SpMV changes (RangePolicy for non-GPU) and fixing the minor errors that come up. Basically, with your last review you've seen everything.

(For some more detail on that: I did actually see a ~10% slowdown for some combinations of mode, layout, and rank on CTS-1, but switching to RangePolicy made all of them at least as fast as before.)

@brian-kelley

Last round of testing:
#######################################################
PASSED TESTS
#######################################################
clang-8.0-Cuda_OpenMP-release build_time=714 run_time=135
clang-8.0-Pthread_Serial-release build_time=312 run_time=155
clang-9.0.0-Pthread-release build_time=160 run_time=62
clang-9.0.0-Serial-release build_time=162 run_time=47
cuda-10.1-Cuda_OpenMP-release build_time=884 run_time=135
cuda-11.0-Cuda_OpenMP-release build_time=902 run_time=138
cuda-9.2-Cuda_Serial-release build_time=853 run_time=205
gcc-7.3.0-OpenMP-release build_time=165 run_time=47
gcc-7.3.0-Pthread-release build_time=130 run_time=63
gcc-8.3.0-Serial-release build_time=171 run_time=50
gcc-9.1-OpenMP-release build_time=210 run_time=47
gcc-9.1-Serial-release build_time=192 run_time=48
intel-17.0.1-Serial-release build_time=394 run_time=51
intel-18.0.5-OpenMP-release build_time=697 run_time=50
intel-19.0.5-Pthread-release build_time=355 run_time=63

@brian-kelley brian-kelley merged commit 513667d into kokkos:develop Oct 27, 2020
@brian-kelley brian-kelley deleted the HIP_Algorithms branch October 27, 2020 18:55
@lucbv lucbv mentioned this pull request Oct 28, 2020
@ndellingwood

@brian-kelley removal of the RCM struct and capabilities in this PR will break MueLu (noticed in some testing with VOTD kokkos-kernels) in the CuthillMcKee and ReverseCuthillMcKee routines in the MueLu utilities:
https://github.com/trilinos/Trilinos/blob/master/packages/muelu/src/Utils/MueLu_Utilities_kokkos_def.hpp

Is there WIP for corresponding updates in MueLu?


lucbv commented Oct 29, 2020

@ndellingwood
@brian-kelley and I discussed the removal of the reordering algorithms, and we have not yet propagated this change to MueLu (thanks for reminding us!), but that's the plan : )

@ndellingwood

@lucbv thanks for the update. If you don't mind, mention me in PRs with the changes to keep me in the loop; those changes will be a blocker for 3.3 integration testing.

brian-kelley added a commit to brian-kelley/kokkos-kernels that referenced this pull request Dec 5, 2020
In PR kokkos#828, I changed the default D1 coloring algorithm for CUDA
from EB to VBBIT. Although VBBIT is faster for fairly balanced/regular
problems, it is causing an increase in MTGS + GMRES iterations
in some Ifpack2 tests compared to EB. This causes random failures
since the tests expect convergence within a certain number of iters.