Spmv cusparse tpl #618

lucbv · 2020-02-25T15:35:34Z

Adding a cusparse backend for SpMV so that users can benefit from the latest improvements in NVIDIA's code.
This will potentially provide better performance across GPU architectures, since it is difficult to tune the SpMV kernel for every graphics card and compute capability.
The support will be provided for cuda 9.2 and 10 to begin with.

ndellingwood · 2020-02-25T16:42:26Z

The support will be provided for cuda 9.2 and 10 to begin with

@lucbv Kokkos support is for cuda >= ~~9.2~~ 9.0 now, so you shouldn't need to worry about older versions of cuda (in case that was a concern); does this PR support 10.0 and 10.1?

Edit: Mistake in Cuda version support

crtrott · 2020-02-25T16:43:12Z

Nathan: I think we are at >= 9.0 not 9.2

ndellingwood · 2020-02-25T16:44:14Z

Edited my comment, thanks for clarification.

lucbv · 2020-02-25T16:58:24Z

@ndellingwood
currently this PR supports building without failing x)
It's really an early effort, and I am currently trying to write a unit test that exercises the tpl code path when it is enabled.
Once I have that, I should be able to put in the actual cusparse interface. At that point I will look at how the details of the cusparse API constrain what can be supported from within kokkos-kernels.

mhoemmen · 2020-02-26T16:27:39Z

src/impl/tpls/KokkosSparse_spmv_tpl_spec_decl.hpp

+  /* initialize cusparse library */
+  status = cusparseCreate(&handle);
+  if (status != CUSPARSE_STATUS_SUCCESS) {
+    throw("KokkosSparse::spmv[TPL_CUSPARSE,double]: cusparse was not initialized correctly");


Please don't throw const char[]. Use a subclass of std::exception.

Kokkos should have a function for reporting errors; please use that instead of throwing.

Once things are actually working (right now the spmv does not give the right answer), I'll look into the proper way of throwing exceptions or reporting errors. We must have some mechanism for that in the other tpl interfaces, maybe in the cublas_gemm interface for instance?

@mhoemmen
I looked into Kokkos_Cuda_Error.hpp, and the main blocker to using CUDA_SAFE_CALL is that it takes the type cudaError, whereas the cusparse calls return cusparseStatus_t. For now I have added better throws that use std::runtime_error.
I will add to the todo list the creation of a CUSPARSE_SAFE_CALL macro that replicates the mechanism demonstrated in Kokkos, but for the cusparse return type.
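For illustration, the CUSPARSE_SAFE_CALL idea mentioned above could look roughly like the sketch below. Everything prefixed with `sketch_` or `SKETCH_` is a hypothetical stand-in so that the snippet compiles without CUDA; with the real library one would include `<cusparse.h>` and compare a `cusparseStatus_t` against `CUSPARSE_STATUS_SUCCESS` instead. This is not the macro that ended up in kokkos-kernels, only one way it might be written.

```cpp
#include <sstream>
#include <stdexcept>

// Stand-in for cusparseStatus_t so the sketch builds without CUDA (assumption).
enum sketch_status_t {
  SKETCH_STATUS_SUCCESS         = 0,
  SKETCH_STATUS_NOT_INITIALIZED = 1
};

// Hypothetical helper: throws std::runtime_error naming the failing call.
inline void sketch_safe_call_impl(sketch_status_t status, const char* call,
                                  const char* file, int line) {
  if (status != SKETCH_STATUS_SUCCESS) {
    std::ostringstream msg;
    msg << call << " failed with status " << static_cast<int>(status)
        << " at " << file << ":" << line;
    throw std::runtime_error(msg.str());
  }
}

// Hypothetical macro: captures the call text and source location for the message.
#define SKETCH_SAFE_CALL(call) \
  sketch_safe_call_impl((call), #call, __FILE__, __LINE__)

// Fake library call, used only to exercise the macro.
inline sketch_status_t sketch_create(bool succeed) {
  return succeed ? SKETCH_STATUS_SUCCESS : SKETCH_STATUS_NOT_INITIALIZED;
}

// Returns true if the guarded call threw std::runtime_error.
inline bool sketch_call_throws(bool succeed) {
  try {
    SKETCH_SAFE_CALL(sketch_create(succeed));
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}
```

Routing every status check through one helper keeps the error text consistent and removes the repeated `if (status != ...) throw` blocks from the specialization macros.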

mhoemmen · 2020-02-26T16:27:58Z

src/impl/tpls/KokkosSparse_spmv_tpl_spec_decl.hpp

+
+  /* create matrix */
+  cusparseSpMatDescr_t A_cusparse;
+  status = cusparseCreateCsr(&A_cusparse, A.numRows(), A.numCols(), A.nnz(),


You're exposing cuSPARSE functions in the header file :(

I am actually thinking of creating a small utility function that takes in a KokkosSparse::CrsMatrix and returns a cusparseSpMatDescr_t, so part of this code will disappear down the road...
I will also look into the other TPL specializations we have in kokkos-kernels and see how things are implemented. Depending on that, I might move things around accordingly.

@lucbv Check out how I hid the TPL header file includes for cuSOLVER in tpetra/tsqr/src.

@mhoemmen as discussed, I will outline a strategy to reduce the size of the macros in KokkosSparse_spmv_tpl_spec_decl.hpp and hide the header in an implementation file.

lucbv · 2020-03-03T14:48:50Z

Alright, this is now passing cm_test_all_sandia --spot-check-tpls. I do not think the build on bowman was useful for checking that the tpl works correctly, but at least we know it does not break the regular code path.

White

cuda-10.1.105-Cuda_Serial-release build_time=288 run_time=270
cuda-9.2.88-Cuda_OpenMP-release build_time=325 run_time=233
gcc-7.2.0-OpenMP-release build_time=87 run_time=52
gcc-7.2.0-Serial-release build_time=76 run_time=98
gcc-7.4.0-OpenMP-release build_time=88 run_time=40

Bowman

intel-16.4.258-Pthread-release build_time=523 run_time=296
intel-16.4.258-Pthread_Serial-release build_time=682 run_time=530
intel-16.4.258-Serial-release build_time=520 run_time=250
intel-17.2.174-OpenMP-release build_time=683 run_time=91
intel-17.2.174-OpenMP_Serial-release build_time=848 run_time=296
intel-17.2.174-Pthread-release build_time=584 run_time=207
intel-17.2.174-Pthread_Serial-release build_time=775 run_time=416
intel-17.2.174-Serial-release build_time=550 run_time=197

Kokkos-dev

#######################################################
PASSED TESTS
#######################################################
clang-4.0.1-Pthread_Serial-hwloc-release build_time=129 run_time=147
clang-4.0.1-Pthread_Serial-release build_time=123 run_time=181
clang-7.0.1-Cuda_OpenMP-hwloc-release build_time=349 run_time=281
clang-7.0.1-Cuda_OpenMP-release build_time=339 run_time=278
cuda-9.2-Cuda_OpenMP-release build_time=496 run_time=391
gcc-5.3.0-OpenMP-hwloc-release build_time=95 run_time=48
gcc-5.3.0-OpenMP-release build_time=96 run_time=47
gcc-7.3.0-Serial-hwloc-release build_time=80 run_time=84
gcc-7.3.0-Serial-release build_time=89 run_time=83
intel-17.0.1-OpenMP-hwloc-release build_time=158 run_time=33
intel-17.0.1-OpenMP-release build_time=158 run_time=33

lucbv · 2020-03-03T14:51:10Z

@srajama1 @ndellingwood any thoughts on this PR?
I am keen on keeping it short and merging it, since it actually works as intended now.
However, in a second PR I would like to work on creating some utilities, such as a Kokkos-to-cusparse matrix converter, and also discuss hiding the headers some more with @mhoemmen.

ndellingwood · 2020-03-03T18:37:09Z

@lucbv I haven't looked carefully, but wanted to check: is the deprecated cusparse matrix-vector interface (*csrmv) supported by all the versions of Cuda we need to support?

lucbv · 2020-03-03T21:18:51Z

@ndellingwood the tpl is guarded to make sure it only gets enabled for cuda 9 and up.
The *csrmv routines are defined correctly in cuda 9 so there is no issue with it.

lucbv · 2020-03-04T00:20:59Z

@mhoemmen if you don't see any inconvenience with this I will update issue #618 to explain how I plan to address further your comments in a subsequent PR that will refactor this code. But I would like for the current version to be merged.

I will re-run the spot-check since I did one more push for the exception handling.

mhoemmen · 2020-03-05T16:35:42Z

@lucbv wrote:

if you don't see any inconvenience with this I will update issue #618 to explain how I plan to address further your comments in a subsequent PR that will refactor this code. But I would like for the current version to be merged.

I'm cool with that -- thanks! :-D

lucbv · 2020-03-05T18:21:27Z

@mhoemmen @ndellingwood @srajama1
The code is ready and passes the spot-check. I have documented some next steps in #618 to clean things up a bit, but that does not need to happen before this is merged. Let me know if you have any strong objections to something in the PR, or if we can merge it.

White --spot-check

#######################################################
PASSED TESTS
#######################################################
cuda-10.1.105-Cuda_OpenMP-release build_time=329 run_time=250
cuda-10.1.105-Cuda_Serial-release build_time=297 run_time=298
cuda-9.2.88-Cuda_OpenMP-release build_time=293 run_time=328
cuda-9.2.88-Cuda_Serial-release build_time=297 run_time=376
gcc-6.4.0-OpenMP_Serial-release build_time=126 run_time=165
gcc-7.2.0-OpenMP-release build_time=84 run_time=57
gcc-7.2.0-OpenMP_Serial-release build_time=117 run_time=163
gcc-7.2.0-Serial-release build_time=77 run_time=102
ibm-16.1.0-Serial-release build_time=403 run_time=114

White --spot-check-tpls

#######################################################
PASSED TESTS
#######################################################
cuda-10.1.105-Cuda_Serial-release build_time=302 run_time=269
cuda-9.2.88-Cuda_OpenMP-release build_time=313 run_time=305
gcc-7.2.0-OpenMP-release build_time=89 run_time=52
gcc-7.2.0-Serial-release build_time=74 run_time=97
gcc-7.4.0-OpenMP-release build_time=90 run_time=40

Kokkos-dev --spot-check

#######################################################
PASSED TESTS
#######################################################
clang-4.0.1-Pthread_Serial-hwloc-release build_time=126 run_time=159
clang-4.0.1-Pthread_Serial-release build_time=129 run_time=217
clang-7.0.1-Cuda_OpenMP-hwloc-release build_time=345 run_time=288
clang-7.0.1-Cuda_OpenMP-release build_time=348 run_time=321
cuda-9.2-Cuda_OpenMP-release build_time=401 run_time=414
gcc-5.3.0-OpenMP-hwloc-release build_time=98 run_time=57
gcc-5.3.0-OpenMP-release build_time=95 run_time=55
gcc-7.3.0-Serial-hwloc-release build_time=81 run_time=92
gcc-7.3.0-Serial-release build_time=81 run_time=91
intel-17.0.1-OpenMP-hwloc-release build_time=168 run_time=56
intel-17.0.1-OpenMP-release build_time=167 run_time=54

ndellingwood

@lucbv I added a couple comments, thanks for working on this!

ndellingwood · 2020-03-05T23:20:40Z

src/impl/tpls/KokkosSparse_spmv_tpl_spec_decl.hpp

@@ -30,7 +30,7 @@
 // CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 // EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 // PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// PROFIS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF


typo here, maybe accidental bump on the keyboard?

Yeah, likely an inadvertent keystroke... easy to fix though

ndellingwood · 2020-03-05T23:21:46Z

src/impl/tpls/KokkosSparse_spmv_tpl_spec_decl.hpp

+    }									\
+  };
+
+  KOKKOSSPARSE_SPMV_CUSPARSE(double, int, Kokkos::LayoutLeft,  true)


Is complex support to be added later?

Potentially, but I am a bit worried about the alignment of cuComplex versus that of Kokkos::complex. For that reason I would like to defer this, to allow careful testing when that feature goes in.

I'll need to look a bit more carefully at the ETI TPL layer you added, but if complex is not enabled then does this mean kokkos-kernels spmv will use the existing fallback implementation? If not, that'll need to be addressed or else this will cause issues in Trilinos if anyone enables cusparse.

If you look at KokkosSparse_spmv_tpl_spec_avail.hpp, you will see that the tpl is not made available for complex types, so it will fall back to the kokkos-kernels implementation of spmv in that case.

@ndellingwood Let me second Luc's concern about alignment of complex. I remember y'all added an option to control alignment of Kokkos::complex; we could use that to decide whether to enable complex support.
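The alignment concern raised above can be checked at compile time. The sketch below uses two hypothetical structs rather than the real types: `natural_complex` has only the alignment of its scalar, while `aligned_complex` is over-aligned to twice the scalar size, loosely modeled on how CUDA's vector-backed complex types are declared. With the real headers available, the same check would be applied to `Kokkos::complex<T>` and `cuComplex` / `cuDoubleComplex` directly; the names here are assumptions for illustration only.

```cpp
#include <cstddef>

// Hypothetical complex type with the natural alignment of its scalar.
template <class T>
struct natural_complex {
  T re, im;
};

// Hypothetical complex type over-aligned to 2 * sizeof(T).
template <class T>
struct alignas(2 * sizeof(T)) aligned_complex {
  T re, im;
};

// True when two types agree in both size and alignment, which is the
// minimum needed before reinterpreting an array of one as the other.
template <class A, class B>
constexpr bool same_layout() {
  return sizeof(A) == sizeof(B) && alignof(A) == alignof(B);
}

// A type always matches itself.
static_assert(same_layout<aligned_complex<float>, aligned_complex<float>>(),
              "identical types must have identical layout");

// Both variants hold two scalars and occupy the same number of bytes.
static_assert(sizeof(natural_complex<float>) == sizeof(aligned_complex<float>),
              "two floats either way");
```

A guard of this kind in the TPL availability header would let complex support be enabled only for builds where the two layouts actually match.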

ndellingwood · 2020-03-06T00:46:43Z

@lucbv approved, shall I go ahead and merge?

lucbv · 2020-03-06T00:56:14Z

@ndellingwood I just pushed minor changes to address your comments, let me know what you think of them

mhoemmen

Thanks Luc!

mhoemmen · 2020-03-06T01:05:00Z

perf_test/sparse/KokkosSparse_spmv_struct_tuning.cpp

+    cusparse_Acolind(cusparse_Acolind_) {};
+
+  void doCopy() {
+    Kokkos::RangePolicy<execution_space, rowPtrTag> rowPtrPolicy(0, Arowptr.extent(0));


See Tpetra::Details::copyOffsets. That would be a useful kokkos-kernels utility.

mhoemmen · 2020-03-06T01:05:26Z

perf_test/sparse/KokkosSparse_spmv_struct_tuning.cpp

@@ -101,7 +143,7 @@ void struct_matvec(const int stencil_type,
                   const AMatrix& A,
                   const XVector& x,
                   typename YVector::const_value_type& beta,
-                   const YVector& y,
+                   YVector& y,


Why not const? It's a Kokkos::View.

I could bring back the const. The intent is to show that this view could be modified, although it is really its data that is being modified.

mhoemmen · 2020-03-06T01:05:50Z

perf_test/sparse/KokkosSparse_spmv_struct_tuning.cpp

-      	  Acolind_cusparse[idx] = Acolind[idx];
-      	});
+      cusparse_int_type Arowptr_cusparse("Arowptr", Arowptr.extent(0));
+      cusparse_int_type Acolind_cusparse("Acolind", Acolind.extent(0));


You shouldn't need to allocate column indices unless they don't have type int.

That's correct, but since this is a test I did not want to write too much logic around the type of the ordinal.
It would make the test a bit faster, but this part is not timed and we already know that the cusparse interface is changing in future versions. Ideally this test should be rewritten down the road for the new interface, which will make this a bit cleaner...

mhoemmen · 2020-03-06T01:06:13Z

src/impl/tpls/KokkosSparse_spmv_tpl_spec_decl.hpp

+    // using ordinal_type = typename AMatrix::non_const_ordinal_type;
+    using value_type   = typename AMatrix::non_const_value_type;
+
+#if defined(CUSPARSE_VERSION) && (10300 <= CUSPARSE_VERSION)


I thought it was CUDA 10.1 that added the new interface.

I think it's a newer "patch" update to 10.1 that added the new interface, 10.1.patch, but it is not included in all versions of 10.1 (in particular, not in the version we have a sems module for in nightly testing).

It's a bit complicated, but the bottom line is that cuda and cusparse do not necessarily have the same version numbering.

| Release number | CUDA_VERSION | CUSPARSE_VERSION |
|---|---|---|
| 10.1.168 | 10100 | undef |
| 10.1.243 | 10100 | 10300 |

Note that in release 10.1.168, cusparse does define the following macros:

    #define CUSPARSE_VER_MAJOR 10
    #define CUSPARSE_VER_MINOR 2
    #define CUSPARSE_VER_PATCH 0
    #define CUSPARSE_VER_BUILD 0

whereas in 10.1.243 it defines

    #define CUSPARSE_VER_MAJOR 10
    #define CUSPARSE_VER_MINOR 3
    #define CUSPARSE_VER_PATCH 0
    #define CUSPARSE_VER_BUILD 243
    #define CUSPARSE_VERSION (CUSPARSE_VER_MAJOR * 1000 + \
                              CUSPARSE_VER_MINOR *  100 + \
                              CUSPARSE_VER_PATCH)
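To make the arithmetic behind the `10300 <= CUSPARSE_VERSION` guard concrete, here is a small sketch that applies the quoted formula to the component values listed above. The function name `cusparse_version` is hypothetical; the real header only provides the macros.

```cpp
// Mirrors the formula quoted above: MAJOR * 1000 + MINOR * 100 + PATCH.
// The BUILD component does not enter the version number.
constexpr int cusparse_version(int major, int minor, int patch) {
  return major * 1000 + minor * 100 + patch;
}

// Component values quoted for release 10.1.243: this is the 10300 in the table.
static_assert(cusparse_version(10, 3, 0) == 10300,
              "10.1.243 satisfies the 10300 guard");

// Release 10.1.168 defines the components but not CUSPARSE_VERSION itself.
// Had it applied the same formula, the result would still sit below the guard.
static_assert(cusparse_version(10, 2, 0) == 10200,
              "components of 10.1.168 give 10200");
static_assert(cusparse_version(10, 2, 0) < 10300,
              "10.1.168 would not enable the new interface");
```

Because `CUSPARSE_VERSION` is undefined in 10.1.168, the `defined(CUSPARSE_VERSION) &&` part of the guard is what keeps that release on the older *csrmv code path.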

mhoemmen · 2020-03-06T01:06:54Z

src/impl/tpls/KokkosSparse_spmv_tpl_spec_decl.hpp

+    /* Initialize cusparse */
+    cusparseStatus_t cusparseStatus;
+    cusparseHandle_t cusparseHandle=0;
+    cusparseStatus = cusparseCreate(&cusparseHandle);


We should time this -- it could be expensive to create this handle.
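One way to measure this, sketched under the assumption that a plain wall-clock timer is good enough for a first look: wrap the call in a small helper built on `std::chrono`. The helper name `seconds_to_run` is hypothetical, and the snippet deliberately avoids any CUDA dependency so that it compiles on its own.

```cpp
#include <chrono>

// Hypothetical helper: runs f once and returns the elapsed wall-clock
// time in seconds, measured with a monotonic clock.
template <class F>
double seconds_to_run(F&& f) {
  const auto start = std::chrono::steady_clock::now();
  f();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

In the TPL code, the callable would wrap `cusparseCreate(&cusparseHandle)`, with the matching `cusparseDestroy` kept outside the timed region. The first creation in a process may also absorb one-time CUDA initialization, so timing a second creation separately would probably give a fairer picture of the per-call cost.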

mhoemmen · 2020-03-06T01:07:21Z

src/impl/tpls/KokkosSparse_spmv_tpl_spec_decl.hpp

+      if (std::is_same<value_type,float>::value) {
+	cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
+					A.numRows(), A.numCols(), A.nnz(),
+					(const float *) &alpha, descrA,


Prefer reinterpret_cast to C-style casts.
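A sketch of what that could look like. The cast is needed at all because both branches of the `std::is_same` test are compiled for every scalar type, even though only one runs. The functions `read_float` and `read_double` are hypothetical stand-ins for routines such as `cusparseScsrmv` that take their scalar through a typed pointer.

```cpp
#include <type_traits>

// Hypothetical stand-ins for library routines taking a typed scalar pointer.
inline double read_float(const float* p) { return static_cast<double>(*p); }
inline double read_double(const double* p) { return *p; }

// Forwards alpha to the routine matching its type. Both branches must
// compile for every Scalar, hence the explicit pointer casts.
template <class Scalar>
double forward_scalar(const Scalar& alpha) {
  static_assert(std::is_same<Scalar, float>::value ||
                    std::is_same<Scalar, double>::value,
                "sketch only handles float and double");
  if (std::is_same<Scalar, float>::value) {
    // Unlike a C-style cast, reinterpret_cast refuses to drop const,
    // so the const-correctness of the call is still checked.
    return read_float(reinterpret_cast<const float*>(&alpha));
  } else {
    return read_double(reinterpret_cast<const double*>(&alpha));
  }
}
```

Only the branch whose pointer type matches `Scalar` is ever executed, so the pointer that is dereferenced always has the object's real type.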

ndellingwood · 2020-03-06T01:13:39Z

@mhoemmen do you request your comments be addressed prior to merge or as a follow up PR?

mhoemmen · 2020-03-06T01:23:45Z

@ndellingwood I'm good with this -- Luc and I talked earlier this week. Sorry to delay you!

srajama1

@lucbv : One comment. It would be nice to see how expensive handle creation is. As you pointed out, it is good to make small PRs and keep improving incrementally. Can we also update the documentation? I assume we have an issue that tracks this so @ndellingwood can document this in 3.1.

ndellingwood · 2020-03-06T04:06:35Z

I assume we have an issue that tracks

@srajama1 yes, we missed adding this earlier, cross-reference issue #614

ndellingwood · 2020-03-06T04:12:23Z

I'm good with this

@mhoemmen just wanted to check; your feedback is always valuable, and I didn't want to bypass anything requiring immediate attention that was overlooked

It would be nice to see how expensive handle creation is

@srajama1 good call, this would be useful. I'm going to go ahead with the merge (I think this is safe based on your follow-up comment); this will get it through a round of nightlies (we have tpls in the nightly tests now :)

lucbv added the enhancement label Feb 25, 2020

lucbv self-assigned this Feb 25, 2020

mhoemmen reviewed Feb 26, 2020

lucbv force-pushed the Spmv_cusparse_tpl branch 2 times, most recently from fcac9ff to 0160cb5 Compare February 29, 2020 00:18

lucbv changed the title ~~[WIP]: Spmv cusparse tpl~~ Spmv cusparse tpl Feb 29, 2020

lucbv force-pushed the Spmv_cusparse_tpl branch from c4aab15 to f5d6a13 Compare March 2, 2020 23:55

lucbv force-pushed the Spmv_cusparse_tpl branch from f5d6a13 to 3185175 Compare March 4, 2020 00:13

lucbv force-pushed the Spmv_cusparse_tpl branch from 72d8f4f to 1cd23df Compare March 5, 2020 14:22

ndellingwood mentioned this pull request Mar 5, 2020

KokkosSparse_spmv_struct_tuning: add lambda guard #641

Merged

lucbv force-pushed the Spmv_cusparse_tpl branch from 1cd23df to 967b73b Compare March 5, 2020 16:31

lucbv added 2 commits March 5, 2020 09:39

Adding cusparse support in SpMV

87014fc

Supporting both cuda 9 interface and cuda 10.2 interface.
Support for float_int_int and double_int_int.
Could potentially support int64_t with cuda 10.2 interface.
Modifying the spmv_struct_tuning test to make it compile appropriately.

Removing cuda lambda guard from spmv_struct_tuning

967b73b

lucbv mentioned this pull request Mar 5, 2020

Make sure cuda lambda support is enabled for cusparse tests #640

Merged

lucbv requested review from ndellingwood and srajama1 March 5, 2020 22:53

ndellingwood reviewed Mar 5, 2020

ndellingwood approved these changes Mar 6, 2020

Adding smarter label in spmv cusparse

e328907

mhoemmen reviewed Mar 6, 2020

srajama1 approved these changes Mar 6, 2020

ndellingwood merged commit 73aa853 into kokkos:develop Mar 6, 2020

lucbv deleted the Spmv_cusparse_tpl branch March 6, 2020 17:32

kokkos-devops-admin mentioned this pull request Nov 24, 2021

Add rocBLAS GEMV wrapper #1201

Merged

kokkos-devops-admin mentioned this pull request Jun 17, 2024

Bump actions/checkout from 4.1.6 to 4.1.7 #2248

Merged
