Spmv cusparse tpl #618

Merged
merged 3 commits into from
Mar 6, 2020

@lucbv
Contributor

lucbv commented Feb 25, 2020

Adding a cusparse backend for SpMV to allow users to benefit from the latest improvements in NVIDIA's code.
This will potentially provide better performance across different GPU architectures, as it is complicated to tune the SpMV kernel for all graphics cards and compute versions.
The support will be provided for cuda 9.2 and 10 to begin with.

@lucbv lucbv self-assigned this Feb 25, 2020
@ndellingwood
Contributor

ndellingwood commented Feb 25, 2020

> The support will be provided for cuda 9.2 and 10 to begin with

@lucbv Kokkos support is for cuda >= ~~9.2~~ 9.0 now, so you shouldn't need to worry about older versions of cuda (in case that was a concern); does this PR support 10.0 and 10.1?

Edit: Mistake in Cuda version support

@crtrott
Member

crtrott commented Feb 25, 2020

Nathan: I think we are at >= 9.0 not 9.2

@ndellingwood
Contributor

Edited my comment, thanks for the clarification.

@lucbv
Contributor Author

lucbv commented Feb 25, 2020

@ndellingwood
currently this PR supports building without failing x)
It's really an early effort and I am currently trying to write a unit-test that exercises the tpl code path when enabled.
Once I have that, I should be able to actually put in the cusparse interface. At that point I can look at how the details of the cusparse API determine what kokkos-kernels can support.

/* initialize cusparse library */
status = cusparseCreate(&handle);
if (status != CUSPARSE_STATUS_SUCCESS) {
throw("KokkosSparse::spmv[TPL_CUSPARSE,double]: cusparse was not initialized correctly");
Contributor

  1. Please don't throw const char[]. Use a subclass of std::exception.
  2. Kokkos should have a function for reporting errors; please use that instead of throwing.

Contributor Author

Once things are actually working (right now the spmv does not give the right answer), I'll look into the proper way of throwing exceptions or reporting errors. We must have some mechanism for that in other TPL interfaces, maybe in the cublas_gemm interface for instance?

Contributor Author

@mhoemmen
I looked into Kokkos_Cuda_Error.hpp and the main blocker to using CUDA_SAFE_CALL is that it uses the type cudaError, whereas the cusparse calls return cusparseStatus_t. At this point I added better throws that use std::runtime_error.
I will add to the todo list the creation of a CUSPARSE_SAFE_CALL macro that replicates the mechanism used in Kokkos, but for the cusparse return type.
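
A minimal sketch of what such a status-checking macro could look like. To stay self-contained it uses a stand-in enum `fake_status_t` in place of `cusparseStatus_t`; the names `FAKE_SAFE_CALL`, `fake_safe_call_impl`, `fake_create_ok` and `fake_create_bad` are hypothetical and are not part of Kokkos, kokkos-kernels, or cuSPARSE.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-in for cusparseStatus_t (assumption: real code would include <cusparse.h>).
enum fake_status_t { FAKE_STATUS_SUCCESS = 0, FAKE_STATUS_NOT_INITIALIZED = 1 };

// Hypothetical helper: turns a non-success status into a std::runtime_error
// that records the failing expression and its source location.
inline void fake_safe_call_impl(fake_status_t status, const char* expr,
                                const char* file, int line) {
  if (status != FAKE_STATUS_SUCCESS) {
    std::ostringstream msg;
    msg << expr << " failed with status " << static_cast<int>(status)
        << " at " << file << ":" << line;
    throw std::runtime_error(msg.str());
  }
}

// Hypothetical macro mirroring the CUDA_SAFE_CALL idea for a different status type.
#define FAKE_SAFE_CALL(call) \
  fake_safe_call_impl((call), #call, __FILE__, __LINE__)

// Stand-ins for library calls that succeed or fail.
inline fake_status_t fake_create_ok() { return FAKE_STATUS_SUCCESS; }
inline fake_status_t fake_create_bad() { return FAKE_STATUS_NOT_INITIALIZED; }
```

Wrapping a call as `FAKE_SAFE_CALL(fake_create_bad());` throws an exception whose message names the failing expression, which is the part a bare `throw("...")` of a string literal cannot offer to callers catching `std::exception`.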


/* create matrix */
cusparseSpMatDescr_t A_cusparse;
status = cusparseCreateCsr(&A_cusparse, A.numRows(), A.numCols(), A.nnz(),
Contributor

You're exposing cuSPARSE functions in the header file :(

Contributor Author

I am actually thinking of creating a small utility function that takes in a Kokkos::CrsMatrix and returns a cusparseSpMatDescr_t, so part of this code will disappear down the road...
I will also look into the other TPL specializations we have in kokkos-kernels and see how things are implemented. Depending on that I might move things around accordingly.

Contributor

@lucbv Check out how I hid the TPL header file includes for cuSOLVER in tpetra/tsqr/src.

Contributor Author

@mhoemmen as discussed I will outline a strategy to reduce the size of the macros in the KokkosSparse_spmv_tpl_spec_decl.hpp and hide the header in an implementation file.

@lucbv lucbv force-pushed the Spmv_cusparse_tpl branch 2 times, most recently from fcac9ff to 0160cb5 Compare February 29, 2020 00:18
@lucbv lucbv changed the title [WIP]: Spmv cusparse tpl Spmv cusparse tpl Feb 29, 2020
@lucbv lucbv force-pushed the Spmv_cusparse_tpl branch from c4aab15 to f5d6a13 Compare March 2, 2020 23:55
@lucbv
Contributor Author

lucbv commented Mar 3, 2020

Alright, this is now passing cm_test_all_sandia --spot-check-tpls. I do not think the build on Bowman was useful for checking that the TPL works correctly, but at least we know it does not break the regular code path.

White

cuda-10.1.105-Cuda_Serial-release build_time=288 run_time=270
cuda-9.2.88-Cuda_OpenMP-release build_time=325 run_time=233
gcc-7.2.0-OpenMP-release build_time=87 run_time=52
gcc-7.2.0-Serial-release build_time=76 run_time=98
gcc-7.4.0-OpenMP-release build_time=88 run_time=40

Bowman

intel-16.4.258-Pthread-release build_time=523 run_time=296
intel-16.4.258-Pthread_Serial-release build_time=682 run_time=530
intel-16.4.258-Serial-release build_time=520 run_time=250
intel-17.2.174-OpenMP-release build_time=683 run_time=91
intel-17.2.174-OpenMP_Serial-release build_time=848 run_time=296
intel-17.2.174-Pthread-release build_time=584 run_time=207
intel-17.2.174-Pthread_Serial-release build_time=775 run_time=416
intel-17.2.174-Serial-release build_time=550 run_time=197

Kokkos-dev

#######################################################
PASSED TESTS
#######################################################
clang-4.0.1-Pthread_Serial-hwloc-release build_time=129 run_time=147
clang-4.0.1-Pthread_Serial-release build_time=123 run_time=181
clang-7.0.1-Cuda_OpenMP-hwloc-release build_time=349 run_time=281
clang-7.0.1-Cuda_OpenMP-release build_time=339 run_time=278
cuda-9.2-Cuda_OpenMP-release build_time=496 run_time=391
gcc-5.3.0-OpenMP-hwloc-release build_time=95 run_time=48
gcc-5.3.0-OpenMP-release build_time=96 run_time=47
gcc-7.3.0-Serial-hwloc-release build_time=80 run_time=84
gcc-7.3.0-Serial-release build_time=89 run_time=83
intel-17.0.1-OpenMP-hwloc-release build_time=158 run_time=33
intel-17.0.1-OpenMP-release build_time=158 run_time=33

@lucbv
Contributor Author

lucbv commented Mar 3, 2020

@srajama1 @ndellingwood any thoughts on this PR?
I am keen on keeping it short and merging it since it actually works as intended now.
However I would like to work in a second PR on creating some utilities, such as a Kokkos to Cusparse matrix converter and also discuss with @mhoemmen some more on hiding the headers.

@ndellingwood
Contributor

@lucbv I haven't looked carefully, but wanted to check that using the deprecated cusparse matrix-vector interface (*csrmv) is supported by all the versions of Cuda we need to support?

@lucbv
Contributor Author

lucbv commented Mar 3, 2020

@ndellingwood the tpl is guarded to make sure it only gets enabled for cuda 9 and up.
The *csrmv routines are defined correctly in cuda 9 so there is no issue with it.

@lucbv lucbv force-pushed the Spmv_cusparse_tpl branch from f5d6a13 to 3185175 Compare March 4, 2020 00:13
@lucbv
Contributor Author

lucbv commented Mar 4, 2020

@mhoemmen if you don't see any inconvenience with this I will update issue #618 to explain how I plan to address further your comments in a subsequent PR that will refactor this code. But I would like for the current version to be merged.

I will re-run the spot-check since I did one more push for the exception handling.

@mhoemmen
Contributor

mhoemmen commented Mar 5, 2020

@lucbv wrote:

> if you don't see any inconvenience with this I will update issue #618 to explain how I plan to address further your comments in a subsequent PR that will refactor this code. But I would like for the current version to be merged.

I'm cool with that -- thanks! :-D

lucbv added 2 commits March 5, 2020 09:39
Supporting both cuda 9 interface and cuda 10.2 interface
Support for float_int_int and double_int_int
Could potentially support int64_t with cuda 10.2 interface.
Modifying the spmv_struct_tunning test to make it compile appropriately.
@lucbv
Contributor Author

lucbv commented Mar 5, 2020

@mhoemmen @ndellingwood @srajama1
code is ready and passes spot-check, I have documented some next steps in #618 to clean things up a bit but it does not need to happen now for this to be merged. Let me know if you have strong objection to something in the PR or if we can merge it.

White --spot-check

#######################################################
PASSED TESTS
#######################################################
cuda-10.1.105-Cuda_OpenMP-release build_time=329 run_time=250
cuda-10.1.105-Cuda_Serial-release build_time=297 run_time=298
cuda-9.2.88-Cuda_OpenMP-release build_time=293 run_time=328
cuda-9.2.88-Cuda_Serial-release build_time=297 run_time=376
gcc-6.4.0-OpenMP_Serial-release build_time=126 run_time=165
gcc-7.2.0-OpenMP-release build_time=84 run_time=57
gcc-7.2.0-OpenMP_Serial-release build_time=117 run_time=163
gcc-7.2.0-Serial-release build_time=77 run_time=102
ibm-16.1.0-Serial-release build_time=403 run_time=114

White --spot-check-tpls

#######################################################
PASSED TESTS
#######################################################
cuda-10.1.105-Cuda_Serial-release build_time=302 run_time=269
cuda-9.2.88-Cuda_OpenMP-release build_time=313 run_time=305
gcc-7.2.0-OpenMP-release build_time=89 run_time=52
gcc-7.2.0-Serial-release build_time=74 run_time=97
gcc-7.4.0-OpenMP-release build_time=90 run_time=40

Kokkos-dev --spot-check

#######################################################
PASSED TESTS
#######################################################
clang-4.0.1-Pthread_Serial-hwloc-release build_time=126 run_time=159
clang-4.0.1-Pthread_Serial-release build_time=129 run_time=217
clang-7.0.1-Cuda_OpenMP-hwloc-release build_time=345 run_time=288
clang-7.0.1-Cuda_OpenMP-release build_time=348 run_time=321
cuda-9.2-Cuda_OpenMP-release build_time=401 run_time=414
gcc-5.3.0-OpenMP-hwloc-release build_time=98 run_time=57
gcc-5.3.0-OpenMP-release build_time=95 run_time=55
gcc-7.3.0-Serial-hwloc-release build_time=81 run_time=92
gcc-7.3.0-Serial-release build_time=81 run_time=91
intel-17.0.1-OpenMP-hwloc-release build_time=168 run_time=56
intel-17.0.1-OpenMP-release build_time=167 run_time=54

Contributor

@ndellingwood ndellingwood left a comment

@lucbv I added a couple comments, thanks for working on this!

@@ -30,7 +30,7 @@
 // CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 // EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 // PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// PROFIS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
Contributor

typo here, maybe accidental bump on the keyboard?

Contributor Author

Yeah, likely an inadvertent keystroke... easy to fix though

} \
};

KOKKOSSPARSE_SPMV_CUSPARSE(double, int, Kokkos::LayoutLeft, true)
Contributor

Is complex support to be added later?

Contributor Author

Potentially, but I am a bit worried about the alignment of cuComplex versus that of Kokkos::complex. For that reason I would like to defer this, to allow careful testing when that feature goes in.

Contributor

I'll need to look a bit more carefully at the ETI TPL layer you added, but if complex is not enabled then does this mean kokkos-kernels spmv will use the existing fallback implementation? If not, that'll need to be addressed or else this will cause issues in Trilinos if anyone enables cusparse.

Contributor Author

If you look at KokkosSparse_spmv_tpl_spec_avail.hpp you will see that the TPL is not made available for complex types, so it will fall back to the kokkos-kernels implementation of spmv in that case.
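
The fallback described here can be illustrated with a simplified availability trait. This is only a sketch of the general pattern, not the actual contents of KokkosSparse_spmv_tpl_spec_avail.hpp; the names `demo_tpl_spec_avail` and `demo_spmv` are hypothetical, and the real trait is parameterized on more than the scalar type.

```cpp
#include <complex>
#include <string>

// Hypothetical availability trait: false unless a specialization opts in.
template <class Scalar>
struct demo_tpl_spec_avail {
  static constexpr bool value = false;
};

// Opt in only the scalar types the TPL path covers in this sketch.
template <>
struct demo_tpl_spec_avail<float> {
  static constexpr bool value = true;
};
template <>
struct demo_tpl_spec_avail<double> {
  static constexpr bool value = true;
};

// Implementation selected on the trait value; the primary template is the
// native fallback used whenever the trait is false.
template <class Scalar, bool tpl_avail = demo_tpl_spec_avail<Scalar>::value>
struct demo_spmv {
  static std::string run() { return "native"; }
};

// Partial specialization chosen when the TPL is marked available.
template <class Scalar>
struct demo_spmv<Scalar, true> {
  static std::string run() { return "tpl"; }
};
```

With this layout, `demo_spmv<double>::run()` selects the TPL branch while `demo_spmv<std::complex<double>>::run()` selects the native one, because no specialization of the trait opts the complex type in.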

Contributor

@ndellingwood Let me second Luc's concern about alignment of complex. I remember y'all added an option to control alignment of Kokkos::complex; we could use that to decide whether to enable complex support.
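
One way to make the alignment concern checkable at compile time is a small trait comparing size and alignment. The sketch below uses two stand-in structs rather than the real `cuDoubleComplex` and `Kokkos::complex<double>`; the 16-byte alignment of the first stand-in is an assumption chosen for illustration, and `demo_layout_compatible` is a hypothetical name.

```cpp
#include <cstddef>

// Stand-in for a complex type with an explicit 16-byte alignment requirement
// (assumption: plays the role of the vendor type in this sketch).
struct alignas(16) demo_aligned_complex {
  double re, im;
};

// Stand-in for a complex type with only the natural alignment of its members.
struct demo_plain_complex {
  double re, im;
};

// Hypothetical trait: true when storage laid out for A also satisfies the
// size and alignment requirements of B, so an A* may be handed to code that
// expects a B*.
template <class A, class B>
struct demo_layout_compatible {
  static constexpr bool value =
      sizeof(A) == sizeof(B) && alignof(A) >= alignof(B);
};
```

A TPL specialization for complex scalars could then be enabled only when such a check holds for the pair of types involved, which is the decision suggested in the comment above.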

@ndellingwood
Contributor

@lucbv approved, shall I go ahead and merge?

@lucbv
Contributor Author

lucbv commented Mar 6, 2020

@ndellingwood I just pushed minor changes to address your comments, let me know what you think of them

Contributor

@mhoemmen mhoemmen left a comment

Thanks Luc!

cusparse_Acolind(cusparse_Acolind_) {};

void doCopy() {
Kokkos::RangePolicy<execution_space, rowPtrTag> rowPtrPolicy(0, Arowptr.extent(0));
Contributor

See Tpetra::Details::copyOffsets. That would be a useful kokkos-kernels utility.
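
A host-only sketch of what such a utility might do: copy offsets into a narrower integer type and refuse to truncate silently. This is not the Tpetra implementation, which operates on Kokkos::View; `demo_copy_offsets` is a hypothetical name and the sketch assumes the source type is unsigned and at least as wide as the destination.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper: element-wise copy with an overflow check.
// Assumes Src is an unsigned type at least as wide as Dst.
template <class Dst, class Src>
std::vector<Dst> demo_copy_offsets(const std::vector<Src>& src) {
  std::vector<Dst> dst(src.size());
  for (std::size_t i = 0; i < src.size(); ++i) {
    // Reject any offset that cannot be represented in the destination type.
    if (src[i] > static_cast<Src>(std::numeric_limits<Dst>::max())) {
      throw std::overflow_error("offset does not fit in destination type");
    }
    dst[i] = static_cast<Dst>(src[i]);
  }
  return dst;
}
```

For example, `demo_copy_offsets<int>(rowptr)` with a `std::vector<std::size_t> rowptr` yields `int` offsets, or throws if any entry exceeds the largest `int`.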

@@ -101,7 +143,7 @@ void struct_matvec(const int stencil_type,
 const AMatrix& A,
 const XVector& x,
 typename YVector::const_value_type& beta,
-const YVector& y,
+YVector& y,
Contributor

Why not const? It's a Kokkos::View.

Contributor Author

I could bring back the const. The intent was to show that this view could be modified, although it is really its data that is being modified.
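
The point about const on a view-like handle can be shown with a small stand-in class. `demo_view` and `demo_scale` are hypothetical and only mimic the shallow-const behavior being discussed; they are not Kokkos::View.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a reference-counted view: const-qualifying the
// handle does not make the data it refers to const.
template <class T>
class demo_view {
 public:
  explicit demo_view(std::size_t n)
      : data_(new T[n](), std::default_delete<T[]>()), n_(n) {}

  // operator() is a const member, yet returns a mutable reference.
  T& operator()(std::size_t i) const { return data_.get()[i]; }

  std::size_t extent() const { return n_; }

 private:
  std::shared_ptr<T> data_;
  std::size_t n_;
};

// Takes the handle by const reference and still writes through it.
inline void demo_scale(const demo_view<double>& y, double beta) {
  for (std::size_t i = 0; i < y.extent(); ++i) y(i) *= beta;
}
```

Here `demo_scale` compiles and modifies the entries even though `y` is a const reference, which is why the const in the signature says nothing about whether the output data changes.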

Acolind_cusparse[idx] = Acolind[idx];
});
cusparse_int_type Arowptr_cusparse("Arowptr", Arowptr.extent(0));
cusparse_int_type Acolind_cusparse("Acolind", Acolind.extent(0));
Contributor

You shouldn't need to allocate column indices unless they don't have type int.

Contributor Author

That's correct, but since this is a test I did not want to write too much logic around the type of the ordinal.
It would make the test a bit faster, but this part is not timed and we already know that the cusparse interface is changing in future versions. Ideally this test should be rewritten down the road for the new interface, which will make this a bit cleaner...

// using ordinal_type = typename AMatrix::non_const_ordinal_type;
using value_type = typename AMatrix::non_const_value_type;

#if defined(CUSPARSE_VERSION) && (10300 <= CUSPARSE_VERSION)
Contributor

I thought it was CUDA 10.1 that added the new interface.

Contributor

I think it's a newer "patch" update to 10.1 that added the new interface, 10.1.patch, but is not included in all versions of 10.1 (in particular not the version we have a sems module for nightly testing)

Contributor Author

It's a bit complicated, but the bottom line is that cuda and cusparse do not necessarily have the same version numbering.

| Release number | CUDA_VERSION | CUSPARSE_VERSION |
|---|---|---|
| 10.1.168 | 10100 | undef |
| 10.1.243 | 10100 | 10300 |

Note that in release 10.1.168, cusparse does define the following macros:

#define CUSPARSE_VER_MAJOR 10
#define CUSPARSE_VER_MINOR 2
#define CUSPARSE_VER_PATCH 0
#define CUSPARSE_VER_BUILD 0

whereas in 10.1.243 it defines:

#define CUSPARSE_VER_MAJOR 10
#define CUSPARSE_VER_MINOR 3
#define CUSPARSE_VER_PATCH 0
#define CUSPARSE_VER_BUILD 243
#define CUSPARSE_VERSION (CUSPARSE_VER_MAJOR * 1000 + \
                          CUSPARSE_VER_MINOR *  100 + \
                          CUSPARSE_VER_PATCH)
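
To make the arithmetic concrete: with major 10, minor 3 and patch 0, the formula gives 10 * 1000 + 3 * 100 + 0 = 10300, which is exactly the threshold used by the guard in the diff. The sketch below reproduces this with hypothetical `DEMO_` macros so that it compiles without the cuSPARSE headers.

```cpp
// Hypothetical DEMO_ macros mirroring the formula quoted above, using the
// values reported for release 10.1.243.
#define DEMO_VER_MAJOR 10
#define DEMO_VER_MINOR 3
#define DEMO_VER_PATCH 0
#define DEMO_VERSION \
  (DEMO_VER_MAJOR * 1000 + DEMO_VER_MINOR * 100 + DEMO_VER_PATCH)

// Same shape as the guard in the diff: take the newer code path only when the
// version macro exists and is at least 10300.
#if defined(DEMO_VERSION) && (10300 <= DEMO_VERSION)
constexpr bool demo_use_new_interface = true;
#else
constexpr bool demo_use_new_interface = false;
#endif

static_assert(DEMO_VERSION == 10300, "10*1000 + 3*100 + 0");
```

The `defined(...)` half of the condition matters for a release like 10.1.168, where the version macro is reported as undefined and the older code path has to be taken.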

/* Initialize cusparse */
cusparseStatus_t cusparseStatus;
cusparseHandle_t cusparseHandle=0;
cusparseStatus = cusparseCreate(&cusparseHandle);
Contributor

We should time this -- it could be expensive to create this handle.
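
A generic host-side timer is enough for a first measurement. This is a sketch only: `demo_time_seconds` is a hypothetical helper, and in a real measurement the callable would wrap the `cusparseCreate(&cusparseHandle)` call shown above.

```cpp
#include <chrono>

// Hypothetical helper: run a callable once and return the elapsed wall-clock
// time in seconds, measured with a monotonic clock.
template <class F>
double demo_time_seconds(F&& f) {
  const auto start = std::chrono::steady_clock::now();
  f();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

Timing the first and a later handle creation separately would likely be informative, on the assumption that one-time library initialization is charged to the first call.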

if (std::is_same<value_type,float>::value) {
cusparseStatus = cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
A.numRows(), A.numCols(), A.nnz(),
(const float *) &alpha, descrA,
Contributor

Prefer reinterpret_cast to C-style casts.
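
A sketch of the suggested form, using a stand-in function instead of `cusparseScsrmv`. The names `demo_c_api_float` and `demo_call` are hypothetical. The cast is needed at all because both branches of a runtime `if` are compiled for every scalar type, even though only the matching branch executes.

```cpp
#include <type_traits>

// Stand-in for a C API entry point that only accepts float scalars
// (assumption: plays the role of the alpha argument of the real routine).
inline float demo_c_api_float(const float* alpha) { return *alpha * 2.0f; }

template <class Scalar>
float demo_call(const Scalar& alpha) {
  if (std::is_same<Scalar, float>::value) {
    // This line must compile for every Scalar, hence the cast. It only runs
    // when Scalar is float, so no object is ever read through a pointer of
    // the wrong type.
    return demo_c_api_float(reinterpret_cast<const float*>(&alpha));
  }
  return 0.0f;  // other scalar types would dispatch elsewhere
}
```

Compared with `(const float *) &alpha`, the named cast is easier to search for and cannot silently remove a const qualifier.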

@ndellingwood
Contributor

@mhoemmen do you request your comments be addressed prior to merge or as a follow up PR?

@mhoemmen
Contributor

mhoemmen commented Mar 6, 2020

@ndellingwood I'm good with this -- Luc and I talked earlier this week. Sorry to delay you!

Contributor

@srajama1 srajama1 left a comment

@lucbv: One comment. It would be nice to see how expensive handle creation is. As you pointed out, it is good to make small PRs and keep improvements incremental. Can we also update the documentation? I assume we have an issue that tracks this so @ndellingwood can document this in 3.1.

@ndellingwood
Contributor

> I assume we have an issue that tracks

@srajama1 yes, we missed adding this earlier, cross-reference issue #614

@ndellingwood
Contributor

> I'm good with this

@mhoemmen just wanted to check; your feedback is always valuable and I didn't want to bypass anything overlooked that needs to be addressed immediately.

> It would be nice to see how expensive handle creation is

@srajama1 good call, this would be useful. I'm going to go ahead with the merge (I think this is safe based on your follow-up comment); this will get it through a round of nightlies (we have TPLs in the nightly tests now :)

@ndellingwood ndellingwood merged commit 73aa853 into kokkos:develop Mar 6, 2020
@lucbv lucbv deleted the Spmv_cusparse_tpl branch March 6, 2020 17:32