TpetraTSQR: Run on NVIDIA GPUs; ensure correct testing; refactor #6488

mhoemmen · 2019-12-21T06:25:05Z

@trilinos/tpetra @trilinos/belos @trilinos/anasazi @iyamazaki @ndellingwood

Motivation

Add TSQR back-end that uses cuSOLVER where appropriate to run on NVIDIA GPUs.
Use LAPACK's BLAS 3 optimizations for matrices with larger numbers of columns.
Make sure that tests are actually testing what we want to test.
Refactor implementation to simplify integration with Kokkos.
Remove dead code: e.g., the TBB TSQR implementation (that hasn't been tested since ~2010).

Stakeholder Feedback

This is mostly funded by the Pressio project, which specifically requested GPU support and optimizations for matrices with more columns. @fnrizzi may wish to comment. ECP ExaWind may also have an interest in TSQR for CA-GMRES. @iyamazaki may wish to comment.

Context

I first added this TSQR implementation to Trilinos in 2010. One learns something after a decade. Trilinos has also changed a lot, in particular due to the introduction of Kokkos (what Trilinos developers would call Kokkos >= 2.0; 1.0 was Chris Baker's 2009 project, and 0.0 was a library of computational kernels circa 2004-5). The lack of Kokkos at the time explains the odd mix of custom matrix views and raw pointers in the interface. I haven't tried to fix all that here.

"Virtualize" TSQR::Tsqr's use of NodeTsqr subclasses.

1. Tsqr is no longer templated on the concrete NodeTsqr subclass type.
2. NodeTsqr no longer has a FactorOutput template parameter.
3. Every NodeTsqr subclass's "FactorOutput" type (returned by factor) now inherits from a base class. NodeTsqr subclasses' implementations of apply etc. must now dynamic_cast from that base class to their concrete "FactorOutput" type.
4. Change the Epetra, Tpetra, and Stokhos specializations of TsqrAdaptor to remove the third template argument of TSQR::Tsqr.

(3) above breaks TbbTsqr, but we haven't tested that for nearly a decade so it may not work anyway. The goal is to support subclasses of NodeTsqr that use TPLs like cuSOLVER. In order to do that, we need to protect downstream code from TPL includes. This means virtualizing both all use of NodeTsqr and the return type of NodeTsqr::factor.

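For readers who want a picture of the "virtualized" factor output, here is a minimal sketch. Every name in it (FactorOutputBase, NodeTsqrSketch, ScalingNodeTsqr, ForeignOutput) is hypothetical and the "factorization" is a placeholder; it only illustrates the pattern of returning a base-class pointer from factor and using dynamic_cast in apply, not the actual TSQR interface.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical names throughout; this is not the actual TSQR interface.

// Base class for whatever each NodeTsqr subclass returns from factor().
struct FactorOutputBase {
  virtual ~FactorOutputBase() = default;
};

// Abstract intranode interface. Downstream code would include only this,
// so it never sees headers from a TPL that a back end might use.
class NodeTsqrSketch {
public:
  virtual ~NodeTsqrSketch() = default;
  virtual std::unique_ptr<FactorOutputBase> factor(std::vector<double>& A) = 0;
  virtual double apply(const FactorOutputBase& out, double x) const = 0;
};

// One concrete back end with its own concrete factor output.
// The "factorization" here is a placeholder that records the input size.
class ScalingNodeTsqr : public NodeTsqrSketch {
  struct Output : FactorOutputBase {
    double scale = 1.0;
  };

public:
  std::unique_ptr<FactorOutputBase> factor(std::vector<double>& A) override {
    auto out = std::make_unique<Output>();
    out->scale = static_cast<double>(A.size());
    return out;
  }

  double apply(const FactorOutputBase& out, double x) const override {
    // Recover the concrete type; reject output from a different back end.
    const auto* mine = dynamic_cast<const Output*>(&out);
    if (mine == nullptr) {
      throw std::invalid_argument("FactorOutput came from a different NodeTsqr");
    }
    return mine->scale * x;
  }
};

// A second, unrelated output type, used only to show the failure path.
struct ForeignOutput : FactorOutputBase {};
```

The cost of this design is one dynamic_cast per apply call, which is negligible next to the work of applying a Q factor. Passing a ForeignOutput to ScalingNodeTsqr::apply throws instead of silently misinterpreting the data.
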
Tpetra has deprecated and will remove Node types. Help speed this process by referring directly to device_type etc. instead of node_type.

NodeTsqrFactory now is actually a factory: it has a static getNodeTsqr method that uses run-time information (the Kokkos execution space's concurrency()) to decide what NodeTsqr subclass type to return. There are two goals:

1. Use KokkosNodeTsqr where possible, for CPU thread parallelism.
2. Later, enable use of a cuSOLVER-based NodeTsqr implementation.
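
A minimal sketch of the run-time selection described above. The class names are hypothetical stand-ins, the concurrency value is passed as a plain int rather than queried from a Kokkos execution space, and the "more than one thread" threshold is an assumption made for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-ins for the NodeTsqr hierarchy; not the Trilinos classes.
struct NodeTsqrBase {
  virtual ~NodeTsqrBase() = default;
  virtual std::string name() const = 0;
};

struct SequentialSketch : NodeTsqrBase {
  std::string name() const override { return "SequentialTsqr"; }
};

struct KokkosSketch : NodeTsqrBase {
  std::string name() const override { return "KokkosNodeTsqr"; }
};

struct NodeTsqrFactorySketch {
  // 'concurrency' stands in for the execution space's concurrency().
  // Assumed rule: use the thread-parallel back end only if there is
  // more than one thread to use.
  static std::unique_ptr<NodeTsqrBase> getNodeTsqr(int concurrency) {
    if (concurrency > 1) {
      return std::make_unique<KokkosSketch>();
    }
    return std::make_unique<SequentialSketch>();
  }
};
```

Under this assumed rule, getNodeTsqr(1) returns the sequential back end and getNodeTsqr(8) returns the thread-parallel one. Because callers hold only a base-class pointer, a GPU back end could be added later without changing them.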

1. Make "Full" TSQR test initialize and finalize Kokkos. 2. Add more debug printing to ensure that the NodeTsqr subclass type is actually KokkosNodeTsqr when that's appropriate.

NOTE: CMakeLists.txt currently sets --alwaysUseSequentialTsqr --noTestComplex. Otherwise, the test won't pass.

Even with --alwaysUseSequentialTsqr, the test FAILS with Scalar=complex<{float,double}>, even when using SequentialTsqr. (You can exercise this without changing CMakeLists.txt or command-line arguments by setting OMP_NUM_THREADS=1.) (It passes for ALL Scalar types with 100 rows and 5 columns.) This is why we set --noTestComplex by default in CMakeLists.txt.

Without --alwaysUseSequentialTsqr, the test FAILS with ALL Scalar types when using KokkosNodeTsqr with number of rows = 10000 and number of columns = 5. (It passes for ALL Scalar types with 100 rows and 5 columns.) This is why we set --alwaysUseSequentialTsqr by default in CMakeLists.txt.

We aim to fix these issues.

Making the methods of Combine nonconst makes it correct for Combine to use CombineDefault as the type of impl_. (We discovered this issue by changing the Combine test to exercise multiple impl_ types.) This requires changes to KokkosNodeTsqr. Those changes are not related to thread safety, though, since each operator() invocation for both the factor and apply kernels creates a separate Combine instance.

The Combine test now exercises both CombineDefault and CombineNative. Before, it was only exercising CombineNative.

CombineNodeTsqr just uses Combine. Make NodeTsqrFactory return CombineNodeTsqr in the complex case. I had hopes that this would fix the complex case of the full TSQR test, but it doesn't. I plan to do the following: 1. Add a separate test for CombineNodeTsqr (it's interesting that the rank-revealing part of the full TSQR test failed -- it reported a rank of 0), and 2. Improve the DistTsqr test.

Also add --NodeTsqr command-line argument to full test, and get rid of that test's --alwaysUseSequentialTsqr option (in favor of setting --NodeTsqr=SequentialTsqr).

It's currently just an executable; it doesn't get run with ctest yet.

Improve NodeTsqr test output in other ways as well.

CombineNodeTsqr::factor was not copying the R factor out of the factored matrix A. This is why the R factor was showing up as all zeros.
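
To illustrate the kind of copy that was missing, here is a sketch under stated assumptions: plain column-major storage with explicit leading dimensions, real double entries, and a hypothetical helper name (copy_upper_triangle). The actual fix in CombineNodeTsqr works on the library's own view types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy the upper triangle of the leading ncols x ncols block of a factored,
// column-major matrix A (leading dimension lda) into R (leading dimension ldr).
// Entries of R below the diagonal are set to zero, because after an in-place
// QR factorization that part of A holds the reflectors, not part of R.
void copy_upper_triangle(std::size_t ncols,
                         const double* A, std::size_t lda,
                         double* R, std::size_t ldr) {
  for (std::size_t j = 0; j < ncols; ++j) {
    for (std::size_t i = 0; i < ncols; ++i) {
      R[i + j * ldr] = (i <= j) ? A[i + j * lda] : 0.0;
    }
  }
}
```

If this step is skipped, R keeps whatever the caller initialized it with, which would be consistent with the all-zeros R factor seen in the test.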

Now the test can actually fail; we tested this.
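
A sketch of what an accuracy check with bounds can look like. The two error measures are the standard ones for a QR factorization. The function names and the specific bound (a small multiple of machine epsilon scaled by the matrix dimensions) are assumptions for illustration; the bound used by the real test may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// All matrices are column-major with no padding: A and Q are m x n, R is n x n.
struct QrErrors {
  double residual;       // ||A - Q*R||_F / ||A||_F
  double orthogonality;  // ||I - Q^T*Q||_F
};

// Assumes A is nonzero, so the relative residual is well defined.
QrErrors qr_errors(std::size_t m, std::size_t n,
                   const std::vector<double>& A,
                   const std::vector<double>& Q,
                   const std::vector<double>& R) {
  double resid2 = 0.0, normA2 = 0.0, orth2 = 0.0;
  for (std::size_t j = 0; j < n; ++j) {
    for (std::size_t i = 0; i < m; ++i) {
      double qr = 0.0;
      for (std::size_t k = 0; k < n; ++k) qr += Q[i + k * m] * R[k + j * n];
      const double d = A[i + j * m] - qr;
      resid2 += d * d;
      normA2 += A[i + j * m] * A[i + j * m];
    }
    for (std::size_t i = 0; i < n; ++i) {
      double qtq = 0.0;
      for (std::size_t k = 0; k < m; ++k) qtq += Q[k + i * m] * Q[k + j * m];
      const double d = (i == j ? 1.0 : 0.0) - qtq;
      orth2 += d * d;
    }
  }
  return {std::sqrt(resid2) / std::sqrt(normA2), std::sqrt(orth2)};
}

// Hypothetical bound; chosen only so that the check can fail.
bool qr_is_accurate(std::size_t m, std::size_t n, const QrErrors& e) {
  const double bound = 10.0 * static_cast<double>(m * n) *
                       std::numeric_limits<double>::epsilon();
  return e.residual <= bound && e.orthogonality <= bound;
}
```

With an exact factorization the check passes. With an all-zeros R the relative residual is 1, so the check fails, which is the behavior a test needs in order to catch a bug like the missing R copy.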

1. Remove SequentialTsqr-specific test executable. 2. Fix minor issue in generic NodeTsqr test when using contiguous cache blocks.

KokkosNodeTsqr isn't working quite yet, so I changed NodeTsqrFactory so that KokkosNodeTsqr is never the default NodeTsqr type. Users can still request KokkosNodeTsqr explicitly by name. This change made it possible for me to remove a work-around in the full TSQR tests, since the default NodeTsqr type is now correct for all Scalar and Device types.

1. Make Combine::apply_pair take MatView instead of raw pointers. 2. Remove all TbbTsqr-related files and code. (2) is related to (1); we don't have testing for TbbTsqr any more so there's no way to test whether any changes we made to Combine's interface might have broken TbbTsqr.

Make Combine::apply_pair take MatView instead of raw pointers. This completes "MatView-ization" of Combine and its implementations. That in turn serves our end goal of letting us use TPLs like cuSOLVER for the intraprocess part of TSQR.

Use partition_2x1 instead of assuming column-major layout, in an implementation detail of SequentialTsqr.
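
A sketch of why partitioning through the view is safer than hand-written pointer arithmetic. ViewSketch and partition_2x1_sketch are hypothetical and simplified; they are not the TSQR MatView interface or the real partition_2x1 signature. The explicit row and column strides are an assumption made to show layout independence.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, simplified matrix view; not the TSQR::MatView interface.
// Entry (i, j) lives at ptr[i * row_stride + j * col_stride], so the same
// code handles column-major, row-major, or padded storage.
struct ViewSketch {
  double* ptr;
  std::size_t nrows, ncols;
  std::size_t row_stride, col_stride;

  double& operator()(std::size_t i, std::size_t j) const {
    return ptr[i * row_stride + j * col_stride];
  }
};

// Split a view into a top block with nrows_top rows and a bottom block with
// the remaining rows. Callers never compute offsets from the layout themselves.
std::pair<ViewSketch, ViewSketch>
partition_2x1_sketch(const ViewSketch& A, std::size_t nrows_top) {
  assert(nrows_top <= A.nrows);
  ViewSketch top{A.ptr, nrows_top, A.ncols, A.row_stride, A.col_stride};
  ViewSketch bottom{A.ptr + nrows_top * A.row_stride, A.nrows - nrows_top,
                    A.ncols, A.row_stride, A.col_stride};
  return {top, bottom};
}
```

Code that uses only the returned views gives the same answers for a column-major and a row-major copy of the same matrix, which is the property that matters if the intraprocess part is later backed by a different storage scheme.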

mhoemmen · 2019-12-23T17:35:28Z

@alanw0 wrote:

In stk/sierra we try to make sure that each commit is self-sufficient, meaning it builds and passes tests so that we preserve the ability to bisect. I can't tell if these commits maintain that, or if that is a goal here.

That was absolutely the goal. Almost all commits have that property at least on the computer on which I was testing. That's the main reason why there are so many commits -- I wanted to move always from a state of "passing tests" to another state of "passing tests."

kddevin

We'll eventually want to use KokkosKernels for all cublas-like operations, rather than call cublas directly. Newer platforms will not have cublas, and responsibility for on-node operations should lie in KokkosKernels. We'll make conversion to KokkosKernels a goal for later in the FY.

mhoemmen · 2019-12-23T17:40:02Z

@kddevin wrote:

We'll eventually want to use KokkosKernels for all cublas-like operations, rather than call cublas directly. Newer platforms will not have cublas, and responsibility for on-node operations should lie in KokkosKernels. We'll make conversion to KokkosKernels a goal for later in the FY.

I'm just waiting for a complete kokkos-kernels BLAS. I'll be happy to switch once it's ready and I know that it's actually calling TPLs.

mhoemmen · 2019-12-23T17:40:41Z

@kddevin Also, the point of these changes is cuSOLVER, not cuBLAS. Not all platforms will necessarily have an accelerator-based LAPACK.

trilinos-autotester · 2019-12-23T20:52:00Z

Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED

Pull Request Auto Testing has PASSED

Build Information

Test Name: Trilinos_pullrequest_gcc_4.8.4

Build Num: 5258
Status: PASSED

Jenkins Parameters

Parameter Name	Value
COMPILER_MODULE	sems-gcc/4.8.4
JENKINS_BUILD_TYPE	Release
JENKINS_COMM_TYPE	MPI
JENKINS_DO_COMPLEX	OFF
JENKINS_JOB_TYPE	Experimental
MPI_MODULE	sems-openmpi/1.8.7
PULLREQUESTNUM	6488
TEST_REPO_ALIAS	TRILINOS
TRILINOS_SOURCE_BRANCH	TSQR-Dec2019
TRILINOS_SOURCE_REPO	https://github.com/mhoemmen/Trilinos
TRILINOS_SOURCE_SHA	`842320a`
TRILINOS_TARGET_BRANCH	develop
TRILINOS_TARGET_REPO	https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA	`c8e4314`

Build Information

Test Name: Trilinos_pullrequest_intel_17.0.1

Build Num: 5083
Status: PASSED

Jenkins Parameters

Parameter Name	Value
PULLREQUESTNUM	6488
TEST_REPO_ALIAS	TRILINOS
TRILINOS_SOURCE_BRANCH	TSQR-Dec2019
TRILINOS_SOURCE_REPO	https://github.com/mhoemmen/Trilinos
TRILINOS_SOURCE_SHA	`842320a`
TRILINOS_TARGET_BRANCH	develop
TRILINOS_TARGET_REPO	https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA	`c8e4314`

Build Information

Test Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL

Build Num: 3512
Status: PASSED

Jenkins Parameters

Parameter Name	Value
PULLREQUESTNUM	6488
TEST_REPO_ALIAS	TRILINOS
TRILINOS_SOURCE_BRANCH	TSQR-Dec2019
TRILINOS_SOURCE_REPO	https://github.com/mhoemmen/Trilinos
TRILINOS_SOURCE_SHA	`842320a`
TRILINOS_TARGET_BRANCH	develop
TRILINOS_TARGET_REPO	https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA	`c8e4314`

Build Information

Test Name: Trilinos_pullrequest_gcc_7.2.0

Build Num: 3360
Status: PASSED

Jenkins Parameters

Parameter Name	Value
PULLREQUESTNUM	6488
TEST_REPO_ALIAS	TRILINOS
TRILINOS_SOURCE_BRANCH	TSQR-Dec2019
TRILINOS_SOURCE_REPO	https://github.com/mhoemmen/Trilinos
TRILINOS_SOURCE_SHA	`842320a`
TRILINOS_TARGET_BRANCH	develop
TRILINOS_TARGET_REPO	https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA	`c8e4314`

Build Information

Test Name: Trilinos_pullrequest_cuda_9.2

Build Num: 2914
Status: PASSED

Jenkins Parameters

Parameter Name	Value
JENKINS_JOB_TYPE	Experimental
PULLREQUESTNUM	6488
TEST_REPO_ALIAS	TRILINOS
TRILINOS_SOURCE_BRANCH	TSQR-Dec2019
TRILINOS_SOURCE_REPO	https://github.com/mhoemmen/Trilinos
TRILINOS_SOURCE_SHA	`842320a`
TRILINOS_TARGET_BRANCH	develop
TRILINOS_TARGET_REPO	https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA	`c8e4314`

Build Information

Test Name: Trilinos_pullrequest_python_2

Build Num: 1245
Status: PASSED

Jenkins Parameters

Parameter Name	Value
PULLREQUESTNUM	6488
TEST_REPO_ALIAS	TRILINOS
TRILINOS_SOURCE_BRANCH	TSQR-Dec2019
TRILINOS_SOURCE_REPO	https://github.com/mhoemmen/Trilinos
TRILINOS_SOURCE_SHA	`842320a`
TRILINOS_TARGET_BRANCH	develop
TRILINOS_TARGET_REPO	https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA	`c8e4314`

Build Information

Test Name: Trilinos_pullrequest_python_3

Build Num: 1245
Status: PASSED

Jenkins Parameters

Parameter Name	Value
PULLREQUESTNUM	6488
TEST_REPO_ALIAS	TRILINOS
TRILINOS_SOURCE_BRANCH	TSQR-Dec2019
TRILINOS_SOURCE_REPO	https://github.com/mhoemmen/Trilinos
TRILINOS_SOURCE_SHA	`842320a`
TRILINOS_TARGET_BRANCH	develop
TRILINOS_TARGET_REPO	https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA	`c8e4314`

CDash Test Results for PR# 6488.

trilinos-autotester · 2019-12-23T20:52:19Z

Status Flag 'Pre-Merge Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED AND APPROVED by [ alanw0 ]!

trilinos-autotester · 2019-12-23T20:52:27Z

Status Flag 'Pull Request AutoTester' - Pull Request will be Automerged

trilinos-autotester · 2019-12-23T20:52:31Z

Merge on Pull Request# 6488: IS A SUCCESS - Pull Request successfully merged

mhoemmen · 2019-12-23T22:03:17Z

Thanks all! :-D @kddevin btw I figured out how to make TriBITS automatically detect CUBLAS and CUSOLVER -- by using TRIBITS_TPL_TENTATIVELY_ENABLE(CUBLAS) etc. in packages/tpetra/tsqr/cmake/Dependencies.cmake. I'll put up a new PR with that change soon.

srajama1 · 2020-01-03T19:06:07Z

@kddevin wrote:

We'll eventually want to use KokkosKernels for all cublas-like operations, rather than call cublas directly. Newer platforms will not have cublas, and responsibility for on-node operations should lie in KokkosKernels. We'll make conversion to KokkosKernels a goal for later in the FY.

@mhoemmen wrote:

I'm just waiting for a complete kokkos-kernels BLAS. I'll be happy to switch once it's ready and I know that it's actually calling TPLs.

The BLAS/LAPACK calls are added on a need-only basis. We are happy to add the calls needed by Tpetra. Please file a GitHub issue with the calls needed.

kddevin · 2020-01-10T23:23:36Z

@mhoemmen In the TSQR code, I see the following LAPACK calls:

UNGQR (in compute_explicit_Q_lwork and compute_explicit_Q)
GEQRF (in compute_QR_lwork and compute_QR)
UNMQR (in apply_Q_factor_lwork and apply_Q_factor)
LARFG (in factor_inner and factor_pair)
GESVD (in reveal_R_rank)
POTRF (in factor)
LARNV (in fill_buffer)

Are all of these needed in KokkosKernels? I see only the following reimplemented for CUDA in Tsqr_Impl_CuSolver.cpp:

apply_Q_factor, apply_Q_factor_lwork
compute_QR, compute_QR_lwork
compute_explicit_Q, compute_explicit_Q_lwork

Does that mean we would need to request only the following from KokkosKernels?

UNGQR (in compute_explicit_Q_lwork and compute_explicit_Q)
GEQRF (in compute_QR_lwork and compute_QR)
UNMQR (in apply_Q_factor_lwork and apply_Q_factor)

Thanks @mhoemmen .

kddevin · 2020-01-10T23:28:16Z

Continuing to @mhoemmen
Or is the NodeTsqr class the more appropriate level of abstraction for KokkosKernels (with CuSolverNodeTsqr being one implementation)? That is, does NodeTsqr fit more appropriately into KokkosKernels than Tpetra? Thanks!

mhoemmen · 2020-01-13T17:20:59Z

@kddevin Are LAPACK functions actually in-scope for kokkos-kernels? It seemed a bit too much to ask for LAPACK when there were BLAS functions left to implement.

kddevin · 2020-01-13T18:49:40Z

Given the name of the class (NodeTsqr), should the capability be in KokkosKernels rather than in Tpetra? The comments say the TSQR implementation is not Tpetra-specific and, indeed, no Tpetra data structures are used in the implementation. Probably the entire capability should be moved into KokkosKernels or one of the solver packages (Belos, ShyLu, etc.).

srajama1 · 2020-01-13T19:55:39Z

@mhoemmen wrote:

@kddevin Are LAPACK functions actually in-scope for kokkos-kernels? It seemed a bit too much to ask for LAPACK when there were BLAS functions left to implement.

Yes, we do support some LAPACK functionality at the team level. It is driven by user requests. It is going to take a long time to achieve complete support.

Sorry, @mhoemmen I edited your comment by mistake instead of quoting it. Fixed it back.

kddevin · 2020-01-13T20:10:11Z

@srajama1 Is a NodeTsqr appropriate for KokkosKernels or a better fit for something like ShyLu?

mhoemmen · 2020-01-13T21:19:56Z

@kddevin I don't really care where things live, as long as they get enabled and tested by default. There's a risk in putting stuff too far downstream in packages like ShyLU that historically had (have?) components that weren't enabled or tested by default. Also, TSQR is out of scope for ShyLU.

srajama1 · 2020-01-13T21:32:34Z

@kddevin I don't have a particular preference for where TSQR lives. That is primarily the decision of the developer, all else being equal. However, ShyLU is not the right place (ShyLU_Node is mainly sparse factorizations and solvers needed by DD methods). Kokkos Kernels might be OK, Belos might be OK. Essentially, the questions are:

Where are we going to put other variations like CholQR?
Where would a user logically go looking for this kernel?
What dependencies need to be satisfied?

@iyamazaki might have some thoughts as well.

mhoemmen · 2020-01-14T00:20:46Z

TSQR depends on MPI as well as Kokkos. This makes it out of scope for kokkos-kernels.

srajama1 · 2020-01-15T03:37:23Z

That leaves Belos or Tpetra then. Maybe a silly question: if it depends on MPI, why is it called NodeTSQR? I guess TSQR depends on MPI and NodeTSQR doesn't. I assumed Karen was asking about NodeTSQR.

kddevin · 2020-01-15T19:55:08Z

There's NodeTSQR and DistTSQR. Probably doesn't make sense to separate them. I will file a KokkosKernels request for the needed support.

kddevin · 2020-01-15T21:36:15Z

kokkos/kokkos-kernels#567
Requested KokkosKernels support for cuSOLVER calls.

Mark Hoemmen added 30 commits December 19, 2019 12:46

TSQR: Remove more use of Node from TSQR adapters

05ef854

TSQR: Remove dead method prepareNodeTsqr

66f3cca

TSQR: Remove dead code from NodeTsqrFactory

91ae70a

TSQR::NodeTsqrFactory no longer refers to node_type

6da8b0c

Tpetra has deprecated and will remove Node types. Help speed this process by referring directly to device_type etc. instead of node_type.

TSQR: Make sure "full" test exercises KokkosNodeTsqr

157fe6b

1. Make "Full" TSQR test initialize and finalize Kokkos. 2. Add more debug printing to ensure that the NodeTsqr subclass type is actually KokkosNodeTsqr when that's appropriate.

TSQR: Improve Combine test; add more test cases

0071167

TSQR: Make Combine test exercise both Combine implementations

70b27aa

The Combine test now exercises both CombineDefault and CombineNative. Before, it was only exercising CombineNative.

TSQR: Fix Clang build warnings in {KokkosNode,Sequential}Tsqr

d8752dc

TSQR: In full test, check if R factor is all zeros

517da93

TSQR::NodeTsqrFactory: Let user specify NodeTsqr type as string

c4fe0bd

Also add --NodeTsqr command-line argument to full test, and get rid of that test's --alwaysUseSequentialTsqr option (in favor of setting --NodeTsqr=SequentialTsqr).

TSQR: Add generic NodeTsqr test

aa92092

It's currently just an executable; it doesn't get run with ctest yet.

TSQR: Minor changes to generic NodeTsqr test

bd40849

TSQR: Make Combine test output consistent with NodeTsqr test

d6689cf

TSQR: Fix saveMatrices option in generic NodeTsqr test

d88d48f

Improve NodeTsqr test output in other ways as well.

TSQR::CombineNodeTsqr::factor: Fix bug

d2688e8

CombineNodeTsqr::factor was not copying the R factor out of the factored matrix A. This is why the R factor was showing up as all zeros.

TSQR::SequentialTsqr: Minor fix (not affecting tests)

a83f5d9

TSQR: Add accuracy bounds to generic NodeTsqr test

f5dc094

Now the test can actually fail; we tested this.

TSQR: Use generic NodeTsqr test to test SequentialTsqr

3cf037c

1. Remove SequentialTsqr-specific test executable. 2. Fix minor issue in generic NodeTsqr test when using contiguous cache blocks.

TSQR: Remove redundant test files

cecc610

TSQR::CombineNative: Add type aliases to make code more readable

7fa8a66

TSQR: Purge any leftover TBB- or TbbTsqr-related code

ea36c1f

TSQR: Change Combine::apply_inner to take MatView

f98fb80

Make Combine::apply_pair take MatView instead of raw pointers. This completes "MatView-ization" of Combine and its implementations. That in turn serves our end goal of letting us use TPLs like cuSOLVER for the intraprocess part of TSQR.

TSQR::SequentialTsqr: Remove dependency on MatView layout

b1bf6ab

Use partition_2x1 instead of assuming column-major layout, in an implementation detail of SequentialTsqr.

kddevin reviewed Dec 23, 2019

trilinos-autotester removed the AT: RETEST label Dec 23, 2019

trilinos-autotester merged commit 9c0d1c6 into trilinos:develop Dec 23, 2019

trilinos-autotester removed the AT: AUTOMERGE label Dec 23, 2019

mhoemmen deleted the TSQR-Dec2019 branch December 23, 2019 21:17

mhoemmen mentioned this pull request Dec 24, 2019

TSQR: Automatically detect CUBLAS & CUSOLVER TPLs; improve TPL handle wrappers #6496

Closed

bartlettroscoe mentioned this pull request Jan 3, 2020

TpetraTSQR link errors in all ATDM Trilinos CUDA+RDC builds starting 2019-12-24 #6517

Closed

srajama1 mentioned this pull request Jan 8, 2020

KokkosKernels: Fix multiple gemm issues with complex type #6472

Merged

mhoemmen mentioned this pull request Jan 14, 2020

TSQR: Automatically detect CUBLAS & CUSOLVER TPLs; improve TPL handle wrappers #6583

Merged
