Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TpetraTSQR: Run on NVIDIA GPUs; ensure correct testing; refactor #6488

Merged
merged 101 commits into from
Dec 23, 2019

Conversation

mhoemmen
Copy link
Contributor

@mhoemmen mhoemmen commented Dec 21, 2019

@trilinos/tpetra @trilinos/belos @trilinos/anasazi @iyamazaki @ndellingwood

Motivation

  1. Add TSQR back-end that uses cuSOLVER where appropriate to run on NVIDIA GPUs.
  2. Use LAPACK's BLAS 3 optimizations for matrices with larger numbers of columns.
  3. Make sure that tests are actually testing what we want to test.
  4. Refactor implementation to simplify integration with Kokkos.
  5. Remove dead code: e.g., the TBB TSQR implementation (that hasn't been tested since ~2010).

Stakeholder Feedback

This is mostly funded by the Pressio project, who specifically requested GPU support and optimizations for matrices with more columns. @fnrizzi may wish to comment. ECP ExaWind may also have an interest in TSQR for CA-GMRES. @iyamazaki may wish to comment.

Context

I first added this TSQR implementation to Trilinos in 2010. One learns something after a decade. Trilinos has also changed a lot, in particular due to the introduction of Kokkos (what Trilinos developers would call Kokkos >= 2.0; 1.0 was Chris Baker's 2009 project, and 0.0 was a library of computational kernels circa 2004-5). The lack of Kokkos at the time explains the odd mix of custom matrix views and raw pointers in the interface. I haven't tried to fix all that here.

Mark Hoemmen added 30 commits December 19, 2019 12:46
"Virtualize" TSQR::Tsqr's use of NodeTsqr subclasses.

  1. Tsqr is now no longer templated on the concrete NodeTsqr subclass
     type.
  2. NodeTsqr no longer has a FactorOutput template parameter.
  3. All NodeTsqr subclass' "FactorOutput" (returned by factor) types
     now inherit from a base class.  NodeTsqr subclasses'
     implementations of apply etc. must now dynamic_cast from that
     base class to their concrete "FactorOutput" type.
  4. Change Epetra, Tpetra, and Stokhos specializations of TsqrAdaptor
     to remove the third template argument of TSQR::Tsqr.

(3) above breaks TbbTsqr, but we haven't tested that for nearly a
decade so it may not work anyway.

The goal is to support subclasses of NodeTsqr that use TPLs like
cuSOLVER.  In order to do that, we need to protect downstream code
from TPL includes.  This means virtualizing both all use of NodeTsqr,
and the return type of NodeTsqr::factor.
Tpetra has deprecated and will remove Node types.  Help speed this
process by referring directly to device_type etc. instead of
node_type.
NodeTsqrFactory now is actually a factory: it has a static getNodeTsqr
method that uses run-time information (the Kokkos execution space's
concurrency()) to decide what NodeTsqr subclass type to return.

There are two goals:

  1. Use KokkosNodeTsqr where possible, for CPU thread parallelism.
  2. Later, to enable use of a cuSOLVER-based NodeTsqr implementation.
1. Make "Full" TSQR test initialize and finalize Kokkos.
2. Add more debug printing to ensure that the NodeTsqr
   subclass type is actually KokkosNodeTsqr when that's
   appropriate.
NOTE: CMakeLists.txt currently sets --alwaysUseSequentialTsqr
--noTestComplex.  Otherwise, the test won't pass.

Even with --alwaysUseSequentialTsqr, the test FAILS with
Scalar=complex<{float,double}>, even when using SequentialTsqr.  (You
can exercise this without changing CMakeLists.txt or command-line
arguments by setting OMP_NUM_THREADS=1.)  (It passes for ALL Scalar
types with 100 rows and 5 columns.)  This is why we set
--noTestComplex by default in CMakeLists.txt.

Without --alwaysUseSequentialTsqr, the test FAILS with ALL Scalar
types when using KokkosNodeTsqr with number of rows = 10000 and number
of columns = 5.  (It passes for ALL Scalar types with 100 rows and 5
columns.)  This is why we set --alwaysUseSequentialTsqr by default in
CMakeLists.txt.

We aim to fix these issues.
Making the methods of Combine nonconst makes it correct for Combine to
use CombineDefault as the type of impl_.  (We discovered this issue by
changing the Combine test to exercise multiple impl_ types.)

This requires changes to KokkosNodeTsqr.  Those changes are not
related to thread safety, though, since each operator() invocation for
both the factor and apply kernels creates a separate Combine instance.
The Combine test now exercises both CombineDefault and CombineNative.
Before, it was only exercising CombineNative.
CombineNodeTsqr just uses Combine.  Make NodeTsqrFactory return
CombineNodeTsqr in the complex case.  I had hopes that this would fix
the complex case of the full TSQR test, but it doesn't.  I plan to do
the following:

  1. Add a separate test for CombineNodeTsqr (it's interesting that
     the rank-revealing part of the full TSQR test failed -- it
     reported a rank of 0), and

  2. Improve the DistTsqr test.
Also add --NodeTsqr command-line argument to full test,
and get rid of that test's --alwaysUseSequentialTsqr option
(in favor of setting --NodeTsqr=SequentialTsqr).
It's currently just an executable; it doesn't get run with ctest yet.
Improve NodeTsqr test output in other ways as well.
CombineNodeTsqr::factor was not copying the R factor out of the
factored matrix A.  This is why the R factor was showing up as
all zeros.
Now the test can actually fail; we tested this.
1. Remove SequentialTsqr-specific test executable.
2. Fix minor issue in generic NodeTsqr test when using
   contiguous cache blocks.
KokkosNodeTsqr isn't working quite yet, so I changed NodeTsqrFactory
so that KokkosNodeTsqr is never the default NodeTsqr type.  Users can
still request KokkosNodeTsqr explicitly by name.

This change made it possible for me to remove a work-around in the
full TSQR tests, since the default NodeTsqr type is now correct for
all Scalar and Device types.
  1. Make Combine::apply_pair take MatView instead of raw pointers.
  2. Remove all TbbTsqr-related files and code.

(2) is related to (1); we don't have testing for TbbTsqr any more
so there's no way to test whether any changes we made to Combine's
interface might have broken TbbTsqr.
Make Combine::apply_pair take MatView instead of raw pointers.
This completes "MatView-ization" of Combine and its implementations.
That in turn serves our end goal of letting us use TPLs like
cuSOLVER for the intraprocess part of TSQR.
Use partition_2x1 instead of assuming column-major layout,
in an implementation detail of SequentialTsqr.
@mhoemmen
Copy link
Contributor Author

@alanw0 wrote:

In stk/sierra we try to make sure that each commit is self-sufficient, meaning it builds and passes tests so that we preserve the ability to bisect. I can't tell if these commits maintain that, or if that is a goal here.

That was absolutely the goal. Almost all commits have that property at least on the computer on which I was testing. That's the main reason why there are so many commits -- I wanted to move always from a state of "passing tests" to another state of "passing tests."

Copy link
Contributor

@kddevin kddevin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll eventually want to use KokkosKernels for all cublas-like operations, rather than call cublas directly. Newer platforms will not have cublas, and responsibility for on-node operations should lie in KokkosKernels. We'll make conversion to KokkosKernels a goal for later in the FY.

@mhoemmen
Copy link
Contributor Author

@kddevin wrote:

We'll eventually want to use KokkosKernels for all cublas-like operations, rather than call cublas directly. Newer platforms will not have cublas, and responsibility for on-node operations should lie in KokkosKernels. We'll make conversion to KokkosKernels a goal for later in the FY.

I'm just waiting for a complete kokkos-kernels BLAS. I'll be happy to switch once it's ready and I know that it's actually calling TPLs.

@mhoemmen
Copy link
Contributor Author

@kddevin Also, the point of these changes is cuSOLVER, not cuBLAS. Not all platforms will necessarily have an accelerator-based LAPACK.

@trilinos-autotester
Copy link
Contributor

Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED

Pull Request Auto Testing has PASSED (click to expand)

Build Information

Test Name: Trilinos_pullrequest_gcc_4.8.4

  • Build Num: 5258
  • Status: PASSED

Jenkins Parameters

Parameter Name Value
COMPILER_MODULE sems-gcc/4.8.4
JENKINS_BUILD_TYPE Release
JENKINS_COMM_TYPE MPI
JENKINS_DO_COMPLEX OFF
JENKINS_JOB_TYPE Experimental
MPI_MODULE sems-openmpi/1.8.7
PULLREQUESTNUM 6488
TEST_REPO_ALIAS TRILINOS
TRILINOS_SOURCE_BRANCH TSQR-Dec2019
TRILINOS_SOURCE_REPO https://github.com/mhoemmen/Trilinos
TRILINOS_SOURCE_SHA 842320a
TRILINOS_TARGET_BRANCH develop
TRILINOS_TARGET_REPO https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA c8e4314

Build Information

Test Name: Trilinos_pullrequest_intel_17.0.1

  • Build Num: 5083
  • Status: PASSED

Jenkins Parameters

Parameter Name Value
PULLREQUESTNUM 6488
TEST_REPO_ALIAS TRILINOS
TRILINOS_SOURCE_BRANCH TSQR-Dec2019
TRILINOS_SOURCE_REPO https://github.com/mhoemmen/Trilinos
TRILINOS_SOURCE_SHA 842320a
TRILINOS_TARGET_BRANCH develop
TRILINOS_TARGET_REPO https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA c8e4314

Build Information

Test Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL

  • Build Num: 3512
  • Status: PASSED

Jenkins Parameters

Parameter Name Value
PULLREQUESTNUM 6488
TEST_REPO_ALIAS TRILINOS
TRILINOS_SOURCE_BRANCH TSQR-Dec2019
TRILINOS_SOURCE_REPO https://github.com/mhoemmen/Trilinos
TRILINOS_SOURCE_SHA 842320a
TRILINOS_TARGET_BRANCH develop
TRILINOS_TARGET_REPO https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA c8e4314

Build Information

Test Name: Trilinos_pullrequest_gcc_7.2.0

  • Build Num: 3360
  • Status: PASSED

Jenkins Parameters

Parameter Name Value
PULLREQUESTNUM 6488
TEST_REPO_ALIAS TRILINOS
TRILINOS_SOURCE_BRANCH TSQR-Dec2019
TRILINOS_SOURCE_REPO https://github.com/mhoemmen/Trilinos
TRILINOS_SOURCE_SHA 842320a
TRILINOS_TARGET_BRANCH develop
TRILINOS_TARGET_REPO https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA c8e4314

Build Information

Test Name: Trilinos_pullrequest_cuda_9.2

  • Build Num: 2914
  • Status: PASSED

Jenkins Parameters

Parameter Name Value
JENKINS_JOB_TYPE Experimental
PULLREQUESTNUM 6488
TEST_REPO_ALIAS TRILINOS
TRILINOS_SOURCE_BRANCH TSQR-Dec2019
TRILINOS_SOURCE_REPO https://github.com/mhoemmen/Trilinos
TRILINOS_SOURCE_SHA 842320a
TRILINOS_TARGET_BRANCH develop
TRILINOS_TARGET_REPO https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA c8e4314

Build Information

Test Name: Trilinos_pullrequest_python_2

  • Build Num: 1245
  • Status: PASSED

Jenkins Parameters

Parameter Name Value
PULLREQUESTNUM 6488
TEST_REPO_ALIAS TRILINOS
TRILINOS_SOURCE_BRANCH TSQR-Dec2019
TRILINOS_SOURCE_REPO https://github.com/mhoemmen/Trilinos
TRILINOS_SOURCE_SHA 842320a
TRILINOS_TARGET_BRANCH develop
TRILINOS_TARGET_REPO https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA c8e4314

Build Information

Test Name: Trilinos_pullrequest_python_3

  • Build Num: 1245
  • Status: PASSED

Jenkins Parameters

Parameter Name Value
PULLREQUESTNUM 6488
TEST_REPO_ALIAS TRILINOS
TRILINOS_SOURCE_BRANCH TSQR-Dec2019
TRILINOS_SOURCE_REPO https://github.com/mhoemmen/Trilinos
TRILINOS_SOURCE_SHA 842320a
TRILINOS_TARGET_BRANCH develop
TRILINOS_TARGET_REPO https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA c8e4314


CDash Test Results for PR# 6488.

@trilinos-autotester trilinos-autotester removed the AT: RETEST Causes the PR autotester to run a new round of PR tests on the next iteration label Dec 23, 2019
@trilinos-autotester
Copy link
Contributor

Status Flag 'Pre-Merge Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED AND APPROVED by [ alanw0 ]!

@trilinos-autotester
Copy link
Contributor

Status Flag 'Pull Request AutoTester' - Pull Request will be Automerged

@trilinos-autotester trilinos-autotester merged commit 9c0d1c6 into trilinos:develop Dec 23, 2019
@trilinos-autotester
Copy link
Contributor

Merge on Pull Request# 6488: IS A SUCCESS - Pull Request successfully merged

@trilinos-autotester trilinos-autotester removed the AT: AUTOMERGE Causes the PR autotester to automatically merge the PR branch once approvals are completed label Dec 23, 2019
@mhoemmen mhoemmen deleted the TSQR-Dec2019 branch December 23, 2019 21:17
@mhoemmen
Copy link
Contributor Author

Thanks all! :-D @kddevin btw I figured out how to make TriBITS automatically detect CUBLAS and CUSOLVER -- by using TRIBITS_TPL_TENTATIVELY_ENABLE(CUBLAS) etc. in packages/tpetra/tsqr/cmake/Dependencies.cmake. I'll put up a new PR with that change soon.

@srajama1
Copy link
Contributor

srajama1 commented Jan 3, 2020

@kddevin wrote:

We'll eventually want to use KokkosKernels for all cublas-like operations, rather than call cublas directly. Newer platforms will not have cublas, and responsibility for on-node operations should lie in KokkosKernels. We'll make conversion to KokkosKernels a goal for later in the FY.

I'm just waiting for a complete kokkos-kernels BLAS. I'll be happy to switch once it's ready and I know that it's actually calling TPLs.

The BLAS/LAPACK calls are added on a need-only basis. We are happy to add the calls needed by Tpetra. Please file a GitHub issue with the calls needed.

@kddevin
Copy link
Contributor

kddevin commented Jan 10, 2020

@mhoemmen In the TSQR code, I see the following LAPACK calls:

  • UNGQR (in compute_explicit_Q_lwork and compute_explicit_Q)
  • GEQRF (in compute_QR_lwork and compute_QR)
  • UNMQR (in apply_Q_factor_lwork and apply_Q_factor)
  • LARFG (in factor_inner and factor_pair)
  • GESVD (in reveal_R_rank)
  • POTRF (in factor)
  • LARNV (in fill_buffer)

Are all of these needed in KokkosKernels? I see only the following reimplemented for CUDA in Tsqr_Impl_CuSolver.cpp:

  • apply_Q_factor, apply_Q_factor_lwork
  • compute_QR, compute_QR_lwork
  • compute_explicit_Q. compute_explicit_Q_lwork

Does that mean we would need to request only the following from KokkosKernels?

  • UNGQR (in compute_explicit_Q_lwork and compute_explicit_Q)
  • GEQRF (in compute_QR_lwork and compute_QR)
  • UNMQR (in apply_Q_factor_lwork and apply_Q_factor)

Thanks @mhoemmen .

@kddevin
Copy link
Contributor

kddevin commented Jan 10, 2020

Continuing to @mhoemmen
Or is the NodeTsqr class the more appropriate level of abstraction for KokkosKernels (with CuSolverNodeTsqr being one implementation)? That is, does NodeTsqr fit more appropriately into KokkosKernels than Tpetra? Thanks!

@mhoemmen
Copy link
Contributor Author

mhoemmen commented Jan 13, 2020

@kddevin Are LAPACK functions actually in-scope for kokkos-kernels? It seemed a bit too much to ask for LAPACK when there were BLAS functions left to implement.

@kddevin
Copy link
Contributor

kddevin commented Jan 13, 2020

Given the name of the class (NodeTsqr), should the capability be in KokkosKernels rather than in Tpetra? The comments say the TSQR implementation is not Tpetra-specific and, indeed, no Tpetra data structures are used in the implementation. Probably the entire capability should be moved into KokkosKernels or one of the solver packages (Belos, ShyLu, etc.).

@srajama1
Copy link
Contributor

srajama1 commented Jan 13, 2020

@kddevin Are LAPACK functions actually in-scope for kokkos-kernels? It seemed a bit too much to ask for LAPACK when there were BLAS functions left to implement.

Yes, we do support some LAPACK functionality at the team level. It is driven by user requests. It is going to take a long time to achieve complete support.

Sorry, @mhoemmen I edited your comment by mistake instead of quoting it. Fixed it back.

@kddevin
Copy link
Contributor

kddevin commented Jan 13, 2020

@srajama1 Is a NodeTsqr appropriate for KokkosKernels or a better fit for something like ShyLu?

@mhoemmen
Copy link
Contributor Author

@kddevin I don't really care where things live, as long as they get enabled and tested by default. There's a risk putting stuff too far downstream in packages like ShyLU, that historically had (have?) components that weren't enabled or tested by default. Also, TSQR is out of scope for ShyLU.

@srajama1
Copy link
Contributor

@kddevin I don't have particular preference for where TSQR lives. That is primarily the decision of the developer given everything else is equal. However, ShyLU is not the right place (ShyLU_Node is mainly sparse factorizations and solvers needed by DD methods). Kokkos Kernels might be ok, Belos might be ok. Essentially, the questions are

  1. Where are we going to put other variations like CholQR ?
  2. Where would an user go looking for this kernel logically ?
  3. What dependencies need to be satisfied ?

@iyamazaki might have some thoughts as well.

@mhoemmen
Copy link
Contributor Author

TSQR depends on MPI as well as Kokkos. This makes it out of scope for kokkos-kernels.

@srajama1
Copy link
Contributor

That leaves Belos or Tpetra then. May be a silly question, if it depends on MPI, why is it NodeTSQR ? I guess TSQR depends on MPI and NodeTSQR doesn't. I assumed Karen was asking about NodeTSQR.

@kddevin
Copy link
Contributor

kddevin commented Jan 15, 2020

There's NodeTSQR and DistTSQR. Probably doesn't make sense to separate them. I will file a KokkosKernels request for the needed support.

@kddevin
Copy link
Contributor

kddevin commented Jan 15, 2020

kokkos/kokkos-kernels#567
Requested KokkosKernels support for cusolve calls.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants