-
Notifications
You must be signed in to change notification settings - Fork 578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TpetraTSQR: Run on NVIDIA GPUs; ensure correct testing; refactor #6488
TpetraTSQR: Run on NVIDIA GPUs; ensure correct testing; refactor #6488
Conversation
"Virtualize" TSQR::Tsqr's use of NodeTsqr subclasses. 1. Tsqr is now no longer templated on the concrete NodeTsqr subclass type. 2. NodeTsqr no longer has a FactorOutput template parameter. 3. All NodeTsqr subclass' "FactorOutput" (returned by factor) types now inherit from a base class. NodeTsqr subclasses' implementations of apply etc. must now dynamic_cast from that base class to their concrete "FactorOutput" type. 4. Change Epetra, Tpetra, and Stokhos specializations of TsqrAdaptor to remove the third template argument of TSQR::Tsqr. (3) above breaks TbbTsqr, but we haven't tested that for nearly a decade so it may not work anyway. The goal is to support subclasses of NodeTsqr that use TPLs like cuSOLVER. In order to do that, we need to protect downstream code from TPL includes. This means virtualizing both all use of NodeTsqr, and the return type of NodeTsqr::factor.
Tpetra has deprecated and will remove Node types. Help speed this process by referring directly to device_type etc. instead of node_type.
NodeTsqrFactory now is actually a factory: it has a static getNodeTsqr method that uses run-time information (the Kokkos execution space's concurrency()) to decide what NodeTsqr subclass type to return. There are two goals: 1. Use KokkosNodeTsqr where possible, for CPU thread parallelism. 2. Later, to enable use of a cuSOLVER-based NodeTsqr implementation.
1. Make "Full" TSQR test initialize and finalize Kokkos. 2. Add more debug printing to ensure that the NodeTsqr subclass type is actually KokkosNodeTsqr when that's appropriate.
NOTE: CMakeLists.txt currently sets --alwaysUseSequentialTsqr --noTestComplex. Otherwise, the test won't pass. Even with --alwaysUseSequentialTsqr, the test FAILS with Scalar=complex<{float,double}>, even when using SequentialTsqr. (You can exercise this without changing CMakeLists.txt or command-line arguments by setting OMP_NUM_THREADS=1.) (It passes for ALL Scalar types with 100 rows and 5 columns.) This is why we set --noTestComplex by default in CMakeLists.txt. Without --alwaysUseSequentialTsqr, the test FAILS with ALL Scalar types when using KokkosNodeTsqr with number of rows = 10000 and number of columns = 5. (It passes for ALL Scalar types with 100 rows and 5 columns.) This is why we set --alwaysUseSequentialTsqr by default in CMakeLists.txt. We aim to fix these issues.
Making the methods of Combine nonconst makes it correct for Combine to use CombineDefault as the type of impl_. (We discovered this issue by changing the Combine test to exercise multiple impl_ types.) This requires changes to KokkosNodeTsqr. Those changes are not related to thread safety, though, since each operator() invocation for both the factor and apply kernels creates a separate Combine instance.
The Combine test now exercises both CombineDefault and CombineNative. Before, it was only exercising CombineNative.
CombineNodeTsqr just uses Combine. Make NodeTsqrFactory return CombineNodeTsqr in the complex case. I had hopes that this would fix the complex case of the full TSQR test, but it doesn't. I plan to do the following: 1. Add a separate test for CombineNodeTsqr (it's interesting that the rank-revealing part of the full TSQR test failed -- it reported a rank of 0), and 2. Improve the DistTsqr test.
Also add --NodeTsqr command-line argument to full test, and get rid of that test's --alwaysUseSequentialTsqr option (in favor of setting --NodeTsqr=SequentialTsqr).
It's currently just an executable; it doesn't get run with ctest yet.
Improve NodeTsqr test output in other ways as well.
CombineNodeTsqr::factor was not copying the R factor out of the factored matrix A. This is why the R factor was showing up as all zeros.
Now the test can actually fail; we tested this.
1. Remove SequentialTsqr-specific test executable. 2. Fix minor issue in generic NodeTsqr test when using contiguous cache blocks.
KokkosNodeTsqr isn't working quite yet, so I changed NodeTsqrFactory so that KokkosNodeTsqr is never the default NodeTsqr type. Users can still request KokkosNodeTsqr explicitly by name. This change made it possible for me to remove a work-around in the full TSQR tests, since the default NodeTsqr type is now correct for all Scalar and Device types.
1. Make Combine::apply_pair take MatView instead of raw pointers. 2. Remove all TbbTsqr-related files and code. (2) is related to (1); we don't have testing for TbbTsqr any more so there's no way to test whether any changes we made to Combine's interface might have broken TbbTsqr.
Make Combine::apply_pair take MatView instead of raw pointers. This completes "MatView-ization" of Combine and its implementations. That in turn serves our end goal of letting us use TPLs like cuSOLVER for the intraprocess part of TSQR.
Use partition_2x1 instead of assuming column-major layout, in an implementation detail of SequentialTsqr.
@alanw0 wrote:
That was absolutely the goal. Almost all commits have that property at least on the computer on which I was testing. That's the main reason why there are so many commits -- I wanted to move always from a state of "passing tests" to another state of "passing tests." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll eventually want to use KokkosKernels for all cublas-like operations, rather than call cublas directly. Newer platforms will not have cublas, and responsibility for on-node operations should lie in KokkosKernels. We'll make conversion to KokkosKernels a goal for later in the FY.
@kddevin wrote:
I'm just waiting for a complete kokkos-kernels BLAS. I'll be happy to switch once it's ready and I know that it's actually calling TPLs. |
@kddevin Also, the point of these changes is cuSOLVER, not cuBLAS. Not all platforms will necessarily have an accelerator-based LAPACK. |
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED Pull Request Auto Testing has PASSED (click to expand)Build InformationTest Name: Trilinos_pullrequest_gcc_4.8.4
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_intel_17.0.1
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_4.9.3_SERIAL
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_gcc_7.2.0
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_cuda_9.2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_2
Jenkins Parameters
Build InformationTest Name: Trilinos_pullrequest_python_3
Jenkins Parameters
|
Status Flag 'Pre-Merge Inspection' - SUCCESS: The last commit to this Pull Request has been INSPECTED AND APPROVED by [ alanw0 ]! |
Status Flag 'Pull Request AutoTester' - Pull Request will be Automerged |
Merge on Pull Request# 6488: IS A SUCCESS - Pull Request successfully merged |
Thanks all! :-D @kddevin btw I figured out how to make TriBITS automatically detect CUBLAS and CUSOLVER -- by using |
The BLAS/LAPACK calls are added on a need-only basis. We are happy to add the calls needed by Tpetra. Please file a GitHub issue with the calls needed. |
@mhoemmen In the TSQR code, I see the following LAPACK calls:
Are all of these needed in KokkosKernels? I see only the following reimplemented for CUDA in Tsqr_Impl_CuSolver.cpp:
Does that mean we would need to request only the following from KokkosKernels?
Thanks @mhoemmen . |
Continuing to @mhoemmen |
@kddevin Are LAPACK functions actually in-scope for kokkos-kernels? It seemed a bit too much to ask for LAPACK when there were BLAS functions left to implement. |
Given the name of the class (NodeTsqr), should the capability be in KokkosKernels rather than in Tpetra? The comments say the TSQR implementation is not Tpetra-specific and, indeed, no Tpetra data structures are used in the implementation. Probably the entire capability should be moved into KokkosKernels or one of the solver packages (Belos, ShyLu, etc.). |
Yes, we do support some LAPACK functionality at the team level. It is driven by user requests. It is going to take a long time to achieve complete support. Sorry, @mhoemmen I edited your comment by mistake instead of quoting it. Fixed it back. |
@srajama1 Is a NodeTsqr appropriate for KokkosKernels or a better fit for something like ShyLu? |
@kddevin I don't really care where things live, as long as they get enabled and tested by default. There's a risk putting stuff too far downstream in packages like ShyLU, that historically had (have?) components that weren't enabled or tested by default. Also, TSQR is out of scope for ShyLU. |
@kddevin I don't have particular preference for where TSQR lives. That is primarily the decision of the developer given everything else is equal. However, ShyLU is not the right place (ShyLU_Node is mainly sparse factorizations and solvers needed by DD methods). Kokkos Kernels might be ok, Belos might be ok. Essentially, the questions are
@iyamazaki might have some thoughts as well. |
TSQR depends on MPI as well as Kokkos. This makes it out of scope for kokkos-kernels. |
That leaves Belos or Tpetra then. May be a silly question, if it depends on MPI, why is it NodeTSQR ? I guess TSQR depends on MPI and NodeTSQR doesn't. I assumed Karen was asking about NodeTSQR. |
There's NodeTSQR and DistTSQR. Probably doesn't make sense to separate them. I will file a KokkosKernels request for the needed support. |
kokkos/kokkos-kernels#567 |
@trilinos/tpetra @trilinos/belos @trilinos/anasazi @iyamazaki @ndellingwood
Motivation
Stakeholder Feedback
This is mostly funded by the Pressio project, who specifically requested GPU support and optimizations for matrices with more columns. @fnrizzi may wish to comment. ECP ExaWind may also have an interest in TSQR for CA-GMRES. @iyamazaki may wish to comment.
Context
I first added this TSQR implementation to Trilinos in 2010. One learns something after a decade. Trilinos has also changed a lot, in particular due to the introduction of Kokkos (what Trilinos developers would call Kokkos >= 2.0; 1.0 was Chris Baker's 2009 project, and 0.0 was a library of computational kernels circa 2004-5). The lack of Kokkos at the time explains the odd mix of custom matrix views and raw pointers in the interface. I haven't tried to fix all that here.