Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tpetra: Isolate sparse matrix-matrix multiply local kernel #148

Closed
mhoemmen opened this issue Feb 18, 2016 · 23 comments
@mhoemmen (Contributor)

@trilinos/tpetra @mndevec @srajama1 @csiefer2

In Tpetra, we need to isolate the "local" part of the sparse matrix-matrix multiply kernel from the "global" (MPI communication) part. This will facilitate thread parallelization and the use of third-party libraries' implementations of the local kernel.
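To make the intended split concrete, here is a minimal sketch, assuming plain CSR storage and illustrative names throughout (this is not a proposed Tpetra API). The "global" part would Import the needed remote rows of B up front; the local kernel below is then pure on-process arithmetic that a threading backend or a third-party library (MKL, cuSPARSE, ...) could own.

    // Illustrative sketch only, not a proposed Tpetra interface. The "global"
    // phase (an MPI Import of the remote rows of B) runs first; the local
    // kernel below then sees nothing but on-process CSR arrays, so it can be
    // threaded or handed wholesale to a third-party implementation.
    #include <cstddef>
    #include <vector>

    struct Csr {                        // plain on-process CSR; no Maps, no MPI
      std::vector<std::size_t> rowPtr;  // length numRows + 1
      std::vector<int>         ind;     // local column indices
      std::vector<double>      val;
      int numCols = 0;
    };

    // Purely local kernel: C = A * B, Gustavson-style with a dense accumulator.
    Csr localSpGemm(const Csr& A, const Csr& B) {
      Csr C;
      C.numCols = B.numCols;
      C.rowPtr.push_back(0);
      std::vector<double> accum(B.numCols, 0.0);
      std::vector<char>   used(B.numCols, 0);
      std::vector<int>    nz;           // columns touched in the current row
      for (std::size_t i = 0; i + 1 < A.rowPtr.size(); ++i) {
        nz.clear();
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i+1]; ++k)
          for (std::size_t j = B.rowPtr[A.ind[k]]; j < B.rowPtr[A.ind[k]+1]; ++j) {
            const int c = B.ind[j];
            if (!used[c]) { used[c] = 1; nz.push_back(c); }
            accum[c] += A.val[k] * B.val[j];
          }
        for (const int c : nz) {        // scatter this row into C, reset scratch
          C.ind.push_back(c);
          C.val.push_back(accum[c]);
          accum[c] = 0.0;
          used[c]  = 0;
        }
        C.rowPtr.push_back(C.ind.size());
      }
      return C;
    }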

@mhoemmen (Contributor Author)

MueLu sometimes actually wants to compute the Jacobi update $C = (I - \omega D^{-1} A) B$.

@csiefer2 (Member)

We need both Jacobi & SpMM. They're basically the same kernel if you're smart about it.
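One way to see why they're basically the same kernel: expanding the Jacobi update gives

$C = (I - \omega D^{-1} A) B = B - \omega D^{-1} (A B)$,

so the Jacobi kernel is an ordinary SpMM $A B$ whose result rows are scaled by $-\omega D^{-1}$ on the fly, with the matching row of $B$ added in.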

@mhoemmen (Contributor Author)

Andrey just explained to me that muelu/src/Misc/MueLu_RAPFactory_def.hpp uses Multiply, so yes :)

@mhoemmen (Contributor Author)

A*P uses mult_A_B_newmatrix with B = P, which computes $C := A (B_{local} + B_{remote})$, where $B_{remote}$ holds the rows of $B$ imported from other processes.

@mhoemmen added the pkg: Tpetra label, and added and then removed the resolved: duplicate label, on Feb 18, 2016
@mhoemmen (Contributor Author)

I'm looking at mult_A_B_newmatrix now. c_status is just something used internally. There are additional data structures that map between local indices of A and local indices of B_local and B_remote (see above); a sketch of how they drive the inner loop follows this list:

  • Bk = targetMapToOrigRow[Ak] is valid if the local column of A, A.ind[k], is a local row of B_local. In that case, Bk is the local row of B_local at which to look. Otherwise, Ik = targetMapToImportRow[Ak] is the local row of B_remote at which to look.
  • Bcol2Ccol[B_local.ind[j]] translates a local column index of B_local into the corresponding local column index of C.
  • Icol2Ccol[B_remote.ind[j]] does the same for local column indices of B_remote.
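A schematic of how these maps could drive a Gustavson-style inner loop, written against plain CSR arrays (all identifiers and types here are illustrative; this is not the actual Tpetra code):

    // Hypothetical sketch of the mult_A_B_newmatrix inner loop over plain CSR
    // arrays; all identifiers are illustrative, not the real Tpetra names.
    #include <cstddef>
    #include <vector>

    using LO = int;                     // local ordinal
    constexpr LO INVALID = -1;          // "not a local row of B_local"

    struct Csr {
      std::vector<std::size_t> rowPtr;
      std::vector<LO>          ind;
      std::vector<double>      val;
    };

    // Accumulate one row of C = A * (B_local + B_remote) into a dense scratch
    // array indexed by C's local column indices.
    void multiplyRow(LO i, const Csr& A, const Csr& Blocal, const Csr& Bremote,
                     const std::vector<LO>& targetMapToOrigRow,   // A col -> B_local row
                     const std::vector<LO>& targetMapToImportRow, // A col -> B_remote row
                     const std::vector<LO>& Bcol2Ccol,            // B_local col -> C col
                     const std::vector<LO>& Icol2Ccol,            // B_remote col -> C col
                     std::vector<double>& cRowAccum)              // sized to C's column count
    {
      for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i+1]; ++k) {
        const LO Ak = A.ind[k];
        const double Aval = A.val[k];
        if (targetMapToOrigRow[Ak] != INVALID) {
          const LO Bk = targetMapToOrigRow[Ak];     // read a row of B_local
          for (std::size_t j = Blocal.rowPtr[Bk]; j < Blocal.rowPtr[Bk+1]; ++j)
            cRowAccum[Bcol2Ccol[Blocal.ind[j]]] += Aval * Blocal.val[j];
        } else {
          const LO Ik = targetMapToImportRow[Ak];   // read a row of B_remote
          for (std::size_t j = Bremote.rowPtr[Ik]; j < Bremote.rowPtr[Ik+1]; ++j)
            cRowAccum[Icol2Ccol[Bremote.ind[j]]] += Aval * Bremote.val[j];
        }
      }
    }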

@mhoemmen (Contributor Author)

The output matrix C of mult_A_B_newmatrix gets its column Map and Import as the union ("setUnion") of the column Maps and Imports, respectively, of B_local and B_remote.

    // NOTE: This is not efficient and should be folded into setUnion
    Icol2Ccol.resize(Bview.importMatrix->getColMap()->getNodeNumElements());
    ArrayView<const GlobalOrdinal> Bgid = Bview.origMatrix->getColMap()->getNodeElementList();
    ArrayView<const GlobalOrdinal> Igid = Bview.importMatrix->getColMap()->getNodeElementList();
    for(size_t i=0; i<Bview.origMatrix->getColMap()->getNodeNumElements(); i++)
      Bcol2Ccol[i] = Ccolmap->getLocalElement(Bgid[i]);
    for(size_t i=0; i<Bview.importMatrix->getColMap()->getNodeNumElements(); i++)
      Icol2Ccol[i] = Ccolmap->getLocalElement(Igid[i]);

@aprokop (Contributor) commented Feb 25, 2016

I just want to document that I tried a slightly different variant to speed things up, but was not successful: accumulate all the partial products for a row first, then merge them together.

https://gist.github.com/aprokop/05ba603a934c7a96ac80
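For reference, a minimal sketch of the accumulate-then-merge idea (a reconstruction for illustration; not the code from the gist):

    // Sketch of "accumulate products, then merge": gather every partial
    // product (column, value) for a row unsorted, then sort by column and
    // combine duplicates. Illustrative only; not the code from the gist.
    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Csr {
      std::vector<std::size_t> rowPtr;
      std::vector<int>         ind;
      std::vector<double>      val;
    };

    std::vector<std::pair<int, double>>
    multiplyRowAccumulateMerge(int i, const Csr& A, const Csr& B) {
      // Phase 1: accumulate all partial products for row i, duplicates and all.
      std::vector<std::pair<int, double>> products;
      for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i+1]; ++k)
        for (std::size_t j = B.rowPtr[A.ind[k]]; j < B.rowPtr[A.ind[k]+1]; ++j)
          products.emplace_back(B.ind[j], A.val[k] * B.val[j]);

      // Phase 2: merge -- sort by column, then sum runs with equal column index.
      std::sort(products.begin(), products.end());
      std::vector<std::pair<int, double>> merged;
      for (const auto& p : products) {
        if (!merged.empty() && merged.back().first == p.first)
          merged.back().second += p.second;
        else
          merged.push_back(p);
      }
      return merged;
    }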

I would like for us to discuss what kind of improvements we may want to consider, and what is the path forward for multithreaded MMM.

@mhoemmen @mndevec @csiefer2

@srajama1 (Contributor)

Mehmet and I have an algorithm worked out that we expect to be "performance portable". Mehmet is implementing that. We are expecting Chris & Chris to take care of the distributed memory stuff.

@aprokop (Contributor) commented Feb 25, 2016

During the dungeon I wrote a quick serial wrapper hooking the experimental Kokkos graph kernels into MMM. While I no longer have the numbers, the KK1 version ran many times slower than the existing implementation on a single thread.

@srajama1 (Contributor)

What you ran is not the new one; that is a preliminary implementation we started with so that we would have something in place. The new one will have two steps, symbolic and numeric. You want it faster than we can develop it. Is there some urgency on this? We were planning for late April or early May.

If someone wants to isolate the local kernel before then, that is OK. But the threaded kernel on all three platforms that we care about will take some work.
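For context, the usual shape of such a symbolic/numeric split, as a generic sketch (not the kernels under development): symbolic computes the sparsity pattern of C once; numeric fills the values and can be re-run whenever only the values of A or B change.

    // Generic sketch of a two-phase SpGEMM: symbolic counts and structures the
    // rows of C once; numeric fills values and can be re-run when A's or B's
    // values change but their sparsity patterns do not. Illustrative only.
    #include <cstddef>
    #include <vector>

    struct Csr {
      std::vector<std::size_t> rowPtr;
      std::vector<int>         ind;
      std::vector<double>      val;
      int numCols = 0;
    };

    // Symbolic phase: compute C.rowPtr and C.ind (pattern only) for C = A*B.
    void spgemmSymbolic(const Csr& A, const Csr& B, Csr& C) {
      const std::size_t m = A.rowPtr.size() - 1;
      C.rowPtr.assign(m + 1, 0);
      C.numCols = B.numCols;
      std::vector<char> marker(B.numCols, 0);
      for (std::size_t i = 0; i < m; ++i) {
        std::vector<int> cols;
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i+1]; ++k)
          for (std::size_t j = B.rowPtr[A.ind[k]]; j < B.rowPtr[A.ind[k]+1]; ++j)
            if (!marker[B.ind[j]]) { marker[B.ind[j]] = 1; cols.push_back(B.ind[j]); }
        for (int c : cols) marker[c] = 0;      // reset for the next row
        C.rowPtr[i+1] = C.rowPtr[i] + cols.size();
        C.ind.insert(C.ind.end(), cols.begin(), cols.end());
      }
      C.val.assign(C.ind.size(), 0.0);
    }

    // Numeric phase: fill C.val, reusing the pattern computed above.
    void spgemmNumeric(const Csr& A, const Csr& B, Csr& C) {
      std::vector<double> accum(B.numCols, 0.0);
      const std::size_t m = A.rowPtr.size() - 1;
      for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i+1]; ++k)
          for (std::size_t j = B.rowPtr[A.ind[k]]; j < B.rowPtr[A.ind[k]+1]; ++j)
            accum[B.ind[j]] += A.val[k] * B.val[j];
        for (std::size_t p = C.rowPtr[i]; p < C.rowPtr[i+1]; ++p) {
          C.val[p] = accum[C.ind[p]];
          accum[C.ind[p]] = 0.0;               // reset for the next row
        }
      }
    }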

@aprokop (Contributor) commented Feb 25, 2016

No particular urgency, just a general desire to see faster MMM.
Is there a way I can look at the newest implementation/algorithm?

@srajama1 (Contributor)

We can explain our current thinking in person at a whiteboard. The primary trouble is getting it to work on GPUs.

@mndevec (Contributor) commented Feb 25, 2016

@aprokop
I don't have a complete implementation of the matrix multiplication yet; I only have interfaces to MKL, cuSPARSE, and CUSP.
Which one did you compare against the sequential implementation?

@aprokop (Contributor) commented Feb 25, 2016

I tried SPGEMM_MKL and SPGEMM_KK1. KK1 was the slow one, I believe.

@mndevec (Contributor) commented Feb 25, 2016

KK1 is incomplete; it does not have the multiplication. It shouldn't produce any result.

@aprokop (Contributor) commented Feb 25, 2016

I did not check what it produced; I just checked the runtime when running an example in Tpetra.

@mhoemmen added the story label Sep 16, 2016
@mhoemmen added this to the Tpetra-threading milestone Sep 19, 2016
@mhoemmen added the task label and removed the story label Sep 19, 2016
@mhoemmen (Contributor Author)

@jhux2 @csiefer2 @trilinos/muelu Please tell us which sparse matrix-matrix multiply routines matter most.

  • mult_A_B_newmatrix?
  • jacobi_A_B_newmatrix?
  • What about *_reuse?

@mhoemmen (Contributor Author)

@csiefer2 @jhux2 Also, is there some reason why mult_A_B_newmatrix and jacobi_A_B_newmatrix use entirely different local sparse matrix-matrix multiply algorithms?

mhoemmen pushed a commit that referenced this issue Sep 28, 2016
@trilinos/tpetra This will contribute to #148.
@aprokop (Contributor) commented Oct 4, 2016

@mhoemmen

> Please tell us which sparse matrix-matrix multiply routines matter most.
>   • mult_A_B_newmatrix?
>   • jacobi_A_B_newmatrix?
>   • What about *_reuse?

The *newmatrix variants all matter. *_reuse matters significantly less, as MueLu's reuse capability has not been heavily used yet.

@csiefer2 (Member) commented Oct 4, 2016

@mhoemmen: mult_A_B_newmatrix and jacobi_A_B_newmatrix use almost identical local algorithms. The only difference is whether or not you have the Jacobi part. The code is cut and paste because I could not think of a good way to do it all in one routine.
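For what it's worth, one generic way to fold the two into a single routine (a sketch under assumed names, not the actual Tpetra code): template the row kernel on a small policy, where plain multiply contributes $A B$ unscaled, and the Jacobi variant seeds each row of C with the matching row of B and scales the $A B$ contribution by $-\omega D^{-1}$.

    // Sketch of unifying mult_A_B and jacobi_A_B in one templated routine; all
    // names are illustrative, not the real Tpetra code. The policy is the only
    // difference: plain multiply starts each row of C empty with scale 1, while
    // the Jacobi variant seeds it with the matching row of B and scales the
    // A*B contribution by -omega * D^{-1}.
    #include <cstddef>
    #include <vector>

    struct Csr {
      std::vector<std::size_t> rowPtr;
      std::vector<int>         ind;
      std::vector<double>      val;
      int numCols = 0;
    };

    struct PlainMultiply {              // C = A * B
      double scale(int) const { return 1.0; }
      void seedRow(int, std::vector<double>&) const {}
    };

    struct JacobiUpdate {               // C = B - omega * D^{-1} * (A * B)
      const std::vector<double>* Dinv;  // inverse diagonal entries
      double omega;
      const Csr* B;
      double scale(int row) const { return -omega * (*Dinv)[row]; }
      void seedRow(int row, std::vector<double>& accum) const {
        for (std::size_t j = B->rowPtr[row]; j < B->rowPtr[row+1]; ++j)
          accum[B->ind[j]] += B->val[j];
      }
    };

    template <class Policy>
    void rowKernel(int i, const Csr& A, const Csr& B, const Policy& p,
                   std::vector<double>& accum) { // accum sized to B.numCols
      p.seedRow(i, accum);                       // no-op for plain multiply
      const double s = p.scale(i);
      for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i+1]; ++k)
        for (std::size_t j = B.rowPtr[A.ind[k]]; j < B.rowPtr[A.ind[k]+1]; ++j)
          accum[B.ind[j]] += s * A.val[k] * B.val[j];
    }

Whether the extra abstraction beats the cut-and-paste is of course a judgment call.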

@ambrad (Contributor) commented Oct 4, 2016

@aprokop, now that RILUK supports reuse properly and efficiently, does that change the importance of reuse? Or are there other blockers to effective reuse in MueLu?

@aprokop (Contributor) commented Oct 4, 2016

@ambrad I have not looked at the recent RILUK changes, but if, as you say, it properly separates symbolic() and numeric(), that definitely makes things easier. If everything on the path to AdditiveSchwarz with overlapping subdomains and RILUK separates the symbolic() and numeric() phases correctly, I believe a proper reuse implementation in MueLu could be written quickly.

One concern is the overlapping-matrix part, where I think the import happens during the Setup() phase.

@mhoemmen modified the milestones: Tpetra-backlog, Tpetra-threading Nov 2, 2016
@mhoemmen (Contributor Author)

I'm going to close this, because it kind of happened.
