Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tpetra: Isolate sparse matrix-matrix multiply local kernel #148

Closed
mhoemmen opened this issue Feb 18, 2016 · 23 comments
@mhoemmen (Contributor)

@trilinos/tpetra @mndevec @srajama1 @csiefer2

In Tpetra, we need to isolate the "local" part of the sparse matrix-matrix multiply kernel from the "global" (MPI communication) part. This will facilitate thread parallelization and the use of third-party libraries' implementations of the local kernel.
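To make the intended split concrete, here is a minimal sketch, assuming plain CSR storage and illustrative names throughout (this is not a proposed Tpetra API). The "global" part would Import the needed remote rows of B up front; the local kernel below is then pure on-process arithmetic that a threading backend or a third-party library (MKL, cuSPARSE, ...) could own.

    // Illustrative sketch only, not a proposed Tpetra interface. The "global"
    // phase (an MPI Import of the remote rows of B) runs first; the local
    // kernel below then sees nothing but on-process CSR arrays, so it can be
    // threaded or handed wholesale to a third-party implementation.
    #include <cstddef>
    #include <vector>

    struct Csr {                        // plain on-process CSR; no Maps, no MPI
      std::vector<std::size_t> rowPtr;  // length numRows + 1
      std::vector<int>         ind;     // local column indices
      std::vector<double>      val;
      int numCols = 0;
    };

    // Purely local kernel: C = A * B, Gustavson-style with a dense accumulator.
    Csr localSpGemm(const Csr& A, const Csr& B) {
      Csr C;
      C.numCols = B.numCols;
      C.rowPtr.push_back(0);
      std::vector<double> accum(B.numCols, 0.0);
      std::vector<char>   used(B.numCols, 0);
      std::vector<int>    nz;           // columns touched in the current row
      for (std::size_t i = 0; i + 1 < A.rowPtr.size(); ++i) {
        nz.clear();
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i+1]; ++k)
          for (std::size_t j = B.rowPtr[A.ind[k]]; j < B.rowPtr[A.ind[k]+1]; ++j) {
            const int c = B.ind[j];
            if (!used[c]) { used[c] = 1; nz.push_back(c); }
            accum[c] += A.val[k] * B.val[j];
          }
        for (const int c : nz) {        // scatter this row into C, reset scratch
          C.ind.push_back(c);
          C.val.push_back(accum[c]);
          accum[c] = 0.0;
          used[c]  = 0;
        }
        C.rowPtr.push_back(C.ind.size());
      }
      return C;
    }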

@mhoemmen (Contributor Author)

MueLu sometimes actually wants to compute the Jacobi update $C = (I - \omega D^{-1} A) B$.

@csiefer2 (Member)

We need both Jacobi & SpMM. They're basically the same kernel if you're smart about it.
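One way to see why they're basically the same kernel: expanding the Jacobi update gives

$C = (I - \omega D^{-1} A) B = B - \omega D^{-1} (A B)$,

so the Jacobi kernel is an ordinary SpMM $A B$ whose result rows are scaled by $-\omega D^{-1}$ on the fly, with the matching row of $B$ added in.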

@mhoemmen (Contributor Author)

Andrey just explained to me that muelu/src/Misc/MueLu_RAPFactory_def.hpp uses Multiply, so yes :)

@mhoemmen (Contributor Author)

A*P uses mult_A_B_newmatrix with B = P, which computes $C := A (B_{local} + B_{remote})$, where $B_{remote}$ holds the rows of $B$ imported from other processes.

@mhoemmen added the pkg: Tpetra label, and added and then removed the resolved: duplicate label, on Feb 18, 2016
@mhoemmen (Contributor Author)

I'm looking at mult_A_B_newmatrix now. c_status is just something used internally. There are additional data structures that map between local indices of A and local indices of B_local and B_remote (see above); a sketch of how they drive the inner loop follows this list:

  • Bk = targetMapToOrigRow[Ak] is valid if the local column of A, A.ind[k], is a local row of B_local. In that case, Bk is the local row of B_local at which to look. Otherwise, Ik = targetMapToImportRow[Ak] is the local row of B_remote at which to look.
  • Bcol2Ccol[B_local.ind[j]] translates a local column index of B_local into the corresponding local column index of C.
  • Icol2Ccol[B_remote.ind[j]] does the same for local column indices of B_remote.
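A schematic of how these maps could drive a Gustavson-style inner loop, written against plain CSR arrays (all identifiers and types here are illustrative; this is not the actual Tpetra code):

    // Hypothetical sketch of the mult_A_B_newmatrix inner loop over plain CSR
    // arrays; all identifiers are illustrative, not the real Tpetra names.
    #include <cstddef>
    #include <vector>

    using LO = int;                     // local ordinal
    constexpr LO INVALID = -1;          // "not a local row of B_local"

    struct Csr {
      std::vector<std::size_t> rowPtr;
      std::vector<LO>          ind;
      std::vector<double>      val;
    };

    // Accumulate one row of C = A * (B_local + B_remote) into a dense scratch
    // array indexed by C's local column indices.
    void multiplyRow(LO i, const Csr& A, const Csr& Blocal, const Csr& Bremote,
                     const std::vector<LO>& targetMapToOrigRow,   // A col -> B_local row
                     const std::vector<LO>& targetMapToImportRow, // A col -> B_remote row
                     const std::vector<LO>& Bcol2Ccol,            // B_local col -> C col
                     const std::vector<LO>& Icol2Ccol,            // B_remote col -> C col
                     std::vector<double>& cRowAccum)              // sized to C's column count
    {
      for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i+1]; ++k) {
        const LO Ak = A.ind[k];
        const double Aval = A.val[k];
        if (targetMapToOrigRow[Ak] != INVALID) {
          const LO Bk = targetMapToOrigRow[Ak];     // read a row of B_local
          for (std::size_t j = Blocal.rowPtr[Bk]; j < Blocal.rowPtr[Bk+1]; ++j)
            cRowAccum[Bcol2Ccol[Blocal.ind[j]]] += Aval * Blocal.val[j];
        } else {
          const LO Ik = targetMapToImportRow[Ak];   // read a row of B_remote
          for (std::size_t j = Bremote.rowPtr[Ik]; j < Bremote.rowPtr[Ik+1]; ++j)
            cRowAccum[Icol2Ccol[Bremote.ind[j]]] += Aval * Bremote.val[j];
        }
      }
    }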

@mhoemmen (Contributor Author)

The output matrix C of mult_A_B_newmatrix gets its column Map and Import as the union ("setUnion") of the column Maps and Imports, respectively, of B_local and B_remote.

    // NOTE: This is not efficient and should be folded into setUnion
    Icol2Ccol.resize(Bview.importMatrix->getColMap()->getNodeNumElements());
    ArrayView<const GlobalOrdinal> Bgid = Bview.origMatrix->getColMap()->getNodeElementList();
    ArrayView<const GlobalOrdinal> Igid = Bview.importMatrix->getColMap()->getNodeElementList();
    for(size_t i=0; i<Bview.origMatrix->getColMap()->getNodeNumElements(); i++)
      Bcol2Ccol[i] = Ccolmap->getLocalElement(Bgid[i]);
    for(size_t i=0; i<Bview.importMatrix->getColMap()->getNodeNumElements(); i++)
      Icol2Ccol[i] = Ccolmap->getLocalElement(Igid[i]);

@aprokop (Contributor) commented Feb 25, 2016

I just want to document that I tried a slightly different variant to speed things up, but was not successful: accumulate all the partial products for a row first, then merge them together.

https://gist.github.com/aprokop/05ba603a934c7a96ac80
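For reference, a minimal sketch of the accumulate-then-merge idea (a reconstruction for illustration; not the code from the gist):

    // Sketch of "accumulate products, then merge": gather every partial
    // product (column, value) for a row unsorted, then sort by column and
    // combine duplicates. Illustrative only; not the code from the gist.
    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    struct Csr {
      std::vector<std::size_t> rowPtr;
      std::vector<int>         ind;
      std::vector<double>      val;
    };

    std::vector<std::pair<int, double>>
    multiplyRowAccumulateMerge(int i, const Csr& A, const Csr& B) {
      // Phase 1: accumulate all partial products for row i, duplicates and all.
      std::vector<std::pair<int, double>> products;
      for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i+1]; ++k)
        for (std::size_t j = B.rowPtr[A.ind[k]]; j < B.rowPtr[A.ind[k]+1]; ++j)
          products.emplace_back(B.ind[j], A.val[k] * B.val[j]);

      // Phase 2: merge -- sort by column, then sum runs with equal column index.
      std::sort(products.begin(), products.end());
      std::vector<std::pair<int, double>> merged;
      for (const auto& p : products) {
        if (!merged.empty() && merged.back().first == p.first)
          merged.back().second += p.second;
        else
          merged.push_back(p);
      }
      return merged;
    }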

I would like for us to discuss what kind of improvements we may want to consider, and what is the path forward for multithreaded MMM.

@mhoemmen @mndevec @csiefer2

@srajama1 (Contributor)

Mehmet and I have an algorithm worked out that we expect to be "performance portable". Mehmet is implementing that. We are expecting Chris & Chris to take care of the distributed memory stuff.

@aprokop (Contributor) commented Feb 25, 2016

During the dungeon I wrote a quick serial wrapper hooking the experimental Kokkos graph kernels into MMM. While I no longer have the numbers, the KK1 version ran many times slower than the existing implementation on a single thread.

@srajama1 (Contributor)

What you ran is not the new one; that is a preliminary implementation we started with so that we would have something in place. The new one will have two steps, symbolic and numeric. You want it faster than we can develop it. Is there some urgency on this? We were planning for late April or early May.

If someone wants to isolate the local kernel before then, that is OK. But the threaded kernel on all three platforms that we care about will take some work.
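For context, the usual shape of such a symbolic/numeric split, as a generic sketch (not the kernels under development): symbolic computes the sparsity pattern of C once; numeric fills the values and can be re-run whenever only the values of A or B change.

    // Generic sketch of a two-phase SpGEMM: symbolic counts and structures the
    // rows of C once; numeric fills values and can be re-run when A's or B's
    // values change but their sparsity patterns do not. Illustrative only.
    #include <cstddef>
    #include <vector>

    struct Csr {
      std::vector<std::size_t> rowPtr;
      std::vector<int>         ind;
      std::vector<double>      val;
      int numCols = 0;
    };

    // Symbolic phase: compute C.rowPtr and C.ind (pattern only) for C = A*B.
    void spgemmSymbolic(const Csr& A, const Csr& B, Csr& C) {
      const std::size_t m = A.rowPtr.size() - 1;
      C.rowPtr.assign(m + 1, 0);
      C.numCols = B.numCols;
      std::vector<char> marker(B.numCols, 0);
      for (std::size_t i = 0; i < m; ++i) {
        std::vector<int> cols;
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i+1]; ++k)
          for (std::size_t j = B.rowPtr[A.ind[k]]; j < B.rowPtr[A.ind[k]+1]; ++j)
            if (!marker[B.ind[j]]) { marker[B.ind[j]] = 1; cols.push_back(B.ind[j]); }
        for (int c : cols) marker[c] = 0;      // reset for the next row
        C.rowPtr[i+1] = C.rowPtr[i] + cols.size();
        C.ind.insert(C.ind.end(), cols.begin(), cols.end());
      }
      C.val.assign(C.ind.size(), 0.0);
    }

    // Numeric phase: fill C.val, reusing the pattern computed above.
    void spgemmNumeric(const Csr& A, const Csr& B, Csr& C) {
      std::vector<double> accum(B.numCols, 0.0);
      const std::size_t m = A.rowPtr.size() - 1;
      for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i+1]; ++k)
          for (std::size_t j = B.rowPtr[A.ind[k]]; j < B.rowPtr[A.ind[k]+1]; ++j)
            accum[B.ind[j]] += A.val[k] * B.val[j];
        for (std::size_t p = C.rowPtr[i]; p < C.rowPtr[i+1]; ++p) {
          C.val[p] = accum[C.ind[p]];
          accum[C.ind[p]] = 0.0;               // reset for the next row
        }
      }
    }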

@aprokop (Contributor) commented Feb 25, 2016

No particular urgency, just a general desire to see faster MMM.
Is there a way I can look at the newest implementation/algorithm?

@srajama1 (Contributor)

We can explain our current thinking in person at a whiteboard. The primary trouble is getting it to work on GPUs.

@mndevec (Contributor) commented Feb 25, 2016

@aprokop
I don't have a complete implementation of the matrix multiplication yet; I only have interfaces to MKL, cuSPARSE, and CUSP.
Which one did you compare against the sequential implementation?

@aprokop (Contributor) commented Feb 25, 2016

I tried SPGEMM_MKL and SPGEMM_KK1. KK1 was the slow one, I believe.

@mndevec (Contributor) commented Feb 25, 2016

KK1 is incomplete; it does not have the multiplication. It shouldn't produce any result.

@aprokop (Contributor) commented Feb 25, 2016

I did not check what it produced; I just checked the runtime when running an example in Tpetra.

@mhoemmen added the story label Sep 16, 2016
@mhoemmen added this to the Tpetra-threading milestone Sep 19, 2016
@mhoemmen added the task label and removed the story label Sep 19, 2016
@mhoemmen (Contributor Author)

@jhux2 @csiefer2 @trilinos/muelu Please tell us which sparse matrix-matrix multiply routines matter most.

  • mult_A_B_newmatrix?
  • jacobi_A_B_newmatrix?
  • What about *_reuse?

@mhoemmen (Contributor Author)

@csiefer2 @jhux2 Also, is there some reason why mult_A_B_newmatrix and jacobi_A_B_newmatrix use entirely different local sparse matrix-matrix multiply algorithms?

mhoemmen pushed a commit that referenced this issue Sep 28, 2016
@trilinos/tpetra This will contribute to #148.
@aprokop (Contributor) commented Oct 4, 2016

@mhoemmen

> Please tell us which sparse matrix-matrix multiply routines matter most.
>   • mult_A_B_newmatrix?
>   • jacobi_A_B_newmatrix?
>   • What about *_reuse?

The *newmatrix variants all matter. *_reuse matters significantly less, as MueLu's reuse capability has not been heavily used yet.

@csiefer2 (Member) commented Oct 4, 2016

@mhoemmen: mult_A_B_newmatrix and jacobi_A_B_newmatrix use almost identical local algorithms. The only difference is whether or not you have the Jacobi part. The code is cut and paste because I could not think of a good way to do it all in one routine.
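For what it's worth, one generic way to fold the two into a single routine (a sketch under assumed names, not the actual Tpetra code): template the row kernel on a small policy, where plain multiply contributes $A B$ unscaled, and the Jacobi variant seeds each row of C with the matching row of B and scales the $A B$ contribution by $-\omega D^{-1}$.

    // Sketch of unifying mult_A_B and jacobi_A_B in one templated routine; all
    // names are illustrative, not the real Tpetra code. The policy is the only
    // difference: plain multiply starts each row of C empty with scale 1, while
    // the Jacobi variant seeds it with the matching row of B and scales the
    // A*B contribution by -omega * D^{-1}.
    #include <cstddef>
    #include <vector>

    struct Csr {
      std::vector<std::size_t> rowPtr;
      std::vector<int>         ind;
      std::vector<double>      val;
      int numCols = 0;
    };

    struct PlainMultiply {              // C = A * B
      double scale(int) const { return 1.0; }
      void seedRow(int, std::vector<double>&) const {}
    };

    struct JacobiUpdate {               // C = B - omega * D^{-1} * (A * B)
      const std::vector<double>* Dinv;  // inverse diagonal entries
      double omega;
      const Csr* B;
      double scale(int row) const { return -omega * (*Dinv)[row]; }
      void seedRow(int row, std::vector<double>& accum) const {
        for (std::size_t j = B->rowPtr[row]; j < B->rowPtr[row+1]; ++j)
          accum[B->ind[j]] += B->val[j];
      }
    };

    template <class Policy>
    void rowKernel(int i, const Csr& A, const Csr& B, const Policy& p,
                   std::vector<double>& accum) { // accum sized to B.numCols
      p.seedRow(i, accum);                       // no-op for plain multiply
      const double s = p.scale(i);
      for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i+1]; ++k)
        for (std::size_t j = B.rowPtr[A.ind[k]]; j < B.rowPtr[A.ind[k]+1]; ++j)
          accum[B.ind[j]] += s * A.val[k] * B.val[j];
    }

Whether the extra abstraction beats the cut-and-paste is of course a judgment call.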

@ambrad (Contributor) commented Oct 4, 2016

@aprokop, now that RILUK supports reuse properly and efficiently, does that change the importance of reuse? Or are there other blockers to effective reuse in MueLu?

@aprokop (Contributor) commented Oct 4, 2016

@ambrad I have not looked at the recent RILUK changes, but if, as you say, it properly separates symbolic() and numeric(), that definitely makes things easier. If everything on the path to AdditiveSchwarz with overlapping subdomains and RILUK separates the symbolic() and numeric() phases correctly, I believe a proper reuse implementation in MueLu could be written quickly.

One concern is the overlapping-matrix part, where I think the import happens during the Setup() phase.

@mhoemmen modified the milestones: Tpetra-backlog, Tpetra-threading Nov 2, 2016
@mhoemmen (Contributor Author)

I'm going to close this, because it kind of happened.
