Tpetra: Isolate sparse matrix-matrix multiply local kernel #148
MueLu sometimes actually wants to compute the Jacobi update.
We need both Jacobi & SpMM. They're basically the same kernel if you're smart about it.
Andrey just explained to me that muelu/src/Misc/MueLu_RAPFactory_def.hpp uses Multiply, so yes :)
A*P uses mult_A_B_newmatrix with B = P.
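For readers unfamiliar with the kernel, here is a minimal serial sketch of the kind of row-wise local loop such a multiply performs. All names and types below are illustrative, not Tpetra's actual internals; the real kernel works on Tpetra's local CRS arrays and uses a c_status-style accumulator rather than `std::map`.

```cpp
// Illustrative serial sketch of the local part of C = A*P (Gustavson-style).
#include <map>
#include <vector>

struct LocalCrs {
  std::vector<int>    rowptr;  // size numRows+1
  std::vector<int>    colind;  // column indices, packed by row
  std::vector<double> values;  // values, packed by row
};

LocalCrs localMultiply(const LocalCrs& A, const LocalCrs& P) {
  const int numRows = static_cast<int>(A.rowptr.size()) - 1;
  LocalCrs C;
  C.rowptr.push_back(0);
  for (int i = 0; i < numRows; ++i) {
    std::map<int, double> acc;  // accumulator for row i of C
    for (int jj = A.rowptr[i]; jj < A.rowptr[i + 1]; ++jj) {
      const int    k    = A.colind[jj];
      const double a_ik = A.values[jj];
      // Scatter a_ik * (row k of P) into the accumulator.
      for (int pp = P.rowptr[k]; pp < P.rowptr[k + 1]; ++pp) {
        acc[P.colind[pp]] += a_ik * P.values[pp];
      }
    }
    for (const auto& kv : acc) {
      C.colind.push_back(kv.first);
      C.values.push_back(kv.second);
    }
    C.rowptr.push_back(static_cast<int>(C.colind.size()));
  }
  return C;
}
```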
I'm looking at mult_A_B_newmatrix now. c_status is just something used internally. There are additional data structures that map between local indices of A, and local indices of B_local and B_remote (see above).

The output matrix C of mult_A_B_newmatrix gets its column Map and Import as the union ("setUnion") of the column Maps and Imports, respectively, of B_local and B_remote.
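A rough sketch of the bookkeeping being described, under my own reading of the thread (the struct and field names are made up, not Tpetra's):

```cpp
// Hypothetical sketch of the index bookkeeping.  B is split into rows owned
// locally (B_local) and rows brought in by the Import (B_remote); each piece
// has its own local column numbering.  Two lookup arrays translate those
// numberings into the column Map of C, which is the union ("setUnion") of
// the column Maps / Imports of B_local and B_remote.
#include <vector>

struct LocalSpGemmWorkspace {
  // Local column index of B_local  -> local column index of C.
  std::vector<int> Bcol2Ccol;
  // Local column index of B_remote -> local column index of C.
  std::vector<int> Icol2Ccol;
  // c_status-style scratch: for each column of C, the position in the
  // current row of C where that column's partial sum lives (or "unset").
  std::vector<int> c_status;
};
```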
I just want to document that I tried a slightly different variant to speed things up, but was not successful. The thing I tried was to accumulate the products and then merge them together: https://gist.github.com/aprokop/05ba603a934c7a96ac80. I would like us to discuss what kind of improvements we may want to consider, and what the path forward is for multithreaded MMM.
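For readers without access to the gist, my rough reading of the "accumulate products, then merge" idea is something like the following; this is my own reconstruction, not the gist's actual code.

```cpp
// Sketch of an accumulate-then-merge row kernel: instead of scattering into
// an accumulator as products are formed, push every partial product
// (column, value) into a buffer, then sort by column and merge duplicates.
#include <algorithm>
#include <utility>
#include <vector>

void mergeRow(std::vector<std::pair<int, double>>& products,  // all a_ik * b_kj for one row
              std::vector<int>& cols, std::vector<double>& vals) {
  std::sort(products.begin(), products.end(),
            [](const auto& x, const auto& y) { return x.first < y.first; });
  for (const auto& p : products) {
    if (!cols.empty() && cols.back() == p.first) {
      vals.back() += p.second;   // same column: merge into the previous entry
    } else {
      cols.push_back(p.first);   // new column entry in this row of C
      vals.push_back(p.second);
    }
  }
}
```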
Mehmet and I have an algorithm worked out that we expect to be "performance portable". Mehmet is implementing that. We are expecting Chris & Chris to take care of the distributed-memory stuff.
During the dungeon I wrote a quick serial wrapper for the Kokkos experimental graph kernels in MMM. While I no longer have the numbers, the KK1 version ran many times slower than the existing implementation on a single thread.
Not the existing one; that is a preliminary implementation we started with so we have something there. The new one will have two steps, symbolic and numeric. You want it faster than we can develop it. Is there some urgency on this? We were planning for late April or early May. If someone wants to isolate the local kernel before then, that is OK. But the threaded kernel on all three platforms that we care about will take some work.
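To make the "two steps" concrete, here is a hedged sketch of what a symbolic/numeric split typically looks like for SpGEMM; the names are illustrative and are not the planned KokkosKernels interface.

```cpp
// Illustrative two-phase SpGEMM interface.  The symbolic phase computes the
// row pointer (sparsity pattern) of C = A*B once; the numeric phase fills in
// column indices and values, and can be re-run when only the values of A or
// B change while the pattern stays the same.
#include <vector>

struct Crs {
  std::vector<int>    rowptr;
  std::vector<int>    colind;
  std::vector<double> values;
};

// Phase 1: count nonzeros per row of C and build C.rowptr.
void spgemmSymbolic(const Crs& A, const Crs& B, Crs& C);

// Phase 2: fill C.colind and C.values using the pattern from the symbolic
// phase; safe to call repeatedly with new values but the same pattern.
void spgemmNumeric(const Crs& A, const Crs& B, Crs& C);
```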
No particular urgency, just a general desire to see faster MMM.
We can explain our current thinking in person with a board. The primary trouble is getting it to work on GPUs.
@aprokop |
I tried SPGEMM_MKL and SPGEMM_KK1. KK1 was the slow one, I believe.
KK1 is incomplete; it does not have multiplication. It shouldn't produce any result.
I did not check what it did; I just checked the runtime when running an example in Tpetra.
@trilinos/tpetra This will contribute to #148.
The *_newmatrix variants all matter. The *_reuse variants matter significantly less, as MueLu reuse has not been heavily used yet.
@mhoemmen: mult_A_B_newmatrix and jacobi_A_B_newmatrix use almost identical local algorithms. The only difference is whether or not you have the Jacobi part. The code is copied and pasted because I could not think of a good way to do it all in one routine.
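One way the two kernels could share a single row routine is sketched below. This is my illustration of the "same kernel" observation, not Tpetra's actual code: a compile-time flag selects whether row i of C is seeded with row i of B (the Jacobi term, C = B - omega * Dinv * A * B) before accumulating the A*B products.

```cpp
// Hypothetical fused row kernel covering both the plain product and the
// Jacobi update.  Names and types are illustrative only.
#include <map>

struct LocalCrsView {
  const int*    rowptr;
  const int*    colind;
  const double* values;
};

template <bool DoJacobi>
void productRow(int i, double omega, const double* Dinv,
                const LocalCrsView& A, const LocalCrsView& B,
                std::map<int, double>& acc) {
  // Jacobi: C(i,:) = B(i,:) - omega * Dinv(i) * (A(i,:) * B)
  // SpMM:   C(i,:) =                            A(i,:) * B
  const double scale = DoJacobi ? -omega * Dinv[i] : 1.0;
  if (DoJacobi) {
    for (int jj = B.rowptr[i]; jj < B.rowptr[i + 1]; ++jj)
      acc[B.colind[jj]] += B.values[jj];           // seed with row i of B
  }
  for (int jj = A.rowptr[i]; jj < A.rowptr[i + 1]; ++jj) {
    const int    k    = A.colind[jj];
    const double a_ik = scale * A.values[jj];
    for (int pp = B.rowptr[k]; pp < B.rowptr[k + 1]; ++pp)
      acc[B.colind[pp]] += a_ik * B.values[pp];
  }
}
```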
@aprokop, now that RILUK supports reuse properly and efficiently, does that change the importance of reuse? Or are there other blockers to effective reuse in MueLu?
@ambrad I have not looked at the recent RILUK changes, but if, as you say, it does the proper separation of symbolic() and numeric(), it definitely makes things easier. If everything on the path to doing AdditiveSchwarz with overlapping subdomains and RILUK correctly separates the symbolic() and numeric() phases, I believe a proper reuse implementation in MueLu can be written quickly. One of my concerns is the overlapping-matrix part, where I think the import is done during the Setup() phase.
I'm going to close this, because it kind of happened.
@trilinos/tpetra @mndevec @srajama1 @csiefer2
In Tpetra, we need to isolate the "local" part of the sparse matrix-matrix multiply kernel from the "global" (MPI communication) part. This will facilitate thread parallelization and the use of third-party libraries' implementations of the local kernel.
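As a hedged sketch of what such a separation might look like at the interface level (the names here are illustrative; they are not a proposal for the actual Tpetra API):

```cpp
// Illustrative split: the distributed routine handles Maps, Imports, and the
// communication of remote rows of B, then hands plain local CRS data to a
// local kernel.  Only the local kernel needs to be threaded or replaced by a
// third-party SpGEMM implementation.
struct LocalCrsArrays {
  const int*    rowptr;
  const int*    colind;
  const double* values;
  int           numRows;
};

// Local kernel: no MPI, no Maps; a candidate for Kokkos / TPL back-ends.
void localSpGemm(const LocalCrsArrays& A,
                 const LocalCrsArrays& B_local,
                 const LocalCrsArrays& B_remote,
                 LocalCrsArrays&       C /* output */);

// Distributed driver (outline in comments):
//   1. Import the needed remote rows of B into B_remote.
//   2. Build the column Map of C (union of B_local's and B_remote's).
//   3. Call localSpGemm on the purely local data.
//   4. Wrap the result in a Tpetra::CrsMatrix with the new Maps.
```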