# Tpetra::CrsMatrix::unpackAndCombine: Optimize GPU unpack for few long rows (#6603)
PR `…-rows-part-1` was automatically merged using the Trilinos Pull Request AutoTester. PR title: "Tpetra: Start work on #6603 (optimize CrsMatrix GPU unpack for long rows)". PR author: mhoemmen.
We could take either of two possible directions. I have been sketching out (2), but that might not help much if it doesn't have a team sort.
Here is a sketch of some possible solutions. At first, I was thinking of the case where the rows mostly have about the same number of entries, but that number could be very small or very large. For completeness, I'll describe a sketch of the solution to that problem first. After that, I'll talk about the case where different rows have widely different numbers of entries.

When we refer to "long rows" in what follows, we mean rows with many incoming (received) entries to unpack and combine with a local row of the target CrsMatrix, not rows with many entries in the target local row.

We care most about the case where the target CrsMatrix has a fill-complete graph. This means that it is locally indexed, and that the column indices in each row are sorted and merged (there are no duplicate column indices in a row). The natural parallelization is across rows to unpack and, within each row, across that row's incoming entries (e.g., one thread team per row).
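As a concrete illustration, here is a serial C++ analogue of that natural hierarchical scheme. All names here are hypothetical, not the actual Tpetra code: on a GPU, the outer loop over rows would map to one thread team per row, the inner loop over a row's incoming entries to the threads within that team, and the `+=` would become an atomic update.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Binary search for a column index within one sorted target row.
// Returns the position in [rowBegin, rowEnd), or rowEnd if absent.
inline std::size_t findColumn(const std::vector<int>& targetCols,
                              std::size_t rowBegin, std::size_t rowEnd,
                              int col) {
  std::size_t lo = rowBegin, hi = rowEnd;
  while (lo < hi) {
    const std::size_t mid = lo + (hi - lo) / 2;
    if (targetCols[mid] < col) lo = mid + 1; else hi = mid;
  }
  return (lo < rowEnd && targetCols[lo] == col) ? lo : rowEnd;
}

// Hypothetical unpack-and-combine (ADD combine mode) into a target
// matrix stored in CRS form (rowPtr, cols, vals) with sorted, merged
// rows.  Incoming data is given per target row as (inCols, inVals).
inline void unpackAndCombineAdd(
    const std::vector<std::size_t>& rowPtr,
    const std::vector<int>& cols,
    std::vector<double>& vals,
    const std::vector<std::vector<int>>& inCols,
    const std::vector<std::vector<double>>& inVals) {
  // Outer loop: one row per "team" in the parallel version.
  for (std::size_t r = 0; r < inCols.size(); ++r) {
    // Inner loop: team members would handle these entries in parallel.
    for (std::size_t k = 0; k < inCols[r].size(); ++k) {
      const std::size_t pos =
          findColumn(cols, rowPtr[r], rowPtr[r + 1], inCols[r][k]);
      if (pos != rowPtr[r + 1]) {
        vals[pos] += inVals[r][k];  // atomic add in the parallel version
      }
    }
  }
}
```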
It's possible that different MPI processes might send my MPI process entries to sum into the same target entry. The only greater-than-constant-time cost in unpacking is searching the current row of the target matrix for the local column index of each entry to unpack. This can, should, and does use binary search, since the target matrix has sorted rows. However, in general the incoming data for that row could be completely disordered, so each incoming entry needs its own binary search into the target row. If we could sort the incoming data for that row, combining would require at most one pass over the target row (a linear merge). We can imagine 4 different parallelization schemes, one for each of several regimes of average number of entries per row to unpack.
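A minimal sketch of that sort-then-merge idea for a single row follows (names are invented for illustration, not the Tpetra implementation): once the incoming (column, value) pairs are sorted by column, combining them into the sorted target row takes one linear pass instead of one binary search per entry.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Combine incoming (column, value) pairs into one sorted, merged
// target row using ADD combine mode.  After sorting the incoming
// pairs, the target row is traversed exactly once (a merge), so the
// per-row cost is one sort plus a single linear pass.
inline void combineRowByMerge(
    const std::vector<int>& targetCols,  // sorted, merged column indices
    std::vector<double>& targetVals,     // matching values
    std::vector<std::pair<int, double>>& incoming) {
  std::sort(incoming.begin(), incoming.end());  // on GPU: a team-level sort
  std::size_t t = 0;  // current position in the target row
  for (const auto& [col, val] : incoming) {
    while (t < targetCols.size() && targetCols[t] < col) ++t;  // merge step
    if (t < targetCols.size() && targetCols[t] == col) {
      targetVals[t] += val;  // ADD combine mode
    }
  }
}
```

The design trade-off the comment describes is visible here: without the sort, the `while` advance would have to restart from an arbitrary position for every entry, which is why the unsorted case falls back to per-entry binary search.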
For the case where there is a small number of rows with many entries to unpack, and some number of rows with fewer entries to unpack, the only thing we can do in the current Kokkos programming model is to split those into two kernel launches.
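For illustration, a small sketch of that split (the threshold value and all names are invented): rows are partitioned by incoming-entry count, and each batch would then get its own kernel launch with an appropriate policy.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Partition row indices into two batches by incoming-entry count.
// In a Kokkos-based implementation, the "short" batch would be one
// flat launch (one thread per row) and the "long" batch a second,
// hierarchical launch (one thread team per row).
inline std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
splitRowsByLength(const std::vector<std::size_t>& numEntriesPerRow,
                  std::size_t longRowThreshold) {
  std::vector<std::size_t> shortRows, longRows;
  for (std::size_t r = 0; r < numEntriesPerRow.size(); ++r) {
    (numEntriesPerRow[r] > longRowThreshold ? longRows : shortRows)
        .push_back(r);
  }
  return {shortRows, longRows};
}
```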
## Bug Report

@trilinos/tpetra

### Description

@vbrunini reports that `Tpetra::CrsMatrix::unpackAndCombine` is causing significant load imbalance for Aria's L2 Milestone problem, since shared bulk nodes mean that some MPI processes must unpack nearly dense rows at `doExport`. The current `unpackAndCombine` implementation has one level of thread parallelism, over rows to unpack. If there are a few rows with many entries, this is inefficient on GPUs.

### Suggested fix
### Steps to Reproduce