
Tpetra::CrsMatrix::unpackAndCombine: Optimize GPU unpack for few long rows #6603

Closed
mhoemmen opened this issue Jan 17, 2020 · 4 comments
Labels: pkg: Tpetra, system: gpu, type: enhancement


Bug Report

@trilinos/tpetra

Description

@vbrunini reports that Tpetra::CrsMatrix::unpackAndCombine is causing significant load imbalance for Aria's L2 Milestone problem, since shared bulk nodes mean that some MPI processes must unpack nearly dense rows at doExport. The current unpackAndCombine implementation has one level of thread parallelism, over rows to unpack. If there are a few rows with many entries, this is inefficient on GPUs.

Suggested fix

  • Introduce hierarchical parallelism for unpacking rows.
  • Treat long rows in a separate pass from short rows when unpacking. We have row length on both host and device, so we can preprocess on host and launch two separate kernels. Task parallelism per thread block might be another option.

Steps to Reproduce

  1. Start with Tpetra's example / benchmark.
mhoemmen added the "type: bug" label Jan 17, 2020
trilinos-autotester added a commit that referenced this issue Jan 21, 2020
…-rows-part-1

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Tpetra: Start work on #6603 (optimize CrsMatrix GPU unpack for long rows)
PR Author: mhoemmen
@mhoemmen (Contributor, Author)

We could take either of two possible directions:

  1. Sort the unpacked rows before merging them into the matrix; use sorted-sorted (both input and matrix row are sorted) merge.
  2. Use hierarchical parallelism, over both rows and entries within a row.

I have been sketching out (2), but that approach may not help much without a team sort.
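
For concreteness, here is a minimal sequential sketch of direction (1). It assumes the incoming entries for a row have already been converted to local column indices and sorted; the function and array names are hypothetical, not Tpetra's actual interfaces.

    // Sorted-sorted merge: one pass over the incoming entries and the target
    // row, at most max(E, K) comparisons.  Assumes a fill-complete target
    // graph, so incoming columns should appear in the row's pattern.
    void mergeSortedIntoRow(const int* inCols, const double* inVals, int numIn,
                            const int* rowCols, double* rowVals, int rowLen) {
      int i = 0, k = 0;
      while (i < numIn && k < rowLen) {
        if (inCols[i] == rowCols[k]) {
          rowVals[k] += inVals[i];  // combine (+=) a matching entry
          ++i;  // do not advance k: incoming duplicates of this column still match
        } else if (inCols[i] < rowCols[k]) {
          ++i;  // incoming column missing from the target pattern; skip it
        } else {
          ++k;  // advance within the target row
        }
      }
    }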

@brian-kelley (Contributor)

@mhoemmen Team sort! #6646

mhoemmen added the "pkg: Tpetra", "type: enhancement", and "system: gpu" labels and removed the "type: bug" label Mar 11, 2020
@mhoemmen (Contributor, Author)

Here is a sketch of some possible solutions. At first, I was thinking of the case where the rows mostly have about the same number of entries, but that common length could be very small or very large. For completeness, I'll sketch a solution to that problem first. After that, I'll talk about the case where different rows have widely different numbers of entries.

When we refer to "long rows" in what follows, we mean the number of incoming (received) entries to unpack and combine with a local row of the target CrsMatrix, not the number of entries in the target local row.

In CrsMatrix::unpackAndCombine, incoming data are arranged first by target local row index. This keeps communication volume down; otherwise, the sender would need to send (row, col, val) triples. Incoming data use global column indices, since the target object may not be locally indexed, or may have a different column Map. We need to unpack the data -- we can't just read them in place, since the current pack format does not promise correct alignment, and it's incorrect to access unaligned int64_t, double, etc., especially on GPUs. (The lack of alignment would make a "merge path"-style load-balanced algorithm, as in cuSPARSE's sparse matrix-vector multiply, harder to implement and likely slower, since it could not easily search the incoming data for the starting column index. It also makes sorting the incoming data in place harder. Tpetra::CrsMatrix does have the freedom to overwrite the incoming data, though.)
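
As an aside on the alignment point, here is a hedged sketch of the standard safe way to read a packed byte buffer. The names and layout are illustrative assumptions, not Tpetra's actual pack format.

    #include <cstddef>
    #include <cstring>

    // Copy bytes out of the packed buffer instead of casting the pointer;
    // memcpy has no alignment requirement, while *(const T*)(buf + offset)
    // is undefined behavior when the offset is unaligned.
    template <class T>
    T readPacked(const char* buf, std::size_t offset) {
      T val;
      std::memcpy(&val, buf + offset, sizeof(T));
      return val;
    }
    // In GPU device code, a byte-wise copy loop achieves the same effect.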

We care most about the case where the target CrsMatrix has a fill-complete graph. This means that it is locally indexed, and that the column indices in each row are sorted and merged (there are no duplicate column indices in a row). Here is the natural parallelization:

For each target local row index lclRow:
    For each entry (gblCol, val) to unpack:
        Convert global column index gblCol to local column index lclCol
        Unpack (lclRow, lclCol, val) into target local matrix

It's possible that different MPI processes might send my MPI process entries to sum (+=) into the same local row and column. Thus, we always need to do atomic +=. This is per entry, not per row, so it's not so bad to have long rows.

The only greater-than-constant-time cost in unpacking is searching the current row of the target matrix for the local column index of each entry to unpack. This can and should (and does) use binary search, since the target matrix has sorted rows. However, in general the incoming data for that row could be completely disordered, resulting in about E * log(K) comparisons / target-row entry reads, where E is the number of entries to unpack and K is the number of entries in the target row.
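
A hedged sketch of this natural one-level scheme follows, under the simplifying assumption that the incoming entries are already unpacked into aligned arrays and converted to local column indices (unpackPtr, inCols, and inVals are hypothetical names, not Tpetra's actual data structures):

    #include <Kokkos_Core.hpp>

    // Binary search for a local column index within one sorted matrix row.
    // Returns the offset within the row, or -1 if the column is absent.
    KOKKOS_INLINE_FUNCTION
    int findCol(const int* cols, const int len, const int target) {
      int lo = 0, hi = len;
      while (lo < hi) {
        const int mid = (lo + hi) / 2;
        if (cols[mid] < target) lo = mid + 1; else hi = mid;
      }
      return (lo < len && cols[lo] == target) ? lo : -1;
    }

    // One-level parallelism: one thread per target row; log(K) reads per
    // incoming entry, so about E * log(K) per row.
    void unpackFlat(Kokkos::View<const int*> rowPtr,     // target row offsets
                    Kokkos::View<const int*> colInd,     // sorted local column indices
                    Kokkos::View<double*> values,        // target matrix values
                    Kokkos::View<const int*> unpackPtr,  // incoming-entry offsets per row
                    Kokkos::View<const int*> inCols,     // incoming local column indices
                    Kokkos::View<const double*> inVals,  // incoming values
                    const int numRows) {
      Kokkos::parallel_for("unpack rows", numRows,
        KOKKOS_LAMBDA(const int lclRow) {
          const int rowBeg = rowPtr(lclRow);
          const int rowLen = rowPtr(lclRow + 1) - rowBeg;
          for (int e = unpackPtr(lclRow); e < unpackPtr(lclRow + 1); ++e) {
            const int off = findCol(&colInd(rowBeg), rowLen, inCols(e));
            if (off >= 0)  // atomic: several processes may hit the same entry
              Kokkos::atomic_add(&values(rowBeg + off), inVals(e));
          }
        });
    }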

If we could sort the incoming data for that row, that would require at most one pass over the target row (max(E, K) comparisons / reads). However, it's hard to do that, for the reasons mentioned above (packed data aren't aligned). We could force alignment of the incoming packed data for a small increase in communication volume, or cleverly sort the packed data (careful -- this must be by local column index, but the packed data have global indices), or do a preprocessing phase where we copy out the packed data and sort by rows (but that has comparable parallelization and load imbalance issues). Alternately, we could sort only the entries within a thread team, using team scratch and Brian Kelley's new team sort.
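
Here is a hedged sketch of the team-scratch variant: stage one row's incoming entries in scratch, sort them by column index, and write them back, so the subsequent combine can be a single merge pass. It assumes Kokkos::Experimental::sort_by_key_team from <Kokkos_NestedSort.hpp> (available in newer Kokkos releases; at the time of this issue the equivalent team bitonic sort lived in kokkos-kernels), plus the same hypothetical unpacked layout as above.

    #include <Kokkos_Core.hpp>
    #include <Kokkos_NestedSort.hpp>

    using policy_t = Kokkos::TeamPolicy<>;
    using member_t = policy_t::member_type;
    using scratch_space = Kokkos::DefaultExecutionSpace::scratch_memory_space;
    template <class T>
    using scratch_view =
      Kokkos::View<T*, scratch_space, Kokkos::MemoryTraits<Kokkos::MemoryUnmanaged>>;

    // One team per row: sort that row's incoming entries by column index
    // in team scratch memory.
    void sortIncomingPerRow(Kokkos::View<const int*> unpackPtr,
                            Kokkos::View<int*> inCols,
                            Kokkos::View<double*> inVals,
                            const int numRows, const int maxRowLen) {
      const size_t bytes = scratch_view<int>::shmem_size(maxRowLen) +
                           scratch_view<double>::shmem_size(maxRowLen);
      Kokkos::parallel_for("sort incoming rows",
        policy_t(numRows, Kokkos::AUTO).set_scratch_size(0, Kokkos::PerTeam(bytes)),
        KOKKOS_LAMBDA(const member_t& team) {
          const int row = team.league_rank();
          const int beg = unpackPtr(row);
          const int len = unpackPtr(row + 1) - beg;
          scratch_view<int>    cols(team.team_scratch(0), len);
          scratch_view<double> vals(team.team_scratch(0), len);
          Kokkos::parallel_for(Kokkos::TeamThreadRange(team, len), [&](const int e) {
            cols(e) = inCols(beg + e);  // stage this row's entries in scratch
            vals(e) = inVals(beg + e);
          });
          team.team_barrier();
          Kokkos::Experimental::sort_by_key_team(team, cols, vals);
          team.team_barrier();
          Kokkos::parallel_for(Kokkos::TeamThreadRange(team, len), [&](const int e) {
            inCols(beg + e) = cols(e);  // write back, now sorted by column
            inVals(beg + e) = vals(e);
          });
        });
    }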

We can imagine 4 different parallelization schemes, one for each regime of average number of entries per row to unpack.

  1. Short rows: one-level Kokkos::parallel_for over rows to unpack. (If the rows are too short, then it doesn't pay to parallelize over entries within a row.)

  2. Medium-length rows: three-level Kokkos::parallel_for: over blocks of rows to unpack, then over rows within a block, then over entries within a row. (This is how Christian's sparse matrix-vector multiply in kokkos-kernels works.) If we wanted to sort, it would need to happen at the innermost level of parallelization, so we would need a "vector sort." There may be good primitives we could use for that.

  3. Long rows: two-level Kokkos::parallel_for: over rows, then over entries within a row. It would make sense to use team scratch and team sort at the inner level of parallelization. (A sketch follows this list.)

  4. Very long rows: three-level Kokkos::parallel_for: over rows, then over blocks of entries within a row, then over entries within a block. This would only be useful for a very small number of rows to unpack.
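
As a hedged illustration of scheme (3), reusing the hypothetical layout and the findCol helper from the flat sketch above: one team per long row, with the team's threads spread over that row's incoming entries.

    // Two-level parallelism for long rows: over rows (league), then over
    // entries within a row (team).  longRows holds the indices of the rows
    // classified as long; classification happens on host (see below).
    void unpackLongRows(Kokkos::View<const int*> rowPtr,
                        Kokkos::View<const int*> colInd,
                        Kokkos::View<double*> values,
                        Kokkos::View<const int*> unpackPtr,
                        Kokkos::View<const int*> inCols,
                        Kokkos::View<const double*> inVals,
                        Kokkos::View<const int*> longRows,
                        const int numLongRows) {
      Kokkos::parallel_for("unpack long rows",
        Kokkos::TeamPolicy<>(numLongRows, Kokkos::AUTO),
        KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type& team) {
          const int lclRow = longRows(team.league_rank());
          const int rowBeg = rowPtr(lclRow);
          const int rowLen = rowPtr(lclRow + 1) - rowBeg;
          const int beg = unpackPtr(lclRow);
          const int numIn = unpackPtr(lclRow + 1) - beg;
          Kokkos::parallel_for(Kokkos::TeamThreadRange(team, numIn), [&](const int e) {
            const int off = findCol(&colInd(rowBeg), rowLen, inCols(beg + e));
            if (off >= 0)
              Kokkos::atomic_add(&values(rowBeg + off), inVals(beg + e));
          });
        });
    }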

@mhoemmen (Contributor, Author)

For the case where there are a small number of rows with many entries to unpack, and some number of rows with fewer entries to unpack, the only thing we can do in the current Kokkos programming model is to split those into two kernel launches.
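
A hedged sketch of that split follows, with a made-up threshold; the per-row counts of entries to unpack are already available on host, so the classification costs one host pass.

    #include <vector>

    // Partition rows by the number of incoming entries, then launch the flat
    // kernel (scheme 1) on the short rows and the team kernel (scheme 3) on
    // the long rows.  unpackFlat here would take a row-index list just as
    // unpackLongRows does; the earlier sketch omitted that for brevity.
    void unpackTwoPass(const std::vector<int>& numToUnpack,  // host copy, per row
                       const int threshold) {                // e.g. 64 entries
      std::vector<int> shortRows, longRows;
      for (int r = 0; r < static_cast<int>(numToUnpack.size()); ++r)
        (numToUnpack[r] <= threshold ? shortRows : longRows).push_back(r);
      // Copy both index lists to device, then:
      //   unpackFlat(..., shortRowsDevice, ...);    // one thread per short row
      //   unpackLongRows(..., longRowsDevice, ...); // one team per long row
    }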
