
Tpetra::CrsMatrix::unpackAndCombine: Optimize GPU unpack for few long rows #6603

Closed
mhoemmen opened this issue Jan 17, 2020 · 4 comments
Labels: pkg: Tpetra, system: gpu, type: enhancement


Bug Report

@trilinos/tpetra

Description

@vbrunini reports that Tpetra::CrsMatrix::unpackAndCombine is causing significant load imbalance for Aria's L2 Milestone problem, since shared bulk nodes mean that some MPI processes must unpack nearly dense rows at doExport. The current unpackAndCombine implementation has one level of thread parallelism, over rows to unpack. If there are a few rows with many entries, this is inefficient on GPUs.

Suggested fix

  • Introduce hierarchical parallelism for unpacking rows.
  • Treat long rows in a separate pass from short rows when unpacking. We have row length on both host and device, so we can preprocess on host and launch two separate kernels. Task parallelism per thread block might be another option.

Steps to Reproduce

  1. Start with Tpetra's example / benchmark.
mhoemmen added the "type: bug" label Jan 17, 2020
trilinos-autotester added a commit that referenced this issue Jan 21, 2020
…-rows-part-1

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: Tpetra: Start work on #6603 (optimize CrsMatrix GPU unpack for long rows)
PR Author: mhoemmen
@mhoemmen (Contributor, Author)

We could take either of two possible directions:

  1. Sort the unpacked rows before merging them into the matrix; use sorted-sorted (both input and matrix row are sorted) merge.
  2. Use hierarchical parallelism, over both rows and entries within a row.

I have been sketching out (2), but that approach may not help much without a team sort.
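
For concreteness, here is a minimal sequential sketch of direction (1). It assumes the incoming entries for a row have already been converted to local column indices and sorted; the function and array names are hypothetical, not Tpetra's actual interfaces.

    // Sorted-sorted merge: one pass over the incoming entries and the target
    // row, at most max(E, K) comparisons.  Assumes a fill-complete target
    // graph, so incoming columns should appear in the row's pattern.
    void mergeSortedIntoRow(const int* inCols, const double* inVals, int numIn,
                            const int* rowCols, double* rowVals, int rowLen) {
      int i = 0, k = 0;
      while (i < numIn && k < rowLen) {
        if (inCols[i] == rowCols[k]) {
          rowVals[k] += inVals[i];  // combine (+=) a matching entry
          ++i;  // do not advance k: incoming duplicates of this column still match
        } else if (inCols[i] < rowCols[k]) {
          ++i;  // incoming column missing from the target pattern; skip it
        } else {
          ++k;  // advance within the target row
        }
      }
    }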

@brian-kelley (Contributor)

@mhoemmen Team sort! #6646

mhoemmen added the "pkg: Tpetra", "type: enhancement", and "system: gpu" labels and removed the "type: bug" label Mar 11, 2020
@mhoemmen (Contributor, Author)

Here is a sketch of some possible solutions. At first, I was thinking of the case where the rows mostly have about the same number of entries, but that common length could be very small or very large. For completeness, I'll sketch a solution to that problem first. After that, I'll talk about the case where different rows have widely different numbers of entries.

When we refer to "long rows" in what follows, we mean the number of incoming (received) entries to unpack and combine with a local row of the target CrsMatrix, not the number of entries in the target local row.

In CrsMatrix::unpackAndCombine, incoming data are arranged first by target local row index. This keeps communication volume down; otherwise, the sender would need to send (row, col, val) triples. Incoming data use global column indices, since the target object may not be locally indexed, or may have a different column Map. We need to unpack the data -- we can't just read them in place, since the current pack format does not promise correct alignment, and it's incorrect to access unaligned int64_t, double, etc., especially on GPUs. (The lack of alignment would make a "merge path"-style load-balanced algorithm, as in cuSPARSE's sparse matrix-vector multiply, harder to implement and likely slower, since it could not easily search the incoming data for the starting column index. It also makes sorting the incoming data in place harder. Tpetra::CrsMatrix does have the freedom to overwrite the incoming data, though.)
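
As an aside on the alignment point, here is a hedged sketch of the standard safe way to read a packed byte buffer. The names and layout are illustrative assumptions, not Tpetra's actual pack format.

    #include <cstddef>
    #include <cstring>

    // Copy bytes out of the packed buffer instead of casting the pointer;
    // memcpy has no alignment requirement, while *(const T*)(buf + offset)
    // is undefined behavior when the offset is unaligned.
    template <class T>
    T readPacked(const char* buf, std::size_t offset) {
      T val;
      std::memcpy(&val, buf + offset, sizeof(T));
      return val;
    }
    // In GPU device code, a byte-wise copy loop achieves the same effect.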

We care most about the case where the target CrsMatrix has a fill-complete graph. This means that it is locally indexed, and that the column indices in each row are sorted and merged (there are no duplicate column indices in a row). Here is the natural parallelization:

For each target local row index lclRow:
    For each entry (gblCol, val) to unpack:
        Convert global column index gblCol to local column index lclCol
        Unpack (lclRow, lclCol, val) into target local matrix

It's possible that different MPI processes might send my MPI process entries to sum (+=) into the same local row and column. Thus, we always need to do atomic +=. This is per entry, not per row, so it's not so bad to have long rows.

The only greater-than-constant-time cost in unpacking is searching the current row of the target matrix for the local column index of each entry to unpack. This can and should (and does) use binary search, since the target matrix has sorted rows. However, in general the incoming data for that row could be completely disordered, resulting in about E * log(K) comparisons / target-row entry reads, where E is the number of entries to unpack and K is the number of entries in the target row.
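
A hedged sketch of this natural one-level scheme follows, under the simplifying assumption that the incoming entries are already unpacked into aligned arrays and converted to local column indices (unpackPtr, inCols, and inVals are hypothetical names, not Tpetra's actual data structures):

    #include <Kokkos_Core.hpp>

    // Binary search for a local column index within one sorted matrix row.
    // Returns the offset within the row, or -1 if the column is absent.
    KOKKOS_INLINE_FUNCTION
    int findCol(const int* cols, const int len, const int target) {
      int lo = 0, hi = len;
      while (lo < hi) {
        const int mid = (lo + hi) / 2;
        if (cols[mid] < target) lo = mid + 1; else hi = mid;
      }
      return (lo < len && cols[lo] == target) ? lo : -1;
    }

    // One-level parallelism: one thread per target row; log(K) reads per
    // incoming entry, so about E * log(K) per row.
    void unpackFlat(Kokkos::View<const int*> rowPtr,     // target row offsets
                    Kokkos::View<const int*> colInd,     // sorted local column indices
                    Kokkos::View<double*> values,        // target matrix values
                    Kokkos::View<const int*> unpackPtr,  // incoming-entry offsets per row
                    Kokkos::View<const int*> inCols,     // incoming local column indices
                    Kokkos::View<const double*> inVals,  // incoming values
                    const int numRows) {
      Kokkos::parallel_for("unpack rows", numRows,
        KOKKOS_LAMBDA(const int lclRow) {
          const int rowBeg = rowPtr(lclRow);
          const int rowLen = rowPtr(lclRow + 1) - rowBeg;
          for (int e = unpackPtr(lclRow); e < unpackPtr(lclRow + 1); ++e) {
            const int off = findCol(&colInd(rowBeg), rowLen, inCols(e));
            if (off >= 0)  // atomic: several processes may hit the same entry
              Kokkos::atomic_add(&values(rowBeg + off), inVals(e));
          }
        });
    }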

If we could sort the incoming data for that row, that would require at most one pass over the target row (max(E, K) comparisons / reads). However, it's hard to do that, for the reasons mentioned above (packed data aren't aligned). We could force alignment of the incoming packed data for a small increase in communication volume, or cleverly sort the packed data (careful -- this must be by local column index, but the packed data have global indices), or do a preprocessing phase where we copy out the packed data and sort by rows (but that has comparable parallelization and load imbalance issues). Alternately, we could sort only the entries within a thread team, using team scratch and Brian Kelley's new team sort.
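
Here is a hedged sketch of the team-scratch variant: stage one row's incoming entries in scratch, sort them by column index, and write them back, so the subsequent combine can be a single merge pass. It assumes Kokkos::Experimental::sort_by_key_team from <Kokkos_NestedSort.hpp> (available in newer Kokkos releases; at the time of this issue the equivalent team bitonic sort lived in kokkos-kernels), plus the same hypothetical unpacked layout as above.

    #include <Kokkos_Core.hpp>
    #include <Kokkos_NestedSort.hpp>

    using policy_t = Kokkos::TeamPolicy<>;
    using member_t = policy_t::member_type;
    using scratch_space = Kokkos::DefaultExecutionSpace::scratch_memory_space;
    template <class T>
    using scratch_view =
      Kokkos::View<T*, scratch_space, Kokkos::MemoryTraits<Kokkos::MemoryUnmanaged>>;

    // One team per row: sort that row's incoming entries by column index
    // in team scratch memory.
    void sortIncomingPerRow(Kokkos::View<const int*> unpackPtr,
                            Kokkos::View<int*> inCols,
                            Kokkos::View<double*> inVals,
                            const int numRows, const int maxRowLen) {
      const size_t bytes = scratch_view<int>::shmem_size(maxRowLen) +
                           scratch_view<double>::shmem_size(maxRowLen);
      Kokkos::parallel_for("sort incoming rows",
        policy_t(numRows, Kokkos::AUTO).set_scratch_size(0, Kokkos::PerTeam(bytes)),
        KOKKOS_LAMBDA(const member_t& team) {
          const int row = team.league_rank();
          const int beg = unpackPtr(row);
          const int len = unpackPtr(row + 1) - beg;
          scratch_view<int>    cols(team.team_scratch(0), len);
          scratch_view<double> vals(team.team_scratch(0), len);
          Kokkos::parallel_for(Kokkos::TeamThreadRange(team, len), [&](const int e) {
            cols(e) = inCols(beg + e);  // stage this row's entries in scratch
            vals(e) = inVals(beg + e);
          });
          team.team_barrier();
          Kokkos::Experimental::sort_by_key_team(team, cols, vals);
          team.team_barrier();
          Kokkos::parallel_for(Kokkos::TeamThreadRange(team, len), [&](const int e) {
            inCols(beg + e) = cols(e);  // write back, now sorted by column
            inVals(beg + e) = vals(e);
          });
        });
    }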

We can imagine 4 different parallelization schemes, one for each regime of average number of entries per row to unpack.

  1. Short rows: one-level Kokkos::parallel_for over rows to unpack. (If the rows are too short, then it doesn't pay to parallelize over entries within a row.)

  2. Medium-length rows: three-level Kokkos::parallel_for: over blocks of rows to unpack, then over rows within a block, then over entries within a row. (This is how Christian's sparse matrix-vector multiply in kokkos-kernels works.) If we wanted to sort, it would need to happen at the innermost level of parallelization, so we would need a "vector sort." There may be good primitives we could use for that.

  3. Long rows: two-level Kokkos::parallel_for: over rows, then over entries within a row. It would make sense to use team scratch and team sort at the inner level of parallelization. (A sketch follows this list.)

  4. Very long rows: three-level Kokkos::parallel_for: over rows, then over blocks of entries within a row, then over entries within a block. This would only be useful for a very small number of rows to unpack.
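
As a hedged illustration of scheme (3), reusing the hypothetical layout and the findCol helper from the flat sketch above: one team per long row, with the team's threads spread over that row's incoming entries.

    // Two-level parallelism for long rows: over rows (league), then over
    // entries within a row (team).  longRows holds the indices of the rows
    // classified as long; classification happens on host (see below).
    void unpackLongRows(Kokkos::View<const int*> rowPtr,
                        Kokkos::View<const int*> colInd,
                        Kokkos::View<double*> values,
                        Kokkos::View<const int*> unpackPtr,
                        Kokkos::View<const int*> inCols,
                        Kokkos::View<const double*> inVals,
                        Kokkos::View<const int*> longRows,
                        const int numLongRows) {
      Kokkos::parallel_for("unpack long rows",
        Kokkos::TeamPolicy<>(numLongRows, Kokkos::AUTO),
        KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type& team) {
          const int lclRow = longRows(team.league_rank());
          const int rowBeg = rowPtr(lclRow);
          const int rowLen = rowPtr(lclRow + 1) - rowBeg;
          const int beg = unpackPtr(lclRow);
          const int numIn = unpackPtr(lclRow + 1) - beg;
          Kokkos::parallel_for(Kokkos::TeamThreadRange(team, numIn), [&](const int e) {
            const int off = findCol(&colInd(rowBeg), rowLen, inCols(beg + e));
            if (off >= 0)
              Kokkos::atomic_add(&values(rowBeg + off), inVals(beg + e));
          });
        });
    }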

@mhoemmen (Contributor, Author)

For the case where there are a small number of rows with many entries to unpack, and some number of rows with fewer entries to unpack, the only thing we can do in the current Kokkos programming model is to split those into two kernel launches.
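
A hedged sketch of that split follows, with a made-up threshold; the per-row counts of entries to unpack are already available on host, so the classification costs one host pass.

    #include <vector>

    // Partition rows by the number of incoming entries, then launch the flat
    // kernel (scheme 1) on the short rows and the team kernel (scheme 3) on
    // the long rows.  unpackFlat here would take a row-index list just as
    // unpackLongRows does; the earlier sketch omitted that for brevity.
    void unpackTwoPass(const std::vector<int>& numToUnpack,  // host copy, per row
                       const int threshold) {                // e.g. 64 entries
      std::vector<int> shortRows, longRows;
      for (int r = 0; r < static_cast<int>(numToUnpack.size()); ++r)
        (numToUnpack[r] <= threshold ? shortRows : longRows).push_back(r);
      // Copy both index lists to device, then:
      //   unpackFlat(..., shortRowsDevice, ...);    // one thread per short row
      //   unpackLongRows(..., longRowsDevice, ...); // one team per long row
    }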
