jhux2 changed the title from "Multithreaded Symmetric Gauss-Seidel performance for matrices with a few dense rows" to "Multithreaded symmetric Gauss-Seidel performance for matrices with a few dense rows" on Mar 30, 2021
I’ve been testing multithreaded symmetric Gauss-Seidel (MTSGS) through Ifpack2 on some application matrices on Vortex. These matrices have ~699K rows, and most of the row stencil sizes are 50 or less, but there are some non-sparse rows (called “bulk rows”). Here are the nonzero counts by row, sorted largest first:
42254 31980 5088 237 48 47 …
Those first four bulk rows appear to be hurting the performance of MTSGS. I did some experiments to see what effect the bulk rows have. In each experiment, the linear system is solved 10 times with GMRES preconditioned by 3 MTSGS sweeps. There are MPI barriers before/after calls to the MTSGS kernel, as well as timers for the barriers themselves. What I found is that removing the bulk rows yields about a 13x speedup in KokkosSparse::Experimental::symmetric_gauss_seidel_apply.
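For reference, here's a minimal sketch of how an MTSGS preconditioner like this can be set up through Ifpack2. The type aliases, the helper name makeMTSGS, and the way the matrix A is obtained are assumptions for illustration; the parameter names are the ones Ifpack2's Relaxation class uses for its multithreaded Gauss-Seidel variants.

```cpp
// Minimal sketch (not the application's actual setup code): build an Ifpack2
// RELAXATION preconditioner that dispatches to the multithreaded symmetric
// Gauss-Seidel (MTSGS) kernel in KokkosKernels, with 3 sweeps per apply.
#include <Ifpack2_Factory.hpp>
#include <Teuchos_ParameterList.hpp>
#include <Teuchos_RCP.hpp>
#include <Tpetra_RowMatrix.hpp>

using SC              = double;
using row_matrix_type = Tpetra::RowMatrix<SC>;  // assumes default LO/GO/Node

Teuchos::RCP<Ifpack2::Preconditioner<SC>>
makeMTSGS (const Teuchos::RCP<const row_matrix_type>& A)
{
  Ifpack2::Factory factory;
  auto prec = factory.create ("RELAXATION", A);

  Teuchos::ParameterList params;
  params.set ("relaxation: type",   "MT Symmetric Gauss-Seidel"); // multithreaded SGS
  params.set ("relaxation: sweeps", 3);                           // 3 sweeps per apply
  prec->setParameters (params);

  prec->initialize ();
  prec->compute ();
  return prec;
}
```

The barrier/timer instrumentation mentioned above would then wrap each call into the MTSGS kernel with an MPI barrier on either side, with separate timers on the barriers so their cost can be subtracted from the kernel time.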
Here's a summary of the experiments:
experiment #1: run with application rowmap
experiment #2: run with uniform map (so each GPU has about the same #nonzeros)
experiment #3: run with the application's rowmap, but zero out the three matrix rows with the largest #nonzeros and put a 1 on the diagonal for those rows (i.e., make them Dirichlet rows); see the sketch after this list
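The row modification in experiment #3 is conceptually simple; below is a hedged sketch of what it could look like on the local KokkosSparse::CrsMatrix. The helper name and the threshold-based row selection are assumptions for illustration only; in the actual experiment the three rows with the most nonzeros were zeroed explicitly, but the effect on the matrix is the same.

```cpp
// Hypothetical helper (not the code used in the experiment): turn every local
// row longer than maxRowLen into a Dirichlet row, i.e. zero all of its entries
// and put a 1 on the diagonal. Assumes the diagonal entry is stored in each row.
#include <Kokkos_Core.hpp>
#include <KokkosSparse_CrsMatrix.hpp>

template <class CrsMatrixType>
void makeDenseRowsDirichlet (CrsMatrixType A,
                             typename CrsMatrixType::ordinal_type maxRowLen)
{
  using ordinal_type = typename CrsMatrixType::ordinal_type;
  using value_type   = typename CrsMatrixType::value_type;
  using exec_space   = typename CrsMatrixType::execution_space;

  auto rowmap  = A.graph.row_map;   // row offsets
  auto entries = A.graph.entries;   // column indices
  auto values  = A.values;          // nonzero values (modified in place)
  const ordinal_type numRows = A.numRows ();

  Kokkos::parallel_for ("make_bulk_rows_Dirichlet",
    Kokkos::RangePolicy<exec_space> (0, numRows),
    KOKKOS_LAMBDA (const ordinal_type row) {
      const auto begin = rowmap (row);
      const auto end   = rowmap (row + 1);
      if (static_cast<ordinal_type> (end - begin) > maxRowLen) {
        for (auto j = begin; j < end; ++j) {
          values (j) = (entries (j) == row) ? value_type (1.0) : value_type (0.0);
        }
      }
    });
}
```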
@srajama1 @brian-kelley @lucbv