jhux2 changed the title from "Multithreaded Symmetric Gauss-Seidel performance for matrices with a few dense rows" to "Multithreaded symmetric Gauss-Seidel performance for matrices with a few dense rows" on Mar 30, 2021
I’ve been testing multithreaded symmetric Gauss-Seidel (MTSGS) through Ifpack2 on some application matrices on Vortex. These matrices have ~699K rows, and most of the row stencil sizes are 50 or less, but there are some non-sparse rows (called “bulk rows”). Here are the nonzero counts by row, sorted largest first:
42254 31980 5088 237 48 47 …
Those first four bulk rows appear to be hurting the performance of MTSGS. I did some experiments to see what effect the bulk rows have. In each experiment, the linear system is solved 10 times with GMRES preconditioned by 3 MTSGS sweeps. There are MPI barriers before/after calls to the MTSGS kernel, as well as timers for the barriers themselves. What I found is that removing the bulk rows yields about a 13x speedup in KokkosSparse::Experimental::symmetric_gauss_seidel_apply.
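For reference, here's a minimal sketch of how an MTSGS preconditioner like this can be set up through Ifpack2. The type aliases, the helper name makeMTSGS, and the way the matrix A is obtained are assumptions for illustration; the parameter names are the ones Ifpack2's Relaxation class uses for its multithreaded Gauss-Seidel variants.

```cpp
// Minimal sketch (not the application's actual setup code): build an Ifpack2
// RELAXATION preconditioner that dispatches to the multithreaded symmetric
// Gauss-Seidel (MTSGS) kernel in KokkosKernels, with 3 sweeps per apply.
#include <Ifpack2_Factory.hpp>
#include <Teuchos_ParameterList.hpp>
#include <Teuchos_RCP.hpp>
#include <Tpetra_RowMatrix.hpp>

using SC              = double;
using row_matrix_type = Tpetra::RowMatrix<SC>;  // assumes default LO/GO/Node

Teuchos::RCP<Ifpack2::Preconditioner<SC>>
makeMTSGS (const Teuchos::RCP<const row_matrix_type>& A)
{
  Ifpack2::Factory factory;
  auto prec = factory.create ("RELAXATION", A);

  Teuchos::ParameterList params;
  params.set ("relaxation: type",   "MT Symmetric Gauss-Seidel"); // multithreaded SGS
  params.set ("relaxation: sweeps", 3);                           // 3 sweeps per apply
  prec->setParameters (params);

  prec->initialize ();
  prec->compute ();
  return prec;
}
```

The barrier/timer instrumentation mentioned above would then wrap each call into the MTSGS kernel with an MPI barrier on either side, with separate timers on the barriers so their cost can be subtracted from the kernel time.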
Here's a summary of the experiments:
experiment #1: run with application rowmap
experiment #2: run with uniform map (so each GPU has about the same #nonzeros)
experiment #3: run with the application's rowmap, but zero out the three matrix rows with the largest #nonzeros and put a 1 on the diagonal for those rows (i.e., make them Dirichlet rows); see the sketch after this list
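The row modification in experiment #3 is conceptually simple; below is a hedged sketch of what it could look like on the local KokkosSparse::CrsMatrix. The helper name and the threshold-based row selection are assumptions for illustration only; in the actual experiment the three rows with the most nonzeros were zeroed explicitly, but the effect on the matrix is the same.

```cpp
// Hypothetical helper (not the code used in the experiment): turn every local
// row longer than maxRowLen into a Dirichlet row, i.e. zero all of its entries
// and put a 1 on the diagonal. Assumes the diagonal entry is stored in each row.
#include <Kokkos_Core.hpp>
#include <KokkosSparse_CrsMatrix.hpp>

template <class CrsMatrixType>
void makeDenseRowsDirichlet (CrsMatrixType A,
                             typename CrsMatrixType::ordinal_type maxRowLen)
{
  using ordinal_type = typename CrsMatrixType::ordinal_type;
  using value_type   = typename CrsMatrixType::value_type;
  using exec_space   = typename CrsMatrixType::execution_space;

  auto rowmap  = A.graph.row_map;   // row offsets
  auto entries = A.graph.entries;   // column indices
  auto values  = A.values;          // nonzero values (modified in place)
  const ordinal_type numRows = A.numRows ();

  Kokkos::parallel_for ("make_bulk_rows_Dirichlet",
    Kokkos::RangePolicy<exec_space> (0, numRows),
    KOKKOS_LAMBDA (const ordinal_type row) {
      const auto begin = rowmap (row);
      const auto end   = rowmap (row + 1);
      if (static_cast<ordinal_type> (end - begin) > maxRowLen) {
        for (auto j = begin; j < end; ++j) {
          values (j) = (entries (j) == row) ? value_type (1.0) : value_type (0.0);
        }
      }
    });
}
```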
@srajama1 @brian-kelley @lucbv