Ifpack2: Fused block jacobi #13837

Open
wants to merge 1 commit into
base: develop

brian-kelley
Contributor

@brian-kelley brian-kelley commented Feb 26, 2025

Adds more performant code paths for the block Jacobi case inside BTDS (GPU only, BlockCrs only).

  • Fuses residual and diagonal solve into one kernel
  • Inverts diagonal blocks completely in shared memory before writing them back out. This would otherwise double the shared memory requirement per team, so the vector length is halved to compensate.
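
The two points above can be illustrated with a minimal serial C++ sketch. It is not the Ifpack2 implementation, which runs these steps as GPU kernels. The names `BlockCrs`, `invertBlockInScratch` and `fusedBlockJacobiSweep` are hypothetical, the storage layout (row-major blocks) is an assumption, and damping is omitted. The two local buffers in the inversion stand in for team shared memory; holding both a working copy and the inverse is one plausible reading of why the scratch requirement doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal block-CRS storage for illustration (not Tpetra::BlockCrsMatrix).
struct BlockCrs {
  int bs;                      // block size (each block is bs x bs)
  std::vector<int> rowptr;     // numBlockRows + 1 offsets into colind
  std::vector<int> colind;     // block column index of each stored block
  std::vector<double> values;  // bs*bs entries per block, row-major
};

// Invert one bs x bs block using only local scratch buffers, then write the
// inverse out once. Gauss-Jordan without pivoting, so nonzero pivots are assumed.
void invertBlockInScratch(const double* block, double* inverse, int bs) {
  std::vector<double> a(block, block + bs * bs);  // working copy of the block
  std::vector<double> inv(bs * bs, 0.0);          // second buffer, starts as identity
  for (int i = 0; i < bs; ++i) inv[i * bs + i] = 1.0;
  for (int k = 0; k < bs; ++k) {
    const double pivotInv = 1.0 / a[k * bs + k];
    for (int j = 0; j < bs; ++j) {
      a[k * bs + j] *= pivotInv;
      inv[k * bs + j] *= pivotInv;
    }
    for (int i = 0; i < bs; ++i) {
      if (i == k) continue;
      const double f = a[i * bs + k];
      for (int j = 0; j < bs; ++j) {
        a[i * bs + j] -= f * a[k * bs + j];
        inv[i * bs + j] -= f * inv[k * bs + j];
      }
    }
  }
  for (int i = 0; i < bs * bs; ++i) inverse[i] = inv[i];
}

// One fused sweep: y_i = x_i + Dinv_i * (b_i - sum_j A_ij x_j).
// The residual for block row i lives only in the local array r and is
// consumed immediately, so it is never stored as a full vector.
void fusedBlockJacobiSweep(const BlockCrs& A, const std::vector<double>& dinv,
                           const std::vector<double>& b,
                           const std::vector<double>& x,
                           std::vector<double>& y) {
  assert(y.size() == x.size());
  const int bs = A.bs;
  const int numBlockRows = static_cast<int>(A.rowptr.size()) - 1;
  std::vector<double> r(bs);
  for (int i = 0; i < numBlockRows; ++i) {
    // Residual step: r = b_i - sum_j A_ij x_j.
    for (int k = 0; k < bs; ++k) r[k] = b[i * bs + k];
    for (int e = A.rowptr[i]; e < A.rowptr[i + 1]; ++e) {
      const double* blk = &A.values[static_cast<std::size_t>(e) * bs * bs];
      const double* xj = &x[static_cast<std::size_t>(A.colind[e]) * bs];
      for (int k = 0; k < bs; ++k)
        for (int c = 0; c < bs; ++c) r[k] -= blk[k * bs + c] * xj[c];
    }
    // Diagonal solve step, applied to the residual while it is still local.
    const double* di = &dinv[static_cast<std::size_t>(i) * bs * bs];
    for (int k = 0; k < bs; ++k) {
      double acc = x[i * bs + k];
      for (int c = 0; c < bs; ++c) acc += di[k * bs + c] * r[c];
      y[i * bs + k] = acc;
    }
  }
}
```

In this sketch `dinv` holds one inverted diagonal block per block row, filled by calling `invertBlockInScratch` once per row during setup. On a GPU each block row would map to a team or a vector lane rather than a loop iteration.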

Compared to #13805, this gives a speedup of between 9% (block size 7) and 33% (block size 11) on the overall solve, and about 1.9x in numeric setup for both block sizes. Note: these numbers were measured in a single-GPU run, so the speedup applies only to local computation. Multi-rank runs will see a smaller speedup because of the time spent in communication.
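
To make the multi-rank caveat concrete, here is a small sketch using Amdahl's law. The function name `overallSpeedup` and the communication fraction are hypothetical; no multi-rank timings were measured for this PR, so the fraction is an input you would have to supply from your own profile.

```cpp
// Overall speedup when only the local-computation part of the runtime gets
// faster (Amdahl's law). commFraction is the share of the ORIGINAL runtime
// spent in communication, which is unaffected by the change.
double overallSpeedup(double localSpeedup, double commFraction) {
  return 1.0 / (commFraction + (1.0 - commFraction) / localSpeedup);
}
```

For example, with the 33% local speedup (`localSpeedup = 1.33`) and a purely hypothetical 50% of time in communication, the overall speedup drops to roughly 1.14x. With no communication the function returns the local speedup unchanged.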

@trilinos/ifpack2

Related Issues

Follows #13805

Stakeholder Feedback

Will ask SPARC team to evaluate.

Testing

Tested with Ifpack2_BlockTriDiContainerUnitAndPerfTests on OpenMP, Cuda and HIP. On Cuda, tested with double, float and complex_double.

@brian-kelley brian-kelley added the pkg: Ifpack2, impacting: performance, and client: SPARC labels Feb 26, 2025
@brian-kelley brian-kelley requested a review from lucbv February 26, 2025 05:30
@brian-kelley brian-kelley self-assigned this Feb 26, 2025
@brian-kelley brian-kelley requested a review from a team as a code owner February 26, 2025 05:30
@brian-kelley brian-kelley changed the title from "Fused block jacobi" to "Ifpack2: Fused block jacobi" Feb 26, 2025
More performant path for block Jacobi case inside BTDS
(GPU only, BlockCrs only). Fuses residual and solve
into one kernel and doesn't convert vectors to SIMD-packed
format. Also inverts diag blocks fully in shared to speed up numeric.

Signed-off-by: Brian Kelley <[email protected]>
@trilinos-autotester
Contributor

Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection is Not Necessary for this Pull Request.

@trilinos-autotester
Contributor

Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects:

Pull Request Auto Testing STARTING

Build Information

Test Name: PR_gcc-openmpi-openmp

  • Build Num: 1165
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
FORCE_CLEAN true
GENCONFIG_BUILD_NAME rhel8_sems-gnu-8.5.0-openmpi-4.1.6-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables
PR_LABELS pkg: Ifpack2;impacting: performance;client: SPARC
PULLREQUESTNUM 13837
PULLREQUEST_CDASH_TRACK Pull Request
TEST_REPO_ALIAS TRILINOS
TRILINOS_NODE_LABEL rhel8
TRILINOS_SOURCE_REPO https://github.com/brian-kelley/Trilinos
TRILINOS_SOURCE_SHA 29fe448
TRILINOS_SRN_CONFIG true
TRILINOS_TARGET_BRANCH develop
TRILINOS_TARGET_REPO https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA 8eca3f9

Build Information

Test Name: PR_gcc-openmpi_debug

  • Build Num: 1216
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
FORCE_CLEAN true
GENCONFIG_BUILD_NAME rhel8_sems-gnu-8.5.0-openmpi-4.1.6-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables
PR_LABELS pkg: Ifpack2;impacting: performance;client: SPARC
PULLREQUESTNUM 13837
PULLREQUEST_CDASH_TRACK Pull Request
TEST_REPO_ALIAS TRILINOS
TRILINOS_NODE_LABEL rhel8
TRILINOS_SOURCE_REPO https://github.com/brian-kelley/Trilinos
TRILINOS_SOURCE_SHA 29fe448
TRILINOS_SRN_CONFIG true
TRILINOS_TARGET_BRANCH develop
TRILINOS_TARGET_REPO https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA 8eca3f9

Build Information

Test Name: PR_clang

  • Build Num: 1214
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
FORCE_CLEAN true
GENCONFIG_BUILD_NAME rhel8_sems-clang-11.0.1-openmpi-4.0.5-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables
PR_LABELS pkg: Ifpack2;impacting: performance;client: SPARC
PULLREQUESTNUM 13837
PULLREQUEST_CDASH_TRACK Pull Request
TEST_REPO_ALIAS TRILINOS
TRILINOS_NODE_LABEL rhel8
TRILINOS_SOURCE_REPO https://github.com/brian-kelley/Trilinos
TRILINOS_SOURCE_SHA 29fe448
TRILINOS_SRN_CONFIG true
TRILINOS_TARGET_BRANCH develop
TRILINOS_TARGET_REPO https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA 8eca3f9

Build Information

Test Name: PR_cuda

  • Build Num: 1213
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
FORCE_CLEAN true
GENCONFIG_BUILD_NAME rhel8_sems-cuda-11.4.2-gnu-10.1.0-openmpi-4.1.6_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables
PR_LABELS pkg: Ifpack2;impacting: performance;client: SPARC
PULLREQUESTNUM 13837
PULLREQUEST_CDASH_TRACK Pull Request
TEST_REPO_ALIAS TRILINOS
TRILINOS_NODE_LABEL rhel8-gpu
TRILINOS_SOURCE_REPO https://github.com/brian-kelley/Trilinos
TRILINOS_SOURCE_SHA 29fe448
TRILINOS_SRN_CONFIG true
TRILINOS_TARGET_BRANCH develop
TRILINOS_TARGET_REPO https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA 8eca3f9

Build Information

Test Name: PR_intel

  • Build Num: 1134
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
FORCE_CLEAN true
GENCONFIG_BUILD_NAME rhel8_sems-intel-2021.3-sems-openmpi-4.1.6_release-debug_shared_no-kokkos-arch_no-asan_no-complex_fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables
PR_LABELS pkg: Ifpack2;impacting: performance;client: SPARC
PULLREQUESTNUM 13837
PULLREQUEST_CDASH_TRACK Pull Request
TEST_REPO_ALIAS TRILINOS
TRILINOS_NODE_LABEL rhel8
TRILINOS_SOURCE_REPO https://github.com/brian-kelley/Trilinos
TRILINOS_SOURCE_SHA 29fe448
TRILINOS_SRN_CONFIG true
TRILINOS_TARGET_BRANCH develop
TRILINOS_TARGET_REPO https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA 8eca3f9

Build Information

Test Name: PR_cuda-uvm

  • Build Num: 1213
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
FORCE_CLEAN true
GENCONFIG_BUILD_NAME rhel8_sems-cuda-11.4.2-gnu-10.1.0-openmpi-4.1.6_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables
PR_LABELS pkg: Ifpack2;impacting: performance;client: SPARC
PULLREQUESTNUM 13837
PULLREQUEST_CDASH_TRACK Pull Request
TEST_REPO_ALIAS TRILINOS
TRILINOS_NODE_LABEL rhel8
TRILINOS_SOURCE_REPO https://github.com/brian-kelley/Trilinos
TRILINOS_SOURCE_SHA 29fe448
TRILINOS_SRN_CONFIG true
TRILINOS_TARGET_BRANCH develop
TRILINOS_TARGET_REPO https://github.com/trilinos/Trilinos
TRILINOS_TARGET_SHA 8eca3f9

Using Repos:

Repo: TRILINOS (brian-kelley/Trilinos)
  • Branch: FusedBlockJacobiFinal
  • SHA: 29fe448
  • Mode: TEST_REPO

Pull Request Author: brian-kelley

Labels
client: SPARC (Issues related to or needed more specifically by the ATDM SPARC code), impacting: performance, pkg: Ifpack2
2 participants