Optimize aarch64 GEMM kernel #32

robertknight · 2024-01-05T23:07:39Z

Revise the aarch64 kernel to use SIMD intrinsics. The structure is the same as the AVX2 / FMA kernel, but the tile size is set to 8x8, as that performed best. I did not port the prefetching logic because std::arch::aarch64::_prefetch is currently nightly-only :(
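For readers unfamiliar with this kind of kernel, here is a rough sketch of what an 8x8 tile update with NEON intrinsics can look like. It is not the code from this PR. The function names, the packed panel layout (8 values of A and 8 values of B per depth step) and the dispatch wrapper are all assumptions made for illustration. The intrinsics used (vld1q_f32, vfmaq_n_f32, vst1q_f32, vdupq_n_f32) are stable in std::arch::aarch64.

```rust
/// Tile dimensions, following the 8x8 tile size mentioned above.
pub const MR: usize = 8;
pub const NR: usize = 8;

/// Portable reference: accumulate `depth` rank-1 updates into an MR x NR tile.
/// `a` holds MR values per depth step and `b` holds NR values per depth step
/// (a hypothetical packed layout, not necessarily the one the library uses).
pub fn kernel_8x8_scalar(tile: &mut [[f32; NR]; MR], a: &[f32], b: &[f32], depth: usize) {
    assert!(a.len() >= depth * MR && b.len() >= depth * NR);
    for k in 0..depth {
        let a_col = &a[k * MR..(k + 1) * MR];
        let b_row = &b[k * NR..(k + 1) * NR];
        for i in 0..MR {
            for j in 0..NR {
                tile[i][j] += a_col[i] * b_row[j];
            }
        }
    }
}

/// NEON version: each tile row is kept in two 4-lane registers, so the whole
/// 8x8 tile occupies 16 of the 32 vector registers available on aarch64.
#[cfg(target_arch = "aarch64")]
pub fn kernel_8x8_neon(tile: &mut [[f32; NR]; MR], a: &[f32], b: &[f32], depth: usize) {
    use std::arch::aarch64::{vdupq_n_f32, vfmaq_n_f32, vld1q_f32, vst1q_f32};
    assert!(a.len() >= depth * MR && b.len() >= depth * NR);
    // SAFETY: NEON is a baseline feature on aarch64 targets, and every pointer
    // access stays inside `tile` or inside the lengths asserted above.
    unsafe {
        let mut acc = [[vdupq_n_f32(0.0); 2]; MR];
        for i in 0..MR {
            acc[i][0] = vld1q_f32(tile[i].as_ptr());
            acc[i][1] = vld1q_f32(tile[i].as_ptr().add(4));
        }
        for k in 0..depth {
            // Load one row of B (8 floats) as two vectors.
            let b_lo = vld1q_f32(b.as_ptr().add(k * NR));
            let b_hi = vld1q_f32(b.as_ptr().add(k * NR + 4));
            for i in 0..MR {
                // acc += b * a_val, with the scalar broadcast across lanes.
                let a_val = a[k * MR + i];
                acc[i][0] = vfmaq_n_f32(acc[i][0], b_lo, a_val);
                acc[i][1] = vfmaq_n_f32(acc[i][1], b_hi, a_val);
            }
        }
        for i in 0..MR {
            vst1q_f32(tile[i].as_mut_ptr(), acc[i][0]);
            vst1q_f32(tile[i].as_mut_ptr().add(4), acc[i][1]);
        }
    }
}

/// Use the NEON kernel on aarch64 and the scalar reference elsewhere.
pub fn kernel_8x8(tile: &mut [[f32; NR]; MR], a: &[f32], b: &[f32], depth: usize) {
    #[cfg(target_arch = "aarch64")]
    kernel_8x8_neon(tile, a, b, depth);
    #[cfg(not(target_arch = "aarch64"))]
    kernel_8x8_scalar(tile, a, b, depth);
}
```

Because the NEON path uses fused multiply-add, its results can differ from the scalar reference in the last bits for general inputs. Comparisons between the two should use a tolerance unless the inputs are small integers.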

On an M1 Mac (AWS mac2.metal), performance for an M=N=K=1024 matmul increases from ~334 to ~418 GFLOPS.
On a Graviton 2 (AWS c6g.xlarge), performance for an M=N=K=1024 matmul increases from ~115 to ~136 GFLOPS.
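For context on how figures like these are usually derived, here is a small sketch that converts a matmul timing into GFLOPS and computes the relative speed-up. It assumes the common convention of counting 2 floating-point operations (one multiply, one add) per inner-product term. The function names are hypothetical, and the PR does not say exactly how its numbers were measured.

```rust
/// Floating-point operations for an (M x K) * (K x N) matmul, assuming
/// 2 operations (multiply + add) per inner-product term.
pub fn matmul_flops(m: usize, n: usize, k: usize) -> f64 {
    2.0 * m as f64 * n as f64 * k as f64
}

/// Throughput in GFLOPS for a matmul that took `seconds` to run.
pub fn gflops(m: usize, n: usize, k: usize, seconds: f64) -> f64 {
    matmul_flops(m, n, k) / seconds / 1e9
}

/// Relative improvement of `after` over `before`, in percent.
pub fn speedup_percent(before: f64, after: f64) -> f64 {
    (after / before - 1.0) * 100.0
}
```

Under that convention, an M=N=K=1024 matmul is about 2.15 billion operations. Applying speedup_percent to the figures above gives roughly a 25% improvement on the M1 and roughly 18% on the Graviton 2.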

One observation from my tests is that the ARM kernel's performance is much more sensitive to loop unrolling than the AVX kernel's. Without unrolling, performance drops by ~40 GFLOPS on the M1 in this test.
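To illustrate what unrolling means here, the sketch below shows the depth loop of a tile update written both rolled and manually unrolled by a factor of 4, with a remainder loop. Which loop the actual kernel unrolls, and by how much, is not stated in this PR. The unroll factor, the names and the packed layout are assumptions, and the compiler may unroll the rolled version on its own.

```rust
pub const TILE_ROWS: usize = 8;
pub const TILE_COLS: usize = 8;
/// Hypothetical unroll factor for the depth loop.
pub const UNROLL: usize = 4;

pub type Tile = [[f32; TILE_COLS]; TILE_ROWS];

/// One depth step: add the outer product of a column of A and a row of B.
#[inline(always)]
fn depth_step(tile: &mut Tile, a: &[f32], b: &[f32], k: usize) {
    let a_col = &a[k * TILE_ROWS..(k + 1) * TILE_ROWS];
    let b_row = &b[k * TILE_COLS..(k + 1) * TILE_COLS];
    for i in 0..TILE_ROWS {
        for j in 0..TILE_COLS {
            tile[i][j] += a_col[i] * b_row[j];
        }
    }
}

/// Rolled version: one depth step per loop iteration.
pub fn accumulate_rolled(tile: &mut Tile, a: &[f32], b: &[f32], depth: usize) {
    for k in 0..depth {
        depth_step(tile, a, b, k);
    }
}

/// Unrolled version: four depth steps per iteration, which reduces loop
/// overhead and gives the CPU more independent work to schedule.
pub fn accumulate_unrolled(tile: &mut Tile, a: &[f32], b: &[f32], depth: usize) {
    let mut k = 0;
    while k + UNROLL <= depth {
        depth_step(tile, a, b, k);
        depth_step(tile, a, b, k + 1);
        depth_step(tile, a, b, k + 2);
        depth_step(tile, a, b, k + 3);
        k += UNROLL;
    }
    // Remainder loop for depths that are not a multiple of UNROLL.
    while k < depth {
        depth_step(tile, a, b, k);
        k += 1;
    }
}
```

Both versions apply the depth steps in the same order, so they produce identical results. Only the loop structure differs.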

See #27

Optimize aarch64 GEMM kernel

31622c1

Revise aarch64 kernel to use SIMD intrinsics. The structure is the same as the AVX 2 / FMA kernel, but the tile size is set to 8x8 as that performed best. On an M1 Mac performance for an M=N=K=1024 matmul increases from ~334 to ~418 GFLOPS.

robertknight force-pushed the aarch64-gemm-v2 branch from bc57ee8 to 31622c1 Compare January 5, 2024 23:32

robertknight merged commit 7350002 into main Jan 5, 2024
1 check passed

robertknight deleted the aarch64-gemm-v2 branch January 5, 2024 23:35

robertknight mentioned this pull request Jan 5, 2024

aarch64 GEMM kernel #27

Closed
