Implement a GEMM kernel optimized for arm64 / aarch64.
Performance on an AWS c6g.xlarge instance (Graviton 2, 4 vCPU) with the generic kernel:
$ cargo test -p rten --release bench_gemm -- --nocapture --ignored
m 512 n 512 k 512 iters 1000. Duration 4093.245ms (4.093245ms/iter). GFLOPS 65.58011
m 1024 n 1024 k 1024 iters 125. Duration 4047.818ms (32.382545ms/iter). GFLOPS 66.316086
m 128 n 2048 k 512 iters 1000. Duration 4262.264ms (4.2622643ms/iter). GFLOPS 62.97955
m 2048 n 128 k 512 iters 1000. Duration 4100.306ms (4.100306ms/iter). GFLOPS 65.46718
gemm-benchmark performance for comparison, using the BLIS backend:
$ gemm-benchmark -d 1024 -t 4
Threads: 4
Iterations per thread: 1000
Matrix shape: 1024 x 1024
GFLOPS: 145.50
The OpenBLAS backend reports similar numbers (~150).
#30 added an initial kernel that improves performance from ~43% to ~78% of OpenBLAS (65 => 114 GFLOPS), tested on an AWS Graviton 2, but it still relies on auto-vectorization rather than intrinsics and does no prefetching.
Further optimizations are possible (e.g. prefetching), but #32 is a decent start. Compared to the generic kernel, M=N=K=1024 performance has improved from ~66 GFLOPS on an AWS c6g.xlarge to ~135 GFLOPS.