aarch64 GEMM kernel #27

Closed
robertknight opened this issue Jan 2, 2024 · 2 comments
Labels
performance Issues that affect model inference or loading performance

Comments

robertknight commented Jan 2, 2024

Implement a GEMM kernel optimized for arm64 / aarch64.

Performance on an AWS c6g.xlarge instance (Graviton 2, 4 vCPU) with the generic kernel:

$ cargo test -p rten --release bench_gemm -- --nocapture --ignored
m 512 n 512 k 512 iters 1000. Duration 4093.245ms (4.093245ms/iter). GFLOPS 65.58011
m 1024 n 1024 k 1024 iters 125. Duration 4047.818ms (32.382545ms/iter). GFLOPS 66.316086
m 128 n 2048 k 512 iters 1000. Duration 4262.264ms (4.2622643ms/iter). GFLOPS 62.97955
m 2048 n 128 k 512 iters 1000. Duration 4100.306ms (4.100306ms/iter). GFLOPS 65.46718
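For reference, the GFLOPS figures above follow from counting each multiply-add as two floating-point operations, so a full GEMM costs 2*m*n*k FLOPs. A quick sketch of that arithmetic (the helper function is illustrative, not part of rten):

```rust
/// Compute GFLOPS for a GEMM benchmark run. Each of the m*n output
/// elements takes k multiply-adds (2*k FLOPs), so one GEMM costs
/// 2*m*n*k FLOPs. (Illustrative helper, not part of rten.)
fn gemm_gflops(m: u64, n: u64, k: u64, iters: u64, duration_ms: f64) -> f64 {
    let flops = 2.0 * (m * n * k * iters) as f64;
    flops / (duration_ms / 1000.0) / 1e9
}

fn main() {
    // Reproduce the m=n=k=512 figure from the benchmark output above.
    println!("{:.2}", gemm_gflops(512, 512, 512, 1000, 4093.245)); // ~65.58
}
```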

gemm-benchmark performance for comparison, using the BLIS backend:

$ gemm-benchmark -d 1024 -t 4
Threads: 4
Iterations per thread: 1000
Matrix shape: 1024 x 1024
GFLOPS: 145.50

The OpenBLAS backend reports similar numbers (~150).

@robertknight robertknight added the performance Issues that affect model inference or loading performance label Jan 2, 2024
robertknight commented Jan 2, 2024

#30 added an initial kernel that improves performance from ~43% to ~78% of OpenBLAS (65 => 114 GFLOPS), tested on an AWS Graviton 2. It still relies on auto-vectorization rather than intrinsics, and doesn't do any prefetching.
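For context, an auto-vectorization-friendly GEMM micro-kernel keeps a fixed MR x NR accumulator tile and uses a fixed-size inner loop so the compiler can keep the tile in vector registers. A minimal, portable sketch of that pattern (the tile sizes, layout, and names here are assumptions for illustration, not rten's actual kernel):

```rust
// Illustrative auto-vectorization-friendly micro-kernel sketch;
// tile sizes and panel layout are assumptions, not rten's actual code.
const MR: usize = 4; // rows of the output tile
const NR: usize = 8; // columns of the output tile

/// Multiply an MR x depth panel of A (row-major) by a depth x NR panel
/// of B (row-major) into an MR x NR accumulator tile. The fixed-size
/// inner loop over NR is what lets the compiler vectorize it.
fn micro_kernel(a: &[f32], b: &[f32], depth: usize) -> [[f32; NR]; MR] {
    let mut acc = [[0.0f32; NR]; MR];
    for p in 0..depth {
        for i in 0..MR {
            let a_ip = a[i * depth + p];
            for j in 0..NR {
                acc[i][j] += a_ip * b[p * NR + j];
            }
        }
    }
    acc
}

fn main() {
    // Smoke test: depth=1, A column of ones, B row of 1..=NR,
    // so row i of the tile is just 1.0 * [1.0, 2.0, ..., NR].
    let a = [1.0f32; MR];
    let b: [f32; NR] = core::array::from_fn(|j| (j + 1) as f32);
    let acc = micro_kernel(&a, &b, 1);
    println!("{:?}", acc[0]);
}
```

A production kernel would additionally pack A and B into contiguous panels and, as noted above, could use NEON intrinsics and prefetching instead of relying on the compiler.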

This was referenced Jan 5, 2024
robertknight commented Jan 5, 2024

There are further optimizations possible (e.g. prefetching), but #32 is a decent start. Compared to the generic kernel, M=N=K=1024 performance has improved from ~66 GFLOPS to ~135 GFLOPS on an AWS c6g.xlarge.
