Revise aarch64 kernel to use SIMD intrinsics. The structure is the same as the AVX2 / FMA kernel, but the tile size is set to 8x8 as that performed best. I did not port the prefetching logic as `std::arch::aarch64::_prefetch` is currently nightly only :(

On an M1 Mac (AWS mac2.metal) performance for an M=N=K=1024 matmul increases from ~334 to ~418 GFLOPS.
On a Graviton 2 (AWS c6g.xlarge) performance for an M=N=K=1024 matmul increases from ~115 to ~136 GFLOPS.
One observation from my tests is that ARM performance is much more sensitive to unrolling than the AVX kernel is. Without unrolling, performance drops by ~40 GFLOPS on the M1 in this test.
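For readers unfamiliar with this style of kernel, here is a minimal sketch of an 8x8 tile update written with stable NEON intrinsics. It is an illustration only, not the code in this PR: the name `kernel_8x8`, the packed panel layout (`a` as `depth` columns of 8 values, `b` as `depth` rows of 8 values) and the scalar fallback for other architectures are all assumptions made for the example.

```rust
const MR: usize = 8; // tile rows (hypothetical name)
const NR: usize = 8; // tile columns (hypothetical name)

/// Accumulate `tile += A_panel * B_panel` for one 8x8 output tile.
/// Assumed layout: `a[k * MR + i]` is A[i][k], `b[k * NR + j]` is B[k][j].
#[cfg(target_arch = "aarch64")]
fn kernel_8x8(tile: &mut [[f32; NR]; MR], a: &[f32], b: &[f32], depth: usize) {
    use std::arch::aarch64::{
        float32x4_t, vaddq_f32, vdupq_n_f32, vfmaq_f32, vld1q_f32, vst1q_f32,
    };
    assert!(a.len() >= depth * MR && b.len() >= depth * NR);
    // SAFETY: NEON is a baseline feature of aarch64 targets, and the
    // assert above keeps every pointer load within the slices.
    unsafe {
        // 8 rows x 2 vectors of 4 lanes = 16 accumulator registers.
        let mut acc: [[float32x4_t; 2]; MR] = [[vdupq_n_f32(0.0); 2]; MR];
        for k in 0..depth {
            let b_row = b.as_ptr().add(k * NR);
            let b_lo = vld1q_f32(b_row);
            let b_hi = vld1q_f32(b_row.add(4));
            // Constant trip count, so the compiler is free to unroll this loop.
            for i in 0..MR {
                let a_i = vdupq_n_f32(a[k * MR + i]);
                // vfmaq_f32(acc, x, y) computes acc + x * y per lane.
                acc[i][0] = vfmaq_f32(acc[i][0], b_lo, a_i);
                acc[i][1] = vfmaq_f32(acc[i][1], b_hi, a_i);
            }
        }
        // Add the accumulators into the existing tile contents.
        for i in 0..MR {
            let row = tile[i].as_mut_ptr();
            vst1q_f32(row, vaddq_f32(vld1q_f32(row), acc[i][0]));
            vst1q_f32(row.add(4), vaddq_f32(vld1q_f32(row.add(4)), acc[i][1]));
        }
    }
}

/// Scalar fallback so the sketch also builds on non-aarch64 hosts.
#[cfg(not(target_arch = "aarch64"))]
fn kernel_8x8(tile: &mut [[f32; NR]; MR], a: &[f32], b: &[f32], depth: usize) {
    assert!(a.len() >= depth * MR && b.len() >= depth * NR);
    for k in 0..depth {
        for i in 0..MR {
            for j in 0..NR {
                tile[i][j] += a[k * MR + i] * b[k * NR + j];
            }
        }
    }
}
```

The sketch relies on the compiler to unroll the fixed-size inner loop, which is not guaranteed. Given how sensitive the ARM numbers above are to unrolling, a real kernel may need to force it explicitly.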
See #27