Performance improvements on Arm for legacy and k-quants #453
Conversation
I forgot to add a …
Another truly outstanding change!
    int8x16_t b[8];
};
// One would think this commented out version would do better than the one below
Maybe it will on different ARM microprocessors? I can test this on Raspberry Pi tomorrow.
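For readers without the diff in front of them, here is a minimal, self-contained sketch of the kind of register-blocked NEON dot product a struct holding `int8x16_t b[8]` is used for. It is illustrative only and assumes the Arm dotprod extension; the function name `dot_q8_block` and the flat int8 layout are made up for the example and are not the PR's actual kernel.

```cpp
#include <arm_neon.h>
#include <stdint.h>

// Illustrative only: dot product of two 128-byte int8 blocks, each held in
// 8 NEON registers (mirroring the `int8x16_t b[8]` member above).
// Requires __ARM_FEATURE_DOTPROD (Armv8.2+, e.g. Apple M2, Cortex-A76 / RPI5).
static int32_t dot_q8_block(const int8_t *x, const int8_t *y) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < 8; ++i) {
        int8x16_t a = vld1q_s8(x + 16 * i);
        int8x16_t b = vld1q_s8(y + 16 * i);
        acc = vdotq_s32(acc, a, b);   // widening int8 dot product into 4 lanes
    }
    return vaddvq_s32(acc);           // horizontal sum of the 4 accumulators
}
```

Different microarchitectures can prefer different unrolling and interleaving of exactly this kind of loop, which is presumably why the two variants discussed in the diff can trade places on different chips.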
@@ -322,7 +322,8 @@ bool llamafile_sgemm(long m, long n, long k, const void *A, long lda, const void
     assert(nth > 0);
     assert(ith < nth);

-#if defined(__x86_64__) && QK_K == 256
+#if QK_K == 256
I've always wondered, why would this ever need to be something other than 256?
There are models where the row size is not divisible by 256. The right thing to do would have been to make it work for such models as well by adding an incomplete last block. I had even started doing that, but it resulted in too many changes to the guts of ggml, so I abandoned it and instead added the option QK_K = 64. If it were up to me, I would remove support for QK_K = 64, but apparently there are people who still use this option.
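To make the constraint concrete, here is a small sketch (hypothetical helper name, not ggml's actual code) of why QK_K = 256 rules out models whose row size is not a multiple of 256, and why the smaller QK_K = 64 build option widens the set of usable models:

```cpp
#include <assert.h>

#ifndef QK_K
#define QK_K 256   /* k-quant super-block size; the alternative build option is 64 */
#endif

/* Hypothetical helper: k-quants store each row as a whole number of
   QK_K-element super-blocks, so the row length must divide evenly.
   E.g. a row of length 4096 works with QK_K = 256 (16 blocks), but a row of
   length 4160 only works with QK_K = 64 (65 blocks). */
static long k_quant_blocks_per_row(long row_len) {
    assert(row_len % QK_K == 0 && "row not divisible by QK_K; would need a partial last block");
    return row_len / QK_K;
}
```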
We can remove it in llamafile. There are always other quants to choose from for such models. For example, right now I'm working with stable diffusion and I was shocked to see that the inner dimension of most tensors is an odd number!
@@ -77,6 +79,9 @@ static bool try_parse_ftype(const std::string & ftype_str_in, llama_ftype & ftyp
             return true;
         }
     }
     // On my system (OS Ventura 13.2.1) calling std::stoi with invalid input leads to a crash (Segmentation fault 11)
I can fix that after this change goes in.
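For context, a minimal sketch of one way such a guard could look (hypothetical helper, not the actual follow-up fix): std::stoi throws std::invalid_argument or std::out_of_range on bad input, and if that propagates uncaught (or exceptions are disabled) the process dies; catching it or validating first avoids that.

```cpp
#include <optional>
#include <string>

// Hypothetical guard, not the actual fix: parse an integer without letting
// std::invalid_argument / std::out_of_range escape from std::stoi.
static std::optional<int> parse_int_safe(const std::string & s) {
    try {
        size_t pos = 0;
        int value = std::stoi(s, &pos);
        if (pos != s.size()) {
            return std::nullopt;   // trailing garbage, e.g. "7abc"
        }
        return value;
    } catch (const std::exception &) {
        return std::nullopt;       // empty string, non-numeric, out of range
    }
}
```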
Here are the improvements on my Mac Studio. Enormous gains for …
The gains are also enormous on Raspberry Pi; seeing 2x to 3x improvements is huge. I've gotten F16 to go as fast as 80 tok/sec (not sure why it doesn't anymore; it could potentially be due to cooling). However, I'm noticing that prediction is slowing down a bit on RPI5. Did you do anything to change that? Once again, it could be cooling. If you have any ideas, send me a follow-up change. With tinyBLAS, in many cases it'll punt control back to GGML when …
Approved! Just ran a quick perplexity test. Despite going 3x faster, Q6_K TinyLLaMA yields the exact same PPL before and after this change, which is 9.1482 +/- 0.13111. That's good. It means you haven't made any negative tradeoffs to achieve your considerable speedups. I measured this on my Mac Studio M2 Ultra w/ llamafile-perplexity -m /weights/TinyLlama-1.1B-Chat-v1.0.Q6_K.gguf --temp 0 --chunks 128 -f ~/vendor/wiki.test.raw -ngl 0
TG is severely limited by memory bandwidth and hence extremely sensitive to memory access patterns. I had to experiment quite a bit to get good results for PP and TG on the M2. I guess, if RPI5 is an important target, I would need to test on that as well.
We're only talking about ~15%, so chances are it's just noise. It seems like only yesterday that TG was 2-4 tok/s, so I'm very pleased at how fast things have progressed over the last year with these $100 computers.
FYI, an RPI5 won't throttle with an active cooler or case fan. Anyhow, you can test whether an RPI5 has throttled with vcgencmd get_throttled: if the value is different from 0x0 there is a problem. A Pi can also throttle with insufficient power. https://www.raspberrypi.com/documentation/computers/os.html#get_throttled
This PR adds matrix multiplication implementations for legacy and k-quants on __aarch64__ that are significantly more performant.

The following table compares performance between the main branch and this PR for a 7B LLaMA model running on M2 Max. We observe prompt processing speed improvements of up to a factor of 3.6, and even performance gains for token generation despite this being a memory-bound problem. The performance gain for Q4_0 and Q8_0 is smaller because the main branch already uses tinyBLAS for these (i.e., the 1.6X/1.35X improvement is on top of the ~2X improvement due to tinyBLAS).

As llamafile performance on my M2 Max laptop is lower compared to mainline llama.cpp, I also integrated this into current llama.cpp (build 2980, commit hash dacfcebd) to compare the performance. The following table summarizes the results. For an apples-to-apples comparison, the performance values for the master llama.cpp branch were obtained with the Accelerate framework disabled. Here, too, the performance gains are significant, up to 2.6X for Q2_K_S.