
Performance improvements on Arm for legacy and k-quants #453

Merged (2 commits) on May 30, 2024

Conversation

@ikawrakow ikawrakow commented May 27, 2024

This PR adds matrix multiplication implementations for legacy quants and k-quants on __aarch64__ that are significantly more performant.

The following table compares performance between the main branch and this PR for a 7B LLaMA model running on an M2 Max. We observe prompt processing speed improvements of up to a factor of 3.6, and performance gains even for token generation, despite that being a memory-bound problem. The performance gain for Q4_0 and Q8_0 is smaller because the main branch already uses tinyBLAS for these (i.e., the 1.6X/1.35X improvement comes on top of the ~2X improvement due to tinyBLAS).

| cpu_info | model_filename | size | test | t/s (main) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Apple M2 Max (+fp16+dotprod) | q80 | 6.67 GiB | pp512 | 63.33 | 85.46 | 1.349 |
| Apple M2 Max (+fp16+dotprod) | q40 | 3.56 GiB | pp512 | 55.65 | 88.97 | 1.599 |
| Apple M2 Max (+fp16+dotprod) | q41 | 3.95 GiB | pp512 | 22.51 | 75.98 | 3.375 |
| Apple M2 Max (+fp16+dotprod) | q50 | 4.33 GiB | pp512 | 19.94 | 71.91 | 3.606 |
| Apple M2 Max (+fp16+dotprod) | q51 | 4.72 GiB | pp512 | 17.42 | 61.54 | 3.533 |
| Apple M2 Max (+fp16+dotprod) | q2ks | 2.16 GiB | pp512 | 23.01 | 69.15 | 3.001 |
| Apple M2 Max (+fp16+dotprod) | q3ks | 2.75 GiB | pp512 | 16.98 | 52.05 | 3.065 |
| Apple M2 Max (+fp16+dotprod) | q4ks | 3.59 GiB | pp512 | 25.88 | 74.59 | 2.882 |
| Apple M2 Max (+fp16+dotprod) | q5ks | 4.33 GiB | pp512 | 19.58 | 57.69 | 2.946 |
| Apple M2 Max (+fp16+dotprod) | q6k | 5.15 GiB | pp512 | 18.17 | 52.79 | 2.905 |
| Apple M2 Max (+fp16+dotprod) | iq4xs | 3.37 GiB | pp512 | 23.72 | 72.03 | 3.037 |
| Apple M2 Max (+fp16+dotprod) | q80 | 6.67 GiB | tg128 | 15.68 | 16.27 | 1.038 |
| Apple M2 Max (+fp16+dotprod) | q40 | 3.56 GiB | tg128 | 27.06 | 27.63 | 1.021 |
| Apple M2 Max (+fp16+dotprod) | q41 | 3.95 GiB | tg128 | 19.44 | 25.24 | 1.298 |
| Apple M2 Max (+fp16+dotprod) | q50 | 4.33 GiB | tg128 | 17.46 | 19.22 | 1.101 |
| Apple M2 Max (+fp16+dotprod) | q51 | 4.72 GiB | tg128 | 15.25 | 17.99 | 1.180 |
| Apple M2 Max (+fp16+dotprod) | q2ks | 2.16 GiB | tg128 | 19.64 | 26.14 | 1.331 |
| Apple M2 Max (+fp16+dotprod) | q3ks | 2.75 GiB | tg128 | 15.07 | 18.00 | 1.194 |
| Apple M2 Max (+fp16+dotprod) | q4ks | 3.59 GiB | tg128 | 21.59 | 26.93 | 1.247 |
| Apple M2 Max (+fp16+dotprod) | q5ks | 4.33 GiB | tg128 | 17.49 | 18.75 | 1.072 |
| Apple M2 Max (+fp16+dotprod) | q6k | 5.15 GiB | tg128 | 15.75 | 19.97 | 1.268 |
| Apple M2 Max (+fp16+dotprod) | iq4xs | 3.37 GiB | tg128 | 21.14 | 23.30 | 1.102 |
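
For context on what these kernels build on, here is a minimal illustrative sketch (not code from this PR) of a Q8_0 × Q8_0 block dot product using the Arm `dotprod` extension. The block layout mirrors ggml's `block_q8_0`, the scale type is simplified to `__fp16`, and the function name `dot_q8_0` is made up for this example.

```cpp
#include <arm_neon.h>
#include <stdint.h>

#define QK8_0 32

typedef struct {
    __fp16 d;            // per-block scale (simplified; ggml stores fp16 bits)
    int8_t qs[QK8_0];    // 32 quantized values
} block_q8_0;

// Dot product of two rows stored as Q8_0 blocks.
static float dot_q8_0(const block_q8_0 *x, const block_q8_0 *y, int nblocks) {
    float sum = 0.0f;
    for (int i = 0; i < nblocks; ++i) {
        int32x4_t acc = vdupq_n_s32(0);
        // vdotq_s32 multiplies 16 int8 pairs and accumulates into 4 int32 lanes
        acc = vdotq_s32(acc, vld1q_s8(x[i].qs),      vld1q_s8(y[i].qs));
        acc = vdotq_s32(acc, vld1q_s8(x[i].qs + 16), vld1q_s8(y[i].qs + 16));
        sum += (float)x[i].d * (float)y[i].d * (float)vaddvq_s32(acc);
    }
    return sum;
}
```

The speedups in the table come from tiling such dot products over several rows and columns at once; the sketch only shows the innermost building block.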

As llamafile performance on my M2 Max laptop is lower than that of mainline llama.cpp, I also integrated the changes into current llama.cpp (build 2980, commit hash dacfcebd) to compare performance. The following table summarizes the results. For an apples-to-apples comparison, the performance values for the master llama.cpp branch were obtained with the Accelerate framework disabled. Here too the performance gains are significant, up to 2.6X for Q2_K_S.

| model | size | params | test | t/s (master) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | pp512 | 78.17 ± 1.18 | 96.78 ± 0.25 | 1.238 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | pp512 | 68.04 ± 1.18 | 79.32 ± 0.76 | 1.166 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | pp512 | 37.51 ± 0.61 | 67.96 ± 0.74 | 1.812 |
| llama 7B Q5_0 | 4.33 GiB | 6.74 B | pp512 | 30.24 ± 0.12 | 70.86 ± 0.03 | 2.343 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | pp512 | 26.27 ± 0.09 | 60.84 ± 0.05 | 2.316 |
| llama 7B Q2_K_S | 2.16 GiB | 6.74 B | pp512 | 32.98 ± 1.47 | 85.53 ± 0.20 | 2.593 |
| llama 7B Q3_K_S | 2.75 GiB | 6.74 B | pp512 | 26.01 ± 0.02 | 62.02 ± 0.73 | 2.385 |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | pp512 | 44.62 ± 0.80 | 77.01 ± 1.22 | 1.726 |
| llama 7B Q5_K_S | 4.33 GiB | 6.74 B | pp512 | 29.31 ± 0.04 | 69.16 ± 1.17 | 2.360 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | pp512 | 28.07 ± 0.03 | 62.85 ± 0.96 | 2.239 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | tg128 | 16.35 ± 0.10 | 16.74 ± 0.06 | 1.024 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | tg128 | 27.28 ± 0.10 | 29.59 ± 0.08 | 1.085 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | tg128 | 25.15 ± 0.16 | 26.97 ± 0.13 | 1.072 |
| llama 7B Q5_0 | 4.33 GiB | 6.74 B | tg128 | 22.08 ± 0.83 | 24.18 ± 0.15 | 1.095 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | tg128 | 20.45 ± 0.45 | 21.73 ± 0.26 | 1.063 |
| llama 7B Q2_K_S | 2.16 GiB | 6.74 B | tg128 | 28.34 ± 0.20 | 37.59 ± 0.32 | 1.326 |
| llama 7B Q3_K_S | 2.75 GiB | 6.74 B | tg128 | 22.73 ± 0.03 | 26.08 ± 0.09 | 1.146 |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | tg128 | 26.56 ± 0.10 | 27.82 ± 0.32 | 1.047 |
| llama 7B Q5_K_S | 4.33 GiB | 6.74 B | tg128 | 22.11 ± 0.18 | 23.73 ± 0.12 | 1.074 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | tg128 | 19.45 ± 0.13 | 20.52 ± 0.06 | 1.055 |

@ikawrakow ikawrakow marked this pull request as draft May 27, 2024 16:12
@ikawrakow (Contributor Author) commented:
I forgot to add a Q8_0 implementation (required because of the reordering of the quantized activations), so converting to draft until I add it.
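
To illustrate why a dedicated Q8_0 step is needed at all (the exact layout used in this PR is not shown here, so the struct below is purely hypothetical): kernels that process several rows at once benefit from repacking the quantized activations so the data they touch together is contiguous.

```cpp
#include <stdint.h>

#define QK8_0 32

// Hypothetical repacked layout, for illustration only: four consecutive Q8_0
// blocks grouped so a kernel handling four rows/columns per iteration can
// load scales and quants with contiguous reads.
typedef struct {
    uint16_t d[4];            // fp16 bits of the 4 block scales
    int8_t   qs[4 * QK8_0];   // the 4 blocks' quants, stored back to back
} block_q8_0x4;               // name made up for this sketch
```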

@ikawrakow ikawrakow marked this pull request as ready for review May 27, 2024 17:17
@jart (Collaborator) left a comment:
Another truly outstanding change!

int8x16_t b[8];
};

// One would think this commented out version would do better than the one below
Collaborator:
Maybe it will on different ARM microprocessors? I can test this on Raspberry Pi tomorrow.

@@ -322,7 +322,8 @@ bool llamafile_sgemm(long m, long n, long k, const void *A, long lda, const void
assert(nth > 0);
assert(ith < nth);

-#if defined(__x86_64__) && QK_K == 256
+#if QK_K == 256
Collaborator:
I've always wondered, why would this ever need to be something other than 256?

Contributor Author:
There are models where the row size is not divisible by 256. The right thing to do would have been to make it work also for such models by adding an incomplete last block. I had even started doing that, but this resulted in too many changes to the guts of ggml, so I abandoned it and instead added the option QK_K = 64. If it was up to me, I would remove support for QK_K = 64, but apparently there are people who still use this option.
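
A sketch of the constraint being discussed (assumption-based, not code from ggml or this PR): k-quants pack each tensor row into super-blocks of `QK_K` weights, so a row can only be quantized that way if its size is a multiple of `QK_K`.

```cpp
#define QK_K 256

// Returns true if a row of this size splits into whole QK_K super-blocks.
static bool row_fits_k_quants(long row_size) {
    return row_size % QK_K == 0;   // e.g. 4096 -> true, 4095 -> false
}
```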

Collaborator:
We can remove it in llamafile. There are always other quants to choose from for such models. For example, right now I'm working with stable diffusion and I was shocked to see that the inner dimension of most tensors is an odd number!

llama.cpp/ggml-common.h (resolved review thread)
@@ -77,6 +79,9 @@ static bool try_parse_ftype(const std::string & ftype_str_in, llama_ftype & ftyp
return true;
}
}
// On my system (OS Ventura 13.2.1) calling std::stoi with invalid input leads to a crash (Segmentation fault 11)
Collaborator:
I can fix that after this change goes in.
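
One possible guard, sketched here as an assumption rather than the fix that later landed: parse the ftype string with `std::from_chars`, which reports failure through an error code instead of throwing (or, on the system described above, crashing).

```cpp
#include <charconv>
#include <string>

// Hypothetical helper: returns false instead of throwing/crashing on bad input.
static bool try_parse_int(const std::string & s, int & out) {
    const char * first = s.data();
    const char * last  = s.data() + s.size();
    auto res = std::from_chars(first, last, out);
    return res.ec == std::errc() && res.ptr == last;
}
```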

jart commented May 29, 2024

Here are the improvements on my Mac Studio. Enormous gains for Q5_K_M, Q6_K, and Q5_0!! I'm actually very pleased that you're optimizing the legacy quants too, due to weird new models like IBM Granite 34b.

| cpu_info | model_filename | size | test | t/s (before) | t/s (after) | speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | pp512 | 693.92 | 883.96 | 1.27x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | tg16 | 70.39 | 103.10 | 1.46x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | pp512 | 222.32 | 617.74 | 2.78x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | tg16 | 96.01 | 96.93 | 1.01x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | pp512 | 244.09 | 658.62 | 2.70x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | tg16 | 93.74 | 103.06 | 1.10x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | pp512 | 245.62 | 809.91 | 3.30x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | tg16 | 96.11 | 106.78 | 1.11x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | pp512 | 625.47 | 943.14 | 1.51x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | tg16 | 129.34 | 124.60 | 0.96x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | pp512 | 249.27 | 694.66 | 2.79x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | tg16 | 108.34 | 105.45 | 0.97x |

The gains are also enormous on Raspberry Pi. Having 2x to 3x better performance is huge. I've gotten F16 to go as fast as 80 tok/sec (not sure why it doesn't anymore; it could potentially be due to cooling). However, I'm noticing that prediction is slowing down a bit on the RPI5. Did you do anything that would change that? Once again, it could be cooling. If you have any ideas, send me a follow-up change. With tinyBLAS, in many cases it'll punt control back to GGML when n=1; the special codepaths should only run when they add value.

| cpu_info | model_filename | size | test | t/s (before) | t/s (after) | speedup |
| --- | --- | --- | --- | --- | --- | --- |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | pp512 | 66.53 | 66.53 | 1.00x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | tg16 | 4.26 | 4.26 | 1.00x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | pp512 | 44.92 | 55.41 | 1.23x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | tg16 | 8.38 | 7.90 | 0.94x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | pp512 | 18.20 | 37.59 | 2.07x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | tg16 | 11.48 | 9.66 | 0.84x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | pp512 | 19.38 | 41.25 | 2.13x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | tg16 | 13.41 | 10.22 | 0.76x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | pp512 | 17.64 | 46.45 | 2.63x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | tg16 | 11.83 | 11.12 | 0.94x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | pp512 | 18.80 | 44.74 | 2.38x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | tg16 | 14.54 | 14.79 | 1.02x |
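
The n=1 fallback jart mentions above could look roughly like the sketch below (illustrative only, with a simplified signature and a made-up threshold): the optimized path declines very small batches so ggml's regular dot-product code handles token generation.

```cpp
// Returning false hands the multiplication back to the caller's generic path.
static bool maybe_use_fast_sgemm(long m, long n, long k) {
    if (n < 2) {
        return false;   // n == 1 is memory-bound token generation; let ggml do it
    }
    // ... dispatch to the tiled matrix-multiplication kernels ...
    return true;
}
```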

@jart (Collaborator) left a comment:
Approved! Just ran a quick perplexity test. Despite going 3x faster, Q6_K TinyLLaMA yields the exact same PPL before and after this change, which is 9.1482 +/- 0.13111. That's good. It means you haven't made any negative tradeoffs to achieve your considerable speedups. I measured this on my Mac Studio M2 Ultra w/ `llamafile-perplexity -m /weights/TinyLlama-1.1B-Chat-v1.0.Q6_K.gguf --temp 0 --chunks 128 -f ~/vendor/wiki.test.raw -ngl 0`.

@jart jart merged commit 293a528 into Mozilla-Ocho:main May 30, 2024
1 check passed
@ikawrakow (Contributor Author):
> However I'm noticing that prediction is slowing down a bit on RPI5. Did you do anything to change that?

TG is severely limited by memory bandwidth and hence extremely sensitive to memory access patterns. I had to experiment quite a bit to get good results for PP and TG on the M2. I guess, if RPI5 is an important target, I would need to test on that as well.

jart commented May 30, 2024

We're only talking about ~15%, so chances are it's just noise. It felt like only yesterday that TG was 2-4 t/s, so I'm very pleased at how fast things have progressed over the last year with these $100 computers.

Janghou commented Jun 25, 2024

FYI, an RPI5 won't throttle with an active cooler or case fan.

Anyhow, you can check whether an RPI5 has throttled:

> vcgencmd get_throttled
throttled=0x0

If the value is different from 0x0 there is a problem; a Pi can also throttle due to insufficient power.

https://www.raspberrypi.com/documentation/computers/os.html#get_throttled
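
For completeness, a small C++ sketch that decodes a `get_throttled` value; the bit positions are taken from the Raspberry Pi documentation linked above, so verify them there before relying on this.

```cpp
#include <cstdio>

int main() {
    // Example value: bits 16 and 18 set, i.e. under-voltage and throttling
    // have occurred at some point since boot.
    unsigned long v = 0x50000;

    if (v & (1UL << 0))  std::puts("under-voltage detected now");
    if (v & (1UL << 2))  std::puts("currently throttled");
    if (v & (1UL << 16)) std::puts("under-voltage has occurred");
    if (v & (1UL << 18)) std::puts("throttling has occurred");
    if (v == 0)          std::puts("no throttling or power problems reported");
    return 0;
}
```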
