
Performance improvements on Arm for legacy and k-quants #453

Merged (2 commits) on May 30, 2024

Conversation

@ikawrakow ikawrakow commented May 27, 2024

This PR adds matrix multiplication implementations for legacy quants and k-quants on __aarch64__ that are significantly more performant.

The following table compares performance between the main branch and this PR for a 7B LLaMA model running on an M2 Max. We observe prompt processing speed improvements of up to a factor of 3.6, and performance gains even for token generation, despite that being a memory-bound problem. The performance gain for Q4_0 and Q8_0 is smaller because the main branch already uses tinyBLAS for these (i.e., the 1.6X/1.35X improvement comes on top of the ~2X improvement due to tinyBLAS).

| cpu_info | model_filename | size | test | t/s (main) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Apple M2 Max (+fp16+dotprod) | q80 | 6.67 GiB | pp512 | 63.33 | 85.46 | 1.349 |
| Apple M2 Max (+fp16+dotprod) | q40 | 3.56 GiB | pp512 | 55.65 | 88.97 | 1.599 |
| Apple M2 Max (+fp16+dotprod) | q41 | 3.95 GiB | pp512 | 22.51 | 75.98 | 3.375 |
| Apple M2 Max (+fp16+dotprod) | q50 | 4.33 GiB | pp512 | 19.94 | 71.91 | 3.606 |
| Apple M2 Max (+fp16+dotprod) | q51 | 4.72 GiB | pp512 | 17.42 | 61.54 | 3.533 |
| Apple M2 Max (+fp16+dotprod) | q2ks | 2.16 GiB | pp512 | 23.01 | 69.15 | 3.001 |
| Apple M2 Max (+fp16+dotprod) | q3ks | 2.75 GiB | pp512 | 16.98 | 52.05 | 3.065 |
| Apple M2 Max (+fp16+dotprod) | q4ks | 3.59 GiB | pp512 | 25.88 | 74.59 | 2.882 |
| Apple M2 Max (+fp16+dotprod) | q5ks | 4.33 GiB | pp512 | 19.58 | 57.69 | 2.946 |
| Apple M2 Max (+fp16+dotprod) | q6k | 5.15 GiB | pp512 | 18.17 | 52.79 | 2.905 |
| Apple M2 Max (+fp16+dotprod) | iq4xs | 3.37 GiB | pp512 | 23.72 | 72.03 | 3.037 |
| Apple M2 Max (+fp16+dotprod) | q80 | 6.67 GiB | tg128 | 15.68 | 16.27 | 1.038 |
| Apple M2 Max (+fp16+dotprod) | q40 | 3.56 GiB | tg128 | 27.06 | 27.63 | 1.021 |
| Apple M2 Max (+fp16+dotprod) | q41 | 3.95 GiB | tg128 | 19.44 | 25.24 | 1.298 |
| Apple M2 Max (+fp16+dotprod) | q50 | 4.33 GiB | tg128 | 17.46 | 19.22 | 1.101 |
| Apple M2 Max (+fp16+dotprod) | q51 | 4.72 GiB | tg128 | 15.25 | 17.99 | 1.180 |
| Apple M2 Max (+fp16+dotprod) | q2ks | 2.16 GiB | tg128 | 19.64 | 26.14 | 1.331 |
| Apple M2 Max (+fp16+dotprod) | q3ks | 2.75 GiB | tg128 | 15.07 | 18.00 | 1.194 |
| Apple M2 Max (+fp16+dotprod) | q4ks | 3.59 GiB | tg128 | 21.59 | 26.93 | 1.247 |
| Apple M2 Max (+fp16+dotprod) | q5ks | 4.33 GiB | tg128 | 17.49 | 18.75 | 1.072 |
| Apple M2 Max (+fp16+dotprod) | q6k | 5.15 GiB | tg128 | 15.75 | 19.97 | 1.268 |
| Apple M2 Max (+fp16+dotprod) | iq4xs | 3.37 GiB | tg128 | 21.14 | 23.30 | 1.102 |
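
For context on what these kernels build on, here is a minimal illustrative sketch (not code from this PR) of a Q8_0 × Q8_0 block dot product using the Arm `dotprod` extension. The block layout mirrors ggml's `block_q8_0`, the scale type is simplified to `__fp16`, and the function name `dot_q8_0` is made up for this example.

```cpp
#include <arm_neon.h>
#include <stdint.h>

#define QK8_0 32

typedef struct {
    __fp16 d;            // per-block scale (simplified; ggml stores fp16 bits)
    int8_t qs[QK8_0];    // 32 quantized values
} block_q8_0;

// Dot product of two rows stored as Q8_0 blocks.
static float dot_q8_0(const block_q8_0 *x, const block_q8_0 *y, int nblocks) {
    float sum = 0.0f;
    for (int i = 0; i < nblocks; ++i) {
        int32x4_t acc = vdupq_n_s32(0);
        // vdotq_s32 multiplies 16 int8 pairs and accumulates into 4 int32 lanes
        acc = vdotq_s32(acc, vld1q_s8(x[i].qs),      vld1q_s8(y[i].qs));
        acc = vdotq_s32(acc, vld1q_s8(x[i].qs + 16), vld1q_s8(y[i].qs + 16));
        sum += (float)x[i].d * (float)y[i].d * (float)vaddvq_s32(acc);
    }
    return sum;
}
```

The speedups in the table come from tiling such dot products over several rows and columns at once; the sketch only shows the innermost building block.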

As llamafile performance on my M2 Max laptop is lower than that of mainline llama.cpp, I also integrated the changes into current llama.cpp (build 2980, commit hash dacfcebd) to compare performance. The following table summarizes the results. For an apples-to-apples comparison, the performance values for the master llama.cpp branch were obtained with the Accelerate framework disabled. Here too the performance gains are significant, up to 2.6X for Q2_K_S.

| model | size | params | test | t/s (master) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | pp512 | 78.17 ± 1.18 | 96.78 ± 0.25 | 1.238 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | pp512 | 68.04 ± 1.18 | 79.32 ± 0.76 | 1.166 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | pp512 | 37.51 ± 0.61 | 67.96 ± 0.74 | 1.812 |
| llama 7B Q5_0 | 4.33 GiB | 6.74 B | pp512 | 30.24 ± 0.12 | 70.86 ± 0.03 | 2.343 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | pp512 | 26.27 ± 0.09 | 60.84 ± 0.05 | 2.316 |
| llama 7B Q2_K_S | 2.16 GiB | 6.74 B | pp512 | 32.98 ± 1.47 | 85.53 ± 0.20 | 2.593 |
| llama 7B Q3_K_S | 2.75 GiB | 6.74 B | pp512 | 26.01 ± 0.02 | 62.02 ± 0.73 | 2.385 |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | pp512 | 44.62 ± 0.80 | 77.01 ± 1.22 | 1.726 |
| llama 7B Q5_K_S | 4.33 GiB | 6.74 B | pp512 | 29.31 ± 0.04 | 69.16 ± 1.17 | 2.360 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | pp512 | 28.07 ± 0.03 | 62.85 ± 0.96 | 2.239 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | tg128 | 16.35 ± 0.10 | 16.74 ± 0.06 | 1.024 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | tg128 | 27.28 ± 0.10 | 29.59 ± 0.08 | 1.085 |
| llama 7B Q4_1 | 3.95 GiB | 6.74 B | tg128 | 25.15 ± 0.16 | 26.97 ± 0.13 | 1.072 |
| llama 7B Q5_0 | 4.33 GiB | 6.74 B | tg128 | 22.08 ± 0.83 | 24.18 ± 0.15 | 1.095 |
| llama 7B Q5_1 | 4.72 GiB | 6.74 B | tg128 | 20.45 ± 0.45 | 21.73 ± 0.26 | 1.063 |
| llama 7B Q2_K_S | 2.16 GiB | 6.74 B | tg128 | 28.34 ± 0.20 | 37.59 ± 0.32 | 1.326 |
| llama 7B Q3_K_S | 2.75 GiB | 6.74 B | tg128 | 22.73 ± 0.03 | 26.08 ± 0.09 | 1.146 |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | tg128 | 26.56 ± 0.10 | 27.82 ± 0.32 | 1.047 |
| llama 7B Q5_K_S | 4.33 GiB | 6.74 B | tg128 | 22.11 ± 0.18 | 23.73 ± 0.12 | 1.074 |
| llama 7B Q6_K | 5.15 GiB | 6.74 B | tg128 | 19.45 ± 0.13 | 20.52 ± 0.06 | 1.055 |

@ikawrakow ikawrakow marked this pull request as draft May 27, 2024 16:12
@ikawrakow (Contributor Author) commented:
I forgot to add a Q8_0 implementation (required because of the reordering of the quantized activations), so converting to draft until I add it.
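
To illustrate why a dedicated Q8_0 step is needed at all (the exact layout used in this PR is not shown here, so the struct below is purely hypothetical): kernels that process several rows at once benefit from repacking the quantized activations so the data they touch together is contiguous.

```cpp
#include <stdint.h>

#define QK8_0 32

// Hypothetical repacked layout, for illustration only: four consecutive Q8_0
// blocks grouped so a kernel handling four rows/columns per iteration can
// load scales and quants with contiguous reads.
typedef struct {
    uint16_t d[4];            // fp16 bits of the 4 block scales
    int8_t   qs[4 * QK8_0];   // the 4 blocks' quants, stored back to back
} block_q8_0x4;               // name made up for this sketch
```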

@ikawrakow ikawrakow marked this pull request as ready for review May 27, 2024 17:17
@jart (Collaborator) left a comment:
Another truly outstanding change!

int8x16_t b[8];
};

// One would think this commented out version would do better than the one below
Collaborator:
Maybe it will on different ARM microprocessors? I can test this on Raspberry Pi tomorrow.

@@ -322,7 +322,8 @@ bool llamafile_sgemm(long m, long n, long k, const void *A, long lda, const void
assert(nth > 0);
assert(ith < nth);

-#if defined(__x86_64__) && QK_K == 256
+#if QK_K == 256
Collaborator:
I've always wondered, why would this ever need to be something other than 256?

Contributor Author:
There are models where the row size is not divisible by 256. The right thing to do would have been to make it work also for such models by adding an incomplete last block. I had even started doing that, but this resulted in too many changes to the guts of ggml, so I abandoned it and instead added the option QK_K = 64. If it was up to me, I would remove support for QK_K = 64, but apparently there are people who still use this option.
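
A sketch of the constraint being discussed (assumption-based, not code from ggml or this PR): k-quants pack each tensor row into super-blocks of `QK_K` weights, so a row can only be quantized that way if its size is a multiple of `QK_K`.

```cpp
#define QK_K 256

// Returns true if a row of this size splits into whole QK_K super-blocks.
static bool row_fits_k_quants(long row_size) {
    return row_size % QK_K == 0;   // e.g. 4096 -> true, 4095 -> false
}
```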

Collaborator:
We can remove it in llamafile. There are always other quants to choose from for such models. For example, right now I'm working with stable diffusion and I was shocked to see that the inner dimension of most tensors is an odd number!

llama.cpp/ggml-common.h (resolved review thread)
@@ -77,6 +79,9 @@ static bool try_parse_ftype(const std::string & ftype_str_in, llama_ftype & ftyp
return true;
}
}
// On my system (OS Ventura 13.2.1) calling std::stoi with invalid input leads to a crash (Segmentation fault 11)
Collaborator:
I can fix that after this change goes in.
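
One possible guard, sketched here as an assumption rather than the fix that later landed: parse the ftype string with `std::from_chars`, which reports failure through an error code instead of throwing (or, on the system described above, crashing).

```cpp
#include <charconv>
#include <string>

// Hypothetical helper: returns false instead of throwing/crashing on bad input.
static bool try_parse_int(const std::string & s, int & out) {
    const char * first = s.data();
    const char * last  = s.data() + s.size();
    auto res = std::from_chars(first, last, out);
    return res.ec == std::errc() && res.ptr == last;
}
```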

jart commented May 29, 2024

Here are the improvements on my Mac Studio. Enormous gains for Q5_K_M, Q6_K, and Q5_0!! I'm actually very pleased that you're optimizing the legacy quants too, due to weird new models like IBM Granite 34b.

| cpu_info | model_filename | size | test | t/s (before) | t/s (after) | speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | pp512 | 693.92 | 883.96 | 1.27x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | tg16 | 70.39 | 103.10 | 1.46x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | pp512 | 222.32 | 617.74 | 2.78x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | tg16 | 96.01 | 96.93 | 1.01x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | pp512 | 244.09 | 658.62 | 2.70x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | tg16 | 93.74 | 103.06 | 1.10x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | pp512 | 245.62 | 809.91 | 3.30x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | tg16 | 96.11 | 106.78 | 1.11x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | pp512 | 625.47 | 943.14 | 1.51x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q4_0 | 606.53 MiB | tg16 | 129.34 | 124.60 | 0.96x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | pp512 | 249.27 | 694.66 | 2.79x |
| Apple M2 Ultra (+fp16+dotprod) | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | tg16 | 108.34 | 105.45 | 0.97x |

The gains are also enormous on Raspberry Pi. Having 2x to 3x better performance is huge. I've gotten F16 to go as fast as 80 tok/sec (not sure why it doesn't anymore; it could potentially be due to cooling). However, I'm noticing that prediction is slowing down a bit on the RPI5. Did you do anything that would change that? Once again, it could be cooling. If you have any ideas, send me a follow-up change. With tinyBLAS, in many cases it'll punt control back to GGML when n=1; the special codepaths should only run when they add value.

| cpu_info | model_filename | size | test | t/s (before) | t/s (after) | speedup |
| --- | --- | --- | --- | --- | --- | --- |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | pp512 | 66.53 | 66.53 | 1.00x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.F16 | 2.05 GiB | tg16 | 4.26 | 4.26 | 1.00x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | pp512 | 44.92 | 55.41 | 1.23x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q8_0 | 1.09 GiB | tg16 | 8.38 | 7.90 | 0.94x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | pp512 | 18.20 | 37.59 | 2.07x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q6_K | 860.86 MiB | tg16 | 11.48 | 9.66 | 0.84x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | pp512 | 19.38 | 41.25 | 2.13x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_K_M | 745.11 MiB | tg16 | 13.41 | 10.22 | 0.76x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | pp512 | 17.64 | 46.45 | 2.63x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q5_0 | 729.84 MiB | tg16 | 11.83 | 11.12 | 0.94x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | pp512 | 18.80 | 44.74 | 2.38x |
| +fp16+dotprod | TinyLlama-1.1B-Chat-v1.0.Q2_K | 411.41 MiB | tg16 | 14.54 | 14.79 | 1.02x |
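
The n=1 fallback jart mentions above could look roughly like the sketch below (illustrative only, with a simplified signature and a made-up threshold): the optimized path declines very small batches so ggml's regular dot-product code handles token generation.

```cpp
// Returning false hands the multiplication back to the caller's generic path.
static bool maybe_use_fast_sgemm(long m, long n, long k) {
    if (n < 2) {
        return false;   // n == 1 is memory-bound token generation; let ggml do it
    }
    // ... dispatch to the tiled matrix-multiplication kernels ...
    return true;
}
```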

@jart (Collaborator) left a comment:
Approved! Just ran a quick perplexity test. Despite going 3x faster, Q6_K TinyLLaMA yields the exact same PPL before and after this change, which is 9.1482 +/- 0.13111. That's good. It means you haven't made any negative tradeoffs to achieve your considerable speedups. I measured this on my Mac Studio M2 Ultra w/ `llamafile-perplexity -m /weights/TinyLlama-1.1B-Chat-v1.0.Q6_K.gguf --temp 0 --chunks 128 -f ~/vendor/wiki.test.raw -ngl 0`.

@jart jart merged commit 293a528 into Mozilla-Ocho:main May 30, 2024
1 check passed
@ikawrakow (Contributor Author):
> However I'm noticing that prediction is slowing down a bit on RPI5. Did you do anything to change that?

TG is severely limited by memory bandwidth and hence extremely sensitive to memory access patterns. I had to experiment quite a bit to get good results for PP and TG on the M2. I guess, if RPI5 is an important target, I would need to test on that as well.

jart commented May 30, 2024

We're only talking about ~15%, so chances are it's just noise. It felt like only yesterday that TG was 2-4 t/s, so I'm very pleased at how fast things have progressed over the last year with these $100 computers.

Janghou commented Jun 25, 2024

FYI, an RPI5 won't throttle with an active cooler or case fan.

Anyhow, you can check whether an RPI5 has throttled:

> vcgencmd get_throttled
throttled=0x0

If the value is different from 0x0 there is a problem; a Pi can also throttle due to insufficient power.

https://www.raspberrypi.com/documentation/computers/os.html#get_throttled
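
For completeness, a small C++ sketch that decodes a `get_throttled` value; the bit positions are taken from the Raspberry Pi documentation linked above, so verify them there before relying on this.

```cpp
#include <cstdio>

int main() {
    // Example value: bits 16 and 18 set, i.e. under-voltage and throttling
    // have occurred at some point since boot.
    unsigned long v = 0x50000;

    if (v & (1UL << 0))  std::puts("under-voltage detected now");
    if (v & (1UL << 2))  std::puts("currently throttled");
    if (v & (1UL << 16)) std::puts("under-voltage has occurred");
    if (v & (1UL << 18)) std::puts("throttling has occurred");
    if (v == 0)          std::puts("no throttling or power problems reported");
    return 0;
}
```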
