Q6_K_R4 #130

Merged
merged 6 commits into from
Dec 10, 2024

ikawrakow
Owner

Follow-up to #118, #119, #120, #121, #122, #123, #129, this time for Q6_K.

If nothing else, Q6_K is routinely used for the output tensor, so better Q6_K performance is useful.

We get a large speedup on ARM_NEON and non-negligible gains on AVX2/Zen4. Here is PP-512 (in t/s) for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX):

| Platform | Threads | Q6_K (t/s) | Q6_K_R4 (t/s) | Speedup |
|---|---|---|---|---|
| ARM_NEON | 8 | 57.57 ± 0.61 | 83.25 ± 0.81 | 1.446 |
| Zen4 | 16 | 195.20 ± 0.74 | 243.25 ± 0.31 | 1.246 |
| AVX2 | 32 | 194.51 ± 0.35 | 264.16 ± 0.44 | 1.358 |
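
The Speedup column is the ratio of Q6_K_R4 throughput to Q6_K throughput. As a small sanity check, the sketch below recomputes it from the PP-512 numbers in the table; the dictionary and the `speedup` helper are illustrative names, not part of the PR.

```python
# PP-512 throughput in t/s, copied from the table: (Q6_K, Q6_K_R4).
pp512 = {
    "ARM_NEON": (57.57, 83.25),
    "Zen4": (195.20, 243.25),
    "AVX2": (194.51, 264.16),
}


def speedup(baseline: float, new: float) -> float:
    """Ratio of the new throughput to the baseline throughput."""
    return new / baseline


for platform, (q6_k, q6_k_r4) in pp512.items():
    print(f"{platform}: {speedup(q6_k, q6_k_r4):.3f}")
```

Rounded to three decimals, this gives 1.446, 1.246 and 1.358, matching the table.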

Except on ARM_NEON, where TG performance is slightly lower for small numbers of threads, we gain even for TG. Here are results (in t/s) for TG-128 on LLaMA-3.1-8B with different numbers of threads:

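| Platform | Threads | Q6_K (t/s) | Q6_K_R4 (t/s) | Speedup |
|---|---|---|---|---|
| ARM_NEON | 2 | 7.46 ± 0.03 | 7.35 ± 0.01 | 0.985 |
| ARM_NEON | 4 | 13.88 ± 0.02 | 13.80 ± 0.01 | 0.994 |
| ARM_NEON | 8 | 18.31 ± 0.16 | 18.57 ± 0.14 | 1.014 |
| Zen4 | 1 | 5.38 ± 0.00 | 7.94 ± 0.00 | 1.476 |
| Zen4 | 2 | 8.93 ± 0.00 | 10.38 ± 0.00 | 1.162 |
| Zen4 | 4 | 9.97 ± 0.27 | 10.18 ± 0.01 | 1.021 |
| AVX2 | 2 | 4.75 ± 0.00 | 5.78 ± 0.01 | 1.217 |
| AVX2 | 4 | 7.57 ± 0.00 | 8.47 ± 0.00 | 1.119 |
| AVX2 | 8 | 8.23 ± 0.00 | 9.14 ± 0.00 | 1.111 |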

With this Zen4 implementation, for TG the available memory bandwidth is fully saturated with just 2 threads!
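
One way to read the saturation claim from the TG-128 numbers is to look at how much throughput is gained each time the thread count doubles. The sketch below does that for the Zen4 Q6_K_R4 column; `scaling_gains` is a hypothetical helper written for this illustration. A flat or falling curve is what one would expect once memory bandwidth is the limit, although this does not measure bandwidth directly.

```python
# TG-128 throughput (t/s) for Q6_K_R4 on Zen4, copied from the table,
# keyed by number of threads.
zen4_tg = {1: 7.94, 2: 10.38, 4: 10.18}


def scaling_gains(tps_by_threads: dict[int, float]) -> dict[tuple[int, int], float]:
    """Throughput ratio between each pair of consecutive thread counts."""
    threads = sorted(tps_by_threads)
    return {
        (a, b): tps_by_threads[b] / tps_by_threads[a]
        for a, b in zip(threads, threads[1:])
    }


gains = scaling_gains(zen4_tg)
for (a, b), gain in gains.items():
    print(f"{a} -> {b} threads: x{gain:.2f}")
```

Going from 1 to 2 threads gives roughly x1.31, while going from 2 to 4 threads gives roughly x0.98, so adding threads beyond 2 no longer helps.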

"Simple" as in processing 4 instead of 8 rows at once. On Zen4 we get PP-512(LLaMA-3.1-8B) = 238.3 t/s vs 195.2 t/s for Q6_K. TG-128 @ 1 thread is 7.94 t/s vs 5.38 t/s for Q6_K.

ARM_NEON: PP-512(LLaMA-3.1-8B) = 78 t/s vs 57.6 t/s for q6_K. TG-128 is slightly lower than q6_K for low numbers of threads and becomes very slightly better at 8 threads.

ARM_NEON: PP-512(LLaMA-3.1-8B) = 83.25 t/s.

Zen4: PP-512(LLaMA-3.1-8B) 238.3 t/s -> 243.2 t/s.
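
The commit note about processing 4 rows at once, together with the `_R4` suffix, suggests a row-interleaved layout in which the same block of 4 consecutive rows is stored contiguously. The PR text does not spell out the actual memory layout, so the following is only a generic sketch of that idea under this assumption. `interleave_rows` and the string "blocks" are hypothetical stand-ins and do not reproduce the real Q6_K_R4 bit packing.

```python
def interleave_rows(rows, group=4):
    """Pack `group` consecutive rows so that block k of each row is adjacent.

    rows: list of rows, each a list of blocks (any objects), all equal length.
    Returns one flat list per row group, ordered block-major:
    row0.block0, row1.block0, ..., row0.block1, row1.block1, ...
    """
    if len(rows) % group != 0:
        raise ValueError("number of rows must be a multiple of the group size")
    n_blocks = len(rows[0])
    if any(len(row) != n_blocks for row in rows):
        raise ValueError("all rows must contain the same number of blocks")

    packed = []
    for start in range(0, len(rows), group):
        chunk = rows[start:start + group]
        # Visit blocks in the outer loop and rows in the inner loop, so a
        # kernel can load block k of all `group` rows with one contiguous read.
        packed.append([chunk[r][b] for b in range(n_blocks) for r in range(group)])
    return packed


# Four rows with two blocks each, labelled by row and block index.
rows = [[f"r{r}b{b}" for b in range(2)] for r in range(4)]
print(interleave_rows(rows))
```

With this kind of layout, a matrix multiplication kernel can accumulate the dot products for 4 output rows while reading each activation block only once, which is the usual motivation for interleaving.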
ikawrakow merged commit 361174e into main on Dec 10, 2024