
IQ4_XS_R4 #123

Merged: 5 commits, Dec 4, 2024
Conversation

ikawrakow (Owner) commented Dec 4, 2024

Follow-up of #118, #119, #120, #121, #122 for IQ4_XS.

I was curious to see if one can make the interleaved-rows strategy work for i- and k-quants with their super-blocks & blocks and two levels of scales. IQ4_XS seemed easiest, so I tackled that one first. We get a massive speedup on ARM_NEON (nearly 1.7X) and a more modest, but still significant, gain on AVX2/Zen4 (1.22-1.27X). I'm not 100% happy with the Zen4 implementation, but shuffling scale bits for 4 rows at once is tricky, so for now I have settled on a sub-optimal solution.
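To illustrate the general idea (this is only a sketch, not the actual `_R4` memory layout or the real `block_iq4_xs` struct, which live in the repository's quantization code): instead of storing each row's quantized blocks contiguously, the repacked format stores block j of four consecutive rows next to each other, so one pass of the matmul kernel processes 4 rows with the same loaded activations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a quantized super-block; the real IQ4_XS
// super-block packs 256 4-bit quants plus two levels of scales.
struct Block { int row; int index; };

// Interleave the blocks of 4 rows: the output stores block j of rows
// 0..3 consecutively, so a kernel iterating over the output touches
// 4 rows per pass instead of 1.
std::vector<Block> interleave4(const std::vector<std::vector<Block>>& rows) {
    assert(rows.size() == 4);
    const size_t nblk = rows[0].size();
    std::vector<Block> out;
    out.reserve(4 * nblk);
    for (size_t j = 0; j < nblk; ++j)
        for (int r = 0; r < 4; ++r)
            out.push_back(rows[r][j]);
    return out;
}
```

With 4 rows of 2 blocks each, the interleaved order is (row 0, blk 0), (row 1, blk 0), (row 2, blk 0), (row 3, blk 0), (row 0, blk 1), and so on.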

Anyway, here is PP-512 (prompt processing speed, tokens/second) for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX):

| Platform | Threads | IQ4_XS (t/s) | IQ4_XS_R4 (t/s) | Speedup |
|----------|--------:|-------------:|----------------:|--------:|
| ARM_NEON |       8 | 68.23 ± 1.06 |  115.43 ± 0.57 |   1.692 |
| Zen4     |      16 | 183.43 ± 0.60 | 223.98 ± 0.12 |   1.221 |
| AVX2     |      32 | 195.20 ± 0.40 | 248.25 ± 0.43 |   1.272 |
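The speedup column is simply the ratio of the two throughputs; a minimal check of the table's arithmetic:

```cpp
#include <cstdio>

// Speedup = repacked (R4) throughput / baseline throughput,
// both measured as PP-512 tokens per second.
double speedup(double base_tps, double r4_tps) { return r4_tps / base_tps; }

// Numbers taken from the table above:
//   speedup(68.23, 115.43)  ~ 1.692  (ARM_NEON)
//   speedup(183.43, 223.98) ~ 1.221  (Zen4)
//   speedup(195.20, 248.25) ~ 1.272  (AVX2)
```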

This is a first working version on Zen4. We get PP-512(LLaMA-3.1-8B) = 226 t/s, so 16% slower than iq4_nl_x4.

We get PP-512(LLaMA-3.1-8B) = 115.6 t/s on M2-Max, up from 68.2 t/s for iq4_xs!