
IQ4_XS_R4 #123

Merged: 5 commits, Dec 4, 2024
Conversation

ikawrakow (Owner) commented Dec 4, 2024

Follow-up of #118, #119, #120, #121, #122 for IQ4_XS.

I was curious to see if one can make the interleaved-rows strategy work for i- and k-quants with their super-blocks & blocks and two levels of scales. IQ4_XS seemed easiest, so I tackled that one first. We get a massive speedup on ARM_NEON (nearly 1.7X) and a more modest, but still significant, gain on AVX2/Zen4 (1.22-1.27X). I'm not 100% happy with the Zen4 implementation, but shuffling scale bits for 4 rows at once is tricky, so for now I have settled on a sub-optimal solution.
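To illustrate the general idea (this is only a sketch, not the actual `_R4` memory layout or the real `block_iq4_xs` struct, which live in the repository's quantization code): instead of storing each row's quantized blocks contiguously, the repacked format stores block j of four consecutive rows next to each other, so one pass of the matmul kernel processes 4 rows with the same loaded activations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a quantized super-block; the real IQ4_XS
// super-block packs 256 4-bit quants plus two levels of scales.
struct Block { int row; int index; };

// Interleave the blocks of 4 rows: the output stores block j of rows
// 0..3 consecutively, so a kernel iterating over the output touches
// 4 rows per pass instead of 1.
std::vector<Block> interleave4(const std::vector<std::vector<Block>>& rows) {
    assert(rows.size() == 4);
    const size_t nblk = rows[0].size();
    std::vector<Block> out;
    out.reserve(4 * nblk);
    for (size_t j = 0; j < nblk; ++j)
        for (int r = 0; r < 4; ++r)
            out.push_back(rows[r][j]);
    return out;
}
```

With 4 rows of 2 blocks each, the interleaved order is (row 0, blk 0), (row 1, blk 0), (row 2, blk 0), (row 3, blk 0), (row 0, blk 1), and so on.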

Anyway, here is PP-512 (prompt processing speed, tokens/second) for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX):

| Platform | Threads | IQ4_XS (t/s) | IQ4_XS_R4 (t/s) | Speedup |
|----------|--------:|-------------:|----------------:|--------:|
| ARM_NEON |       8 | 68.23 ± 1.06 |  115.43 ± 0.57 |   1.692 |
| Zen4     |      16 | 183.43 ± 0.60 | 223.98 ± 0.12 |   1.221 |
| AVX2     |      32 | 195.20 ± 0.40 | 248.25 ± 0.43 |   1.272 |
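The speedup column is simply the ratio of the two throughputs; a minimal check of the table's arithmetic:

```cpp
#include <cstdio>

// Speedup = repacked (R4) throughput / baseline throughput,
// both measured as PP-512 tokens per second.
double speedup(double base_tps, double r4_tps) { return r4_tps / base_tps; }

// Numbers taken from the table above:
//   speedup(68.23, 115.43)  ~ 1.692  (ARM_NEON)
//   speedup(183.43, 223.98) ~ 1.221  (Zen4)
//   speedup(195.20, 248.25) ~ 1.272  (AVX2)
```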

This is a first working version on Zen4. We get PP-512(LLaMA-3.1-8B) = 226 t/s, so 16% slower than iq4_nl_x4.

We get PP-512(LLaMA-3.1-8B) = 115.6 t/s on M2-Max, up from 68.2 t/s for iq4_xs!