CPU beating GPU in token generation speed #18
Replies: 2 comments
---
Now that we have an efficient Flash Attention (FA) implementation on the CPU via PR #32, we can again compare CPU and GPU performance for this tiny 99M parameter model. We get
TG speed is now about the same, which is still quite remarkable. FA has improved CPU prompt processing speed by almost 50% and TG speed by 22%.
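The percentage figures quoted in this thread are plain throughput ratios. As a minimal sketch of the arithmetic (the tokens/s values below are made-up placeholders, not the actual benchmark results):

```python
def improvement(before_tps: float, after_tps: float) -> float:
    """Percent improvement in throughput (tokens/s) from before to after."""
    return (after_tps / before_tps - 1.0) * 100.0

# Placeholder numbers, chosen only to illustrate the calculation:
print(f"PP: {improvement(200.0, 298.0):.0f}% faster")  # ~49%, i.e. "almost 50%"
print(f"TG: {improvement(100.0, 122.0):.0f}% faster")  # 22%
```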
---
With PR #42 we get this:
That is, a 56% improvement for PP and a 26% improvement for TG since the original post from Aug 13! I see PR-8151, which provides dedicated quantization for the TriLM ternary models in mainline. Our version is 2.44X faster for PP and 35% faster for TG.
---
The TriLM ternary models are available in various sizes, so I was curious to look into prompt processing (PP) and token generation (TG) speed when the model is small enough to fit in the CPU cache. I have a Ryzen-7950X CPU with 64 MiB of L3 cache, and the 99M parameter TriLM model is 46 MiB when quantized with `IQ2_TN`. So, without further ado, let's look at a comparison between the Ryzen-7950X and an RTX-4080 in this case:

The GPU is still much faster than the CPU for prompt processing (although the difference, which is typically a factor of ~30 between this specific GPU and CPU, has shrunk to just a factor of 13), but now the CPU beats the GPU in TG speed!
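As a sanity check on the "fits in L3" claim, here is a small sketch computing the effective bits per weight and comparing the quantized model size against the 64 MiB cache, using the numbers from the text above (the average comes out well above 2 bits per weight, presumably because tensors such as the token embeddings are not stored ternary):

```python
MIB = 1024**2
params = 99e6            # 99M-parameter TriLM model
model_bytes = 46 * MIB   # size when quantized with IQ2_TN
l3_bytes = 64 * MIB      # Ryzen-7950X L3 cache

bits_per_weight = model_bytes * 8 / params
print(f"effective bpw: {bits_per_weight:.2f}")   # ~3.9 bpw overall
print(f"fits in L3: {model_bytes < l3_bytes}")   # True
```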
I also have an M2-Max laptop (the version with a 30-core GPU). Here is what we get:
Here too the GPU is faster for PP (but only 5X faster), while the CPU wipes the floor with the GPU for TG, beating it by close to 2X using all 8 threads, and by 1.5X with just 2 threads!
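A rough roofline argument shows why cache residency matters so much for TG: generating one token streams essentially the whole weight tensor once, so tokens/s is capped at bandwidth divided by model size. The sketch below uses illustrative bandwidth assumptions (not measurements) to show how serving the weights from L3 rather than DRAM raises the ceiling by an order of magnitude:

```python
MODEL_BYTES = 46 * 1024**2  # 99M-param TriLM quantized with IQ2_TN

def tg_upper_bound(bandwidth_bytes_per_s: float) -> float:
    """Tokens/s ceiling if each generated token reads the full weight set once."""
    return bandwidth_bytes_per_s / MODEL_BYTES

ddr5 = 60e9   # assumed ~60 GB/s dual-channel DDR5 (illustrative)
l3 = 1000e9   # assumed ~1 TB/s aggregate L3 bandwidth (illustrative)

print(f"from DRAM: ~{tg_upper_bound(ddr5):.0f} tok/s ceiling")
print(f"from L3:   ~{tg_upper_bound(l3):.0f} tok/s ceiling")
```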