CPU beating GPU in token generation speed #18
Replies: 2 comments
---
Now that we have an efficient Flash Attention (FA) implementation on the CPU via PR #32, we can again compare CPU and GPU performance for this tiny 99M parameter model. We get
TG speed is now about the same, which is still quite remarkable. FA has improved CPU prompt processing speed by almost 50% and TG speed by 22%.
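The percentage figures quoted in this thread are plain throughput ratios. As a minimal sketch of the arithmetic (the tokens/s values below are made-up placeholders, not the actual benchmark results):

```python
def improvement(before_tps: float, after_tps: float) -> float:
    """Percent improvement in throughput (tokens/s) from before to after."""
    return (after_tps / before_tps - 1.0) * 100.0

# Placeholder numbers, chosen only to illustrate the calculation:
print(f"PP: {improvement(200.0, 298.0):.0f}% faster")  # ~49%, i.e. "almost 50%"
print(f"TG: {improvement(100.0, 122.0):.0f}% faster")  # 22%
```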
---
With PR #42 we get this:
That is, a 56% improvement for PP and a 26% improvement for TG since the original post from Aug 13! I see PR-8151, which provides dedicated quantization for the TriLM ternary models in mainline. Our version is 2.44X faster for PP and 35% faster for TG.
---
The TriLM ternary models are available in various sizes, so I was curious to look into prompt processing (PP) and token generation (TG) speed when the model is small enough to fit in the CPU cache. I have a Ryzen-7950X CPU with 64 MiB of L3 cache, and the 99M parameter TriLM model is 46 MiB when quantized with `IQ2_TN`. So, without further ado, let's look at a comparison between the Ryzen-7950X and an RTX-4080 in this case:

The GPU is still much faster than the CPU for prompt processing (although the difference, which is typically a factor of ~30 between this specific GPU and CPU, has shrunk to just a factor of 13), but now the CPU beats the GPU in TG speed!
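As a sanity check on the "fits in L3" claim, here is a small sketch computing the effective bits per weight and comparing the quantized model size against the 64 MiB cache, using the numbers from the text above (the average comes out well above 2 bits per weight, presumably because tensors such as the token embeddings are not stored ternary):

```python
MIB = 1024**2
params = 99e6            # 99M-parameter TriLM model
model_bytes = 46 * MIB   # size when quantized with IQ2_TN
l3_bytes = 64 * MIB      # Ryzen-7950X L3 cache

bits_per_weight = model_bytes * 8 / params
print(f"effective bpw: {bits_per_weight:.2f}")   # ~3.9 bpw overall
print(f"fits in L3: {model_bytes < l3_bytes}")   # True
```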
I also have an M2-Max laptop (the version with a 30-core GPU). Here is what we get:
Here too the GPU is faster for PP (but only 5X faster), while the CPU wipes the floor with the GPU for TG, beating it by close to 2X using all 8 threads, and by 1.5X with just 2 threads!
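A rough roofline argument shows why cache residency matters so much for TG: generating one token streams essentially the whole weight tensor once, so tokens/s is capped at bandwidth divided by model size. The sketch below uses illustrative bandwidth assumptions (not measurements) to show how serving the weights from L3 rather than DRAM raises the ceiling by an order of magnitude:

```python
MODEL_BYTES = 46 * 1024**2  # 99M-param TriLM quantized with IQ2_TN

def tg_upper_bound(bandwidth_bytes_per_s: float) -> float:
    """Tokens/s ceiling if each generated token reads the full weight set once."""
    return bandwidth_bytes_per_s / MODEL_BYTES

ddr5 = 60e9   # assumed ~60 GB/s dual-channel DDR5 (illustrative)
l3 = 1000e9   # assumed ~1 TB/s aggregate L3 bandwidth (illustrative)

print(f"from DRAM: ~{tg_upper_bound(ddr5):.0f} tok/s ceiling")
print(f"from L3:   ~{tg_upper_bound(l3):.0f} tok/s ceiling")
```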