Bitnet #95
-
I was curious to see Microsoft's Bitnet performance on my M2-Max laptop.
The script warns that this is a debug build, but going to the […] Here is what I get with this repo:
22X (!!!) difference in prompt processing speed. 2.8X difference in token generation (TG) speed. TG is memory bound, so let's check what we get with just 1 thread. First theirs (be patient if you try it):
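A sketch of what such a single-threaded run looks like; the model path and file name are assumptions, not quoted from the post. Microsoft's repo drives the benchmark through the wrapper script shown in their README:

```bash
# Single-threaded benchmark via Microsoft's wrapper script
# (model path/file name are placeholders)
python utils/e2e_benchmark.py -m models/bitnet_b1_58-3B/ggml-model-i2_s.gguf -p 512 -n 128 -t 1
```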
Then ours:
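Again as a sketch (the model file name is assumed), the same single-threaded measurement with this repo's llama-bench:

```bash
# One thread, same prompt and generation lengths as above
./build/bin/llama-bench -m bitnet-3b-iq2_bn.gguf -p 512 -n 128 -t 1
```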
Aha. 12.8X. Perhaps they did not turn on […]
Oops. Perhaps […]
Arghh. Comment out the […]
Running […]
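If the debug-build warning above is the culprit, the usual fix is to reconfigure CMake for an optimized build. These are generic CMake commands, not ones quoted from the original post:

```bash
# Reconfigure and rebuild with compiler optimizations enabled
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```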
-
OK, here is an apples-to-apples performance comparison on my M2-Max laptop between Microsoft's implementation and this repo: […]
The difference in performance decreases with model size, but that's just a matter of memory bandwidth saturation for […]
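One way to keep such a comparison apples-to-apples is to fix the workload and sweep only the model. A minimal sketch with this repo's llama-bench; the model file names are placeholders:

```bash
# Identical PP/TG workload across model sizes; only the model file varies
for m in bitnet-700m-iq2_bn.gguf bitnet-1.3b-iq2_bn.gguf bitnet-3b-iq2_bn.gguf; do
    ./build/bin/llama-bench -m "$m" -p 512 -n 128 -t 8
done
```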
-
A Microsoft team has released CPU inference code for 1.58-bit Bitnets. The repo, based 100% on llama.cpp and only adding Bitnet CPU kernels (ARM_NEON, AVX2), has 2.1k stars as of this writing. As per @Dampfinchen, "this is just insanity". Well, here we have had Bitnet inference for a while. For CPU and GPU. Faster than Microsoft's by quite some margin.
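To try it here, a minimal sketch, assuming a Bitnet model already converted to f16 GGUF and the IQ2_BN (2.0 bpw) Bitnet quantization type this repo provides; file names and the prompt are placeholders:

```bash
# Quantize the f16 model to the Bitnet-specific 2-bit type, then generate
./build/bin/llama-quantize bitnet-3b-f16.gguf bitnet-3b-iq2_bn.gguf iq2_bn
./build/bin/llama-cli -m bitnet-3b-iq2_bn.gguf -n 900 -p "Write an essay about ..."
```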
There is a screen recording in their repo demoing the 3.3B Bitnet model writing a 900-token essay and achieving 71 t/s on M2 Ultra. Here is a screen recording from my M2-Max laptop (~1/2 the computing power and memory bandwidth of M2 Ultra) getting 74 t/s on the same prompt.
m2_max_cpu.mp4
And here it is running on the M2-Max 30-core GPU
m2_max_gpu.mp4
Finally, here running on RTX-4080
cuda.mp4
The prompt is very short (9 tokens), but it is still worth noting that Microsoft's implementation processes the prompt at a rate of 85 t/s, while here we get 157 t/s with half the computing power. That is about 1.85X faster in absolute terms, or roughly 3.7X once normalized for compute.