Bitnet #95
-
I was curious to see Microsoft's Bitnet performance on my M2-Max laptop.
The script warns that this is a debug build, but going to the […] Here is what I get with this repo:
22X (!!!) difference in prompt processing speed. 2.8X difference in token generation (TG) speed. TG is memory bound, so let's check what we get with just 1 thread. First theirs (be patient if you try it):
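A sketch of what such a single-threaded run looks like; the model path and file name are assumptions, not quoted from the post. Microsoft's repo drives the benchmark through the wrapper script shown in their README:

```bash
# Single-threaded benchmark via Microsoft's wrapper script
# (model path/file name are placeholders)
python utils/e2e_benchmark.py -m models/bitnet_b1_58-3B/ggml-model-i2_s.gguf -p 512 -n 128 -t 1
```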
Then ours:
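Again as a sketch (the model file name is assumed), the same single-threaded measurement with this repo's llama-bench:

```bash
# One thread, same prompt and generation lengths as above
./build/bin/llama-bench -m bitnet-3b-iq2_bn.gguf -p 512 -n 128 -t 1
```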
Aha. 12.8X. Perhaps they did not turn on […]
Oops. Perhaps […]
Arghh. Comment out the […]
Running […]
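If the debug-build warning above is the culprit, the usual fix is to reconfigure CMake for an optimized build. These are generic CMake commands, not ones quoted from the original post:

```bash
# Reconfigure and rebuild with compiler optimizations enabled
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```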
-
OK, here is an apples-to-apples performance comparison on my M2-Max laptop between Microsoft's implementation and this repo: […]
The difference in performance decreases with model size, but that's just a matter of memory bandwidth saturation for […]
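One way to keep such a comparison apples-to-apples is to fix the workload and sweep only the model. A minimal sketch with this repo's llama-bench; the model file names are placeholders:

```bash
# Identical PP/TG workload across model sizes; only the model file varies
for m in bitnet-700m-iq2_bn.gguf bitnet-1.3b-iq2_bn.gguf bitnet-3b-iq2_bn.gguf; do
    ./build/bin/llama-bench -m "$m" -p 512 -n 128 -t 8
done
```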
-
A Microsoft team has released CPU inference code for 1.58-bit Bitnets. The repo, based 100% on llama.cpp and only adding Bitnet CPU kernels (ARM_NEON, AVX2), has 2.1k stars as of this writing. As per @Dampfinchen, "this is just insanity". Well, here we have had Bitnet inference for a while. For CPU and GPU. Faster than Microsoft's by quite some margin.
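To try it here, a minimal sketch, assuming a Bitnet model already converted to f16 GGUF and the IQ2_BN (2.0 bpw) Bitnet quantization type this repo provides; file names and the prompt are placeholders:

```bash
# Quantize the f16 model to the Bitnet-specific 2-bit type, then generate
./build/bin/llama-quantize bitnet-3b-f16.gguf bitnet-3b-iq2_bn.gguf iq2_bn
./build/bin/llama-cli -m bitnet-3b-iq2_bn.gguf -n 900 -p "Write an essay about ..."
```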
There is a screen recording in their repo demoing the 3.3B Bitnet model writing a 900-token essay and achieving 71 t/s on M2 Ultra. Here is a screen recording from my M2-Max laptop (~1/2 the computing power and memory bandwidth of M2 Ultra) getting 74 t/s on the same prompt.
m2_max_cpu.mp4
And here it is running on the M2-Max 30-core GPU
m2_max_gpu.mp4
Finally, here running on RTX-4080
cuda.mp4
The prompt is very short (9 tokens), but it is still worth noting that Microsoft's implementation processes the prompt at a rate of 85 t/s, while here we get 157 t/s with half the computing power. That is about 1.85X faster in absolute terms, or roughly 3.7X once normalized for compute.