Commit

Add benchmark results and readme
skeskinen committed Apr 27, 2023
1 parent 9291207 commit 24f2bc2
Showing 39 changed files with 2,386 additions and 97 deletions.
97 changes: 97 additions & 0 deletions README.md
@@ -0,0 +1,97 @@
# bert.cpp

[ggml](https://github.com/ggerganov/ggml) inference of the BERT neural net architecture, with pooling and normalization from [SentenceTransformers (sbert.net)](https://sbert.net/).
High-quality sentence embeddings in pure C++ (or C).

## Description
The main goal of `bert.cpp` is to run the BERT model using 4-bit integer quantization on CPU; a simplified sketch of the quantization idea is shown after the feature list.

* Plain C/C++ implementation without dependencies
* Inherits support for various architectures from ggml (x86 with AVX2, ARM, etc.)
* Choose your model size from 32/16/4 bits per model weight
    * all-MiniLM-L6-v2 with 4-bit quantization is only 14 MB; inference RAM usage depends on the length of the input
* Sample C++ server over a TCP socket and a Python test client
* Benchmarks to validate correctness and speed of inference
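
As a rough illustration of what storing 4 bits per weight means, here is a simplified numpy sketch of symmetric block-wise quantization. It is only meant to convey the idea; the actual ggml `q4_0`/`q4_1` block layouts and scaling details differ from this toy version.

```python
import numpy as np

def quantize_4bit_blocks(weights: np.ndarray, block_size: int = 32):
    """Toy symmetric 4-bit block quantization (not the exact ggml q4_0/q4_1 layout)."""
    w = weights.reshape(-1, block_size)
    # One scale per block, mapping the largest magnitude to the integer range [-7, 7].
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(w / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_4bit_blocks(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(shape)

# Quantize a fake weight matrix and look at the reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(size=(384, 384)).astype(np.float32)
q, s = quantize_4bit_blocks(w)
w_hat = dequantize_4bit_blocks(q, s, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```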

## Limitations & TODO
* The tokenizer doesn't correctly handle Asian scripts (CJK, possibly others)
* Inputs longer than the ctx size are not truncated. If you are making embeddings for longer texts, make sure to truncate them yourself.
* bert.cpp doesn't respect the tokenizer, pooling, or normalization settings from the model card:
    * All inputs are lowercased and trimmed
    * All outputs are mean pooled and normalized (see the sketch after this list)
* The API is in C++ (it uses types from std::)
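
For reference, the pooling and normalization applied to every output are equivalent in spirit to this small numpy sketch (the real implementation is in C++; the function below is only an illustration):

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """token_embeddings: (n_tokens, n_embd); attention_mask: (n_tokens,), 1 for real tokens."""
    mask = attention_mask[:, None].astype(np.float32)
    # Mean over the non-padding tokens only.
    pooled = (token_embeddings * mask).sum(axis=0) / np.maximum(mask.sum(), 1e-9)
    # L2-normalize so that cosine similarity reduces to a dot product.
    return pooled / np.maximum(np.linalg.norm(pooled), 1e-12)
```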

## Usage

### Build
```sh
mkdir build
cd build
cmake ..
make
cd ..
```
### Download models
```sh
pip3 install -r requirements.txt
# python3 models/download-ggml.py list_models
python3 models/download-ggml.py download all-MiniLM-L6-v2 q4_0
```
### Start sample server
```sh
./build/bin/server -m models/all-MiniLM-L6-v2/ggml-model-q4_0.bin

# bert_model_load: loading model from 'models/all-MiniLM-L6-v2/ggml-model-q4_0.bin' - please wait ...
# bert_model_load: n_vocab = 30522
# bert_model_load: n_ctx = 512
# bert_model_load: n_embd = 384
# bert_model_load: n_intermediate = 1536
# bert_model_load: n_head = 12
# bert_model_load: n_layer = 6
# bert_model_load: f16 = 2
# bert_model_load: ggml ctx size = 13.57 MB
# bert_model_load: ............ done
# bert_model_load: model size = 13.55 MB / num tensors = 101
# Server running on port 8080 with 4 threads
```
### Run sample client




## Benchmarks
Running MTEB (Massive Text Embedding Benchmark) with bert.cpp vs. [sbert](https://sbert.net/) (CPU mode) gives comparable results between the two: quantization has minimal effect on accuracy, and eval time is similar to or better than sbert run with batch_size=1 (bert.cpp doesn't support batching).

See [benchmarks](benchmarks) for more info.
### all-MiniLM-L6-v2
| Data Type | STSBenchmark | eval time (s) | EmotionClassification | eval time (s) |
|-----------|-----------|------------|-----------|------------|
| f16 | 0.8201 | 7.52 | 0.4085 | 12.25 |
| f32 | 0.8201 | 8.22 | 0.4082 | 13.65 |
| q4_0 | 0.8175 | 6.87 | 0.3911 | 11.22 |
| q4_1 | 0.8214 | 13.26 | 0.4015 | 21.37 |
| sbert | 0.8203 | 2.85 | 0.4085 | 7.28 |
| sbert-batchless | 0.8203 | 12.48 | 0.4085 | 15.27 |


### all-MiniLM-L12-v2
| Data Type | STSBenchmark | eval time (s) | EmotionClassification | eval time (s) |
|-----------|-----------|------------|-----------|------------|
| f16 | 0.8306 | 14.66 | 0.4119 | 23.20 |
| f32 | 0.8306 | 16.18 | 0.4117 | 25.79 |
| q4_0 | 0.8310 | 13.31 | 0.4183 | 21.54 |
| q4_1 | 0.8202 | 25.48 | 0.4010 | 41.75 |
| sbert | 0.8309 | 4.98 | 0.4117 | 10.45 |
| sbert-batchless | 0.8309 | 22.22 | 0.4117 | 26.53 |

### bert-base-uncased
bert-base-uncased is not a very good sentence-embedding model, but it is included to show that bert.cpp correctly runs models that are not from SentenceTransformers. Technically, any Hugging Face model with the architecture `BertModel` or `BertForMaskedLM` should work; a quick way to check a model's declared architecture is sketched after the table below.
| Data Type | STSBenchmark | eval time (s) | EmotionClassification | eval time (s) |
|-----------|-----------|------------|-----------|------------|
| f16 | 0.4739 | 37.68 | 0.3361 | 61.54 |
| f32 | 0.4738 | 57.90 | 0.3361 | 91.37 |
| q4_0 | 0.4940 | 39.21 | 0.3375 | 65.11 |
| q4_1 | 0.4681 | 85.11 | 0.3268 | 144.11 |
| sbert | 0.4729 | 16.71 | 0.3527 | 30.03 |
| sbert-batchless | 0.4729 | 67.12 | 0.3526 | 77.83 |
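
To check which architecture a given Hugging Face model declares before trying to convert it, you can inspect its config with the `transformers` library; a minimal sketch (the model id below is just an example):

```python
from transformers import AutoConfig

# Any Hugging Face model id works here; bert-base-uncased is just an example.
config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.architectures)  # e.g. ['BertForMaskedLM']
```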

41 changes: 41 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,41 @@
Use `run_mteb.py` to run the MTEB embeddings benchmark for each model. The script starts the C++ server for each model size, so make sure you have all 4 sizes in your models directory. It also runs the benchmarks with the SentenceTransformers library to get baseline results.

The ggml version doesn't support batching, so it is at a disadvantage compared to sbert, where all computations are done in batches of 64 input sentences. But if batching is not possible in your application (e.g. the inputs come from a user one at a time), the batchless performance is more relevant. sbert-batchless runs the benchmark with the SentenceTransformers library and `batch_size=1`, roughly as in the sketch below.
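
For illustration, a baseline like sbert-batchless could be set up roughly as follows. This is only a sketch of the idea, not the actual `run_mteb.py`; the wrapper class, model id, and output folder are assumptions for the example.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

class BatchlessModel:
    """Wraps a SentenceTransformer and forces batch_size=1 on every encode call."""
    def __init__(self, name: str):
        self.model = SentenceTransformer(name)

    def encode(self, sentences, **kwargs):
        kwargs["batch_size"] = 1  # override whatever batch size the evaluator requests
        return self.model.encode(sentences, **kwargs)

# Example model id and output folder; adjust to your setup.
evaluation = MTEB(tasks=["STSBenchmark", "EmotionClassification"])
evaluation.run(BatchlessModel("all-MiniLM-L6-v2"),
               output_folder="results/all-MiniLM-L6-v2_sbert-batchless")
```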

Note that the sbert results here were measured on CPU; sbert also supports GPU inference, which would be much faster.

Use `print_tables.py` to format the results like the following tables.
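
A typical invocation, assuming the script is run from the `benchmarks` directory so that the relative `results` path resolves (the output file name is just an example):

```sh
python3 print_tables.py > tables.md
```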

### all-MiniLM-L6-v2
| Data Type | STSBenchmark | eval time (s) | EmotionClassification | eval time (s) |
|-----------|-----------|------------|-----------|------------|
| f16 | 0.8201 | 7.52 | 0.4085 | 12.25 |
| f32 | 0.8201 | 8.22 | 0.4082 | 13.65 |
| q4_0 | 0.8175 | 6.87 | 0.3911 | 11.22 |
| q4_1 | 0.8214 | 13.26 | 0.4015 | 21.37 |
| sbert | 0.8203 | 2.85 | 0.4085 | 7.28 |
| sbert-batchless | 0.8203 | 12.48 | 0.4085 | 15.27 |


### all-MiniLM-L12-v2
| Data Type | STSBenchmark | eval time (s) | EmotionClassification | eval time (s) |
|-----------|-----------|------------|-----------|------------|
| f16 | 0.8306 | 14.66 | 0.4119 | 23.20 |
| f32 | 0.8306 | 16.18 | 0.4117 | 25.79 |
| q4_0 | 0.8310 | 13.31 | 0.4183 | 21.54 |
| q4_1 | 0.8202 | 25.48 | 0.4010 | 41.75 |
| sbert | 0.8309 | 4.98 | 0.4117 | 10.45 |
| sbert-batchless | 0.8309 | 22.22 | 0.4117 | 26.53 |


### bert-base-uncased
For bert-base-uncased, the pooling and normalization differ from the ones used in the actual model. I think that's why ggml scores better than sbert on STSBenchmark and worse on EmotionClassification.
| Data Type | STSBenchmark | eval time (s) | EmotionClassification | eval time (s) |
|-----------|-----------|------------|-----------|------------|
| f16 | 0.4739 | 37.68 | 0.3361 | 61.54 |
| f32 | 0.4738 | 57.90 | 0.3361 | 91.37 |
| q4_0 | 0.4940 | 39.21 | 0.3375 | 65.11 |
| q4_1 | 0.4681 | 85.11 | 0.3268 | 144.11 |
| sbert | 0.4729 | 16.71 | 0.3527 | 30.03 |
| sbert-batchless | 0.4729 | 67.12 | 0.3526 | 77.83 |

62 changes: 62 additions & 0 deletions benchmarks/print_tables.py
@@ -0,0 +1,62 @@
import os
import json

RESULTS_DIR = "results"
BENCHMARKS = ["STSBenchmark", "EmotionClassification"]
DATA_TYPES = ["f16", "f32", "q4_0", "q4_1", "sbert", "sbert-batchless"]

# Dictionary to store the results, keyed by model name
results_dict = {}

# Loop over all the result directories and extract the model names
# (directory names look like "<model>_<data_type>")
models = set()
for dir_name in os.listdir(RESULTS_DIR):
    m = dir_name.split("_")[0]
    models.add(m)

def extract_results(test_data):
    # Use the Spearman cosine-similarity score if present (STS tasks),
    # otherwise fall back to the task's main_score (classification tasks).
    res = {"time": test_data["evaluation_time"]}
    if "cos_sim" in test_data and "spearman" in test_data["cos_sim"]:
        res['score'] = test_data["cos_sim"]["spearman"]
    elif "main_score" in test_data:
        res['score'] = test_data["main_score"]
    else:
        print(f"can't extract results {test_data}")
    return res

for model in models:
    model_results = {}
    for data_type in DATA_TYPES:
        dir_name = f"{RESULTS_DIR}/{model}_{data_type}"
        if not os.path.isdir(dir_name):
            print(f"{dir_name} doesn't exist!")
            continue
        data_type_results = {}
        for benchmark in BENCHMARKS:
            results_path = os.path.join(dir_name, f"{benchmark}.json")
            with open(results_path, "r") as f:
                results = json.load(f)

            data_type_results[benchmark] = extract_results(results['test'])

        model_results[data_type] = data_type_results
    results_dict[model] = model_results

# Print the results as an .md table for each model
for model, model_results in results_dict.items():
    print(f"### {model}")
    print("| Data Type | ", end="")
    for benchmark in BENCHMARKS:
        print(f"{benchmark} | eval time (s) | ", end="")
    print()
    print("|-----------|", end="")
    for _ in BENCHMARKS:
        print("-----------|------------|", end="")
    print()
    for data_type in DATA_TYPES:
        if data_type not in model_results:
            continue  # results directory for this data type was missing
        print(f"| {data_type} | ", end="")
        for benchmark in BENCHMARKS:
            results = model_results[data_type][benchmark]
            print(f"{results['score']:.4f} | {results['time']:.2f} | ", end="")
        print()
    print("\n")
2 changes: 2 additions & 0 deletions benchmarks/requirements.txt
@@ -0,0 +1,2 @@
mteb
sentence_transformers
@@ -0,0 +1,13 @@
{
"dataset_revision": "4f58c6b202a23cf9a4da393831edf4f9183cad37",
"mteb_dataset_name": "EmotionClassification",
"mteb_version": "1.0.2",
"test": {
"accuracy": 0.4119499999999999,
"accuracy_stderr": 0.025105228539091216,
"evaluation_time": 23.2,
"f1": 0.36981414412336655,
"f1_stderr": 0.02094871267575925,
"main_score": 0.4119499999999999
}
}
20 changes: 20 additions & 0 deletions benchmarks/results/all-MiniLM-L12-v2_f16/STSBenchmark.json
@@ -0,0 +1,20 @@
{
"dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
"mteb_dataset_name": "STSBenchmark",
"mteb_version": "1.0.2",
"test": {
"cos_sim": {
"pearson": 0.8374641693018909,
"spearman": 0.8305896485864188
},
"euclidean": {
"pearson": 0.8350326075472255,
"spearman": 0.8305896485864188
},
"evaluation_time": 14.66,
"manhattan": {
"pearson": 0.8351482035115159,
"spearman": 0.8308811375478211
}
}
}
@@ -0,0 +1,13 @@
{
"dataset_revision": "4f58c6b202a23cf9a4da393831edf4f9183cad37",
"mteb_dataset_name": "EmotionClassification",
"mteb_version": "1.0.2",
"test": {
"accuracy": 0.41174999999999995,
"accuracy_stderr": 0.02517364693484041,
"evaluation_time": 25.79,
"f1": 0.36964632574873646,
"f1_stderr": 0.02101215083642815,
"main_score": 0.41174999999999995
}
}
20 changes: 20 additions & 0 deletions benchmarks/results/all-MiniLM-L12-v2_f32/STSBenchmark.json
@@ -0,0 +1,20 @@
{
"dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
"mteb_dataset_name": "STSBenchmark",
"mteb_version": "1.0.2",
"test": {
"cos_sim": {
"pearson": 0.837465240168285,
"spearman": 0.8305951440128178
},
"euclidean": {
"pearson": 0.835033461743598,
"spearman": 0.8305951440128178
},
"evaluation_time": 16.18,
"manhattan": {
"pearson": 0.8351470693555814,
"spearman": 0.8308846560867743
}
}
}
@@ -0,0 +1,13 @@
{
"dataset_revision": "4f58c6b202a23cf9a4da393831edf4f9183cad37",
"mteb_dataset_name": "EmotionClassification",
"mteb_version": "1.0.2",
"test": {
"accuracy": 0.4183,
"accuracy_stderr": 0.021613884426451443,
"evaluation_time": 21.54,
"f1": 0.37624466895950653,
"f1_stderr": 0.01743903163262402,
"main_score": 0.4183
}
}
20 changes: 20 additions & 0 deletions benchmarks/results/all-MiniLM-L12-v2_q4_0/STSBenchmark.json
@@ -0,0 +1,20 @@
{
"dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
"mteb_dataset_name": "STSBenchmark",
"mteb_version": "1.0.2",
"test": {
"cos_sim": {
"pearson": 0.8365276911292119,
"spearman": 0.8309588798492489
},
"euclidean": {
"pearson": 0.8372279220677411,
"spearman": 0.8309588798492489
},
"evaluation_time": 13.31,
"manhattan": {
"pearson": 0.8368693263995872,
"spearman": 0.8306785947771824
}
}
}
@@ -0,0 +1,13 @@
{
"dataset_revision": "4f58c6b202a23cf9a4da393831edf4f9183cad37",
"mteb_dataset_name": "EmotionClassification",
"mteb_version": "1.0.2",
"test": {
"accuracy": 0.40095000000000003,
"accuracy_stderr": 0.02566266743734953,
"evaluation_time": 41.75,
"f1": 0.3626628620864726,
"f1_stderr": 0.018959571169492463,
"main_score": 0.40095000000000003
}
}
20 changes: 20 additions & 0 deletions benchmarks/results/all-MiniLM-L12-v2_q4_1/STSBenchmark.json
@@ -0,0 +1,20 @@
{
"dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
"mteb_dataset_name": "STSBenchmark",
"mteb_version": "1.0.2",
"test": {
"cos_sim": {
"pearson": 0.8300376055771063,
"spearman": 0.8202182350295162
},
"euclidean": {
"pearson": 0.8281548958602518,
"spearman": 0.8202182350295162
},
"evaluation_time": 25.48,
"manhattan": {
"pearson": 0.8272951345188557,
"spearman": 0.819294554414274
}
}
}
@@ -0,0 +1,13 @@
{
"dataset_revision": "4f58c6b202a23cf9a4da393831edf4f9183cad37",
"mteb_dataset_name": "EmotionClassification",
"mteb_version": "1.0.2",
"test": {
"accuracy": 0.4117,
"accuracy_stderr": 0.025096015620014265,
"evaluation_time": 26.53,
"f1": 0.3696192637393597,
"f1_stderr": 0.020941989472486138,
"main_score": 0.4117
}
}
@@ -0,0 +1,20 @@
{
"dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
"mteb_dataset_name": "STSBenchmark",
"mteb_version": "1.0.2",
"test": {
"cos_sim": {
"pearson": 0.837594560292421,
"spearman": 0.8308938533093635
},
"euclidean": {
"pearson": 0.8355879778009024,
"spearman": 0.8308938533093635
},
"evaluation_time": 22.22,
"manhattan": {
"pearson": 0.8356896375814314,
"spearman": 0.8311516183577004
}
}
}