Commit

Add benchmark results and readme
skeskinen committed Apr 27, 2023
1 parent 9291207 commit 24f2bc2
Showing 39 changed files with 2,386 additions and 97 deletions.
97 changes: 97 additions & 0 deletions README.md
@@ -0,0 +1,97 @@
# bert.cpp

[ggml](https://github.com/ggerganov/ggml) inference of the BERT neural net architecture, with pooling and normalization from [SentenceTransformers (sbert.net)](https://sbert.net/).
High-quality sentence embeddings in pure C++ (or C).

## Description
The main goal of `bert.cpp` is to run the BERT model using 4-bit integer quantization on CPU; a simplified sketch of the quantization idea is shown after the feature list.

* Plain C/C++ implementation without dependencies
* Inherits support for various architectures from ggml (x86 with AVX2, ARM, etc.)
* Choose your model size from 32/16/4 bits per model weight
    * all-MiniLM-L6-v2 with 4-bit quantization is only 14 MB; inference RAM usage depends on the length of the input
* Sample C++ server over a TCP socket and a Python test client
* Benchmarks to validate correctness and speed of inference
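
As a rough illustration of what storing 4 bits per weight means, here is a simplified numpy sketch of symmetric block-wise quantization. It is only meant to convey the idea; the actual ggml `q4_0`/`q4_1` block layouts and scaling details differ from this toy version.

```python
import numpy as np

def quantize_4bit_blocks(weights: np.ndarray, block_size: int = 32):
    """Toy symmetric 4-bit block quantization (not the exact ggml q4_0/q4_1 layout)."""
    w = weights.reshape(-1, block_size)
    # One scale per block, mapping the largest magnitude to the integer range [-7, 7].
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(w / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_4bit_blocks(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(shape)

# Quantize a fake weight matrix and look at the reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(size=(384, 384)).astype(np.float32)
q, s = quantize_4bit_blocks(w)
w_hat = dequantize_4bit_blocks(q, s, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```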

## Limitations & TODO
* The tokenizer doesn't correctly handle Asian scripts (CJK, possibly others)
* Inputs longer than the ctx size are not truncated. If you are making embeddings for longer texts, make sure to truncate them yourself.
* bert.cpp doesn't respect the tokenizer, pooling, or normalization settings from the model card:
    * All inputs are lowercased and trimmed
    * All outputs are mean pooled and normalized (see the sketch after this list)
* The API is in C++ (it uses types from std::)
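
For reference, the pooling and normalization applied to every output are equivalent in spirit to this small numpy sketch (the real implementation is in C++; the function below is only an illustration):

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """token_embeddings: (n_tokens, n_embd); attention_mask: (n_tokens,), 1 for real tokens."""
    mask = attention_mask[:, None].astype(np.float32)
    # Mean over the non-padding tokens only.
    pooled = (token_embeddings * mask).sum(axis=0) / np.maximum(mask.sum(), 1e-9)
    # L2-normalize so that cosine similarity reduces to a dot product.
    return pooled / np.maximum(np.linalg.norm(pooled), 1e-12)
```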

## Usage

### Build
```sh
mkdir build
cd build
cmake ..
make
cd ..
```
### Download models
```sh
pip3 install -r requirements.txt
# python3 models/download-ggml.py list_models
python3 models/download-ggml.py download all-MiniLM-L6-v2 q4_0
```
### Start sample server
```sh
./build/bin/server -m models/all-MiniLM-L6-v2/ggml-model-q4_0.bin

# bert_model_load: loading model from 'models/all-MiniLM-L6-v2/ggml-model-q4_0.bin' - please wait ...
# bert_model_load: n_vocab = 30522
# bert_model_load: n_ctx = 512
# bert_model_load: n_embd = 384
# bert_model_load: n_intermediate = 1536
# bert_model_load: n_head = 12
# bert_model_load: n_layer = 6
# bert_model_load: f16 = 2
# bert_model_load: ggml ctx size = 13.57 MB
# bert_model_load: ............ done
# bert_model_load: model size = 13.55 MB / num tensors = 101
# Server running on port 8080 with 4 threads
```
### Run sample client




## Benchmarks
Running MTEB (Massive Text Embedding Benchmark) with bert.cpp vs. [sbert](https://sbert.net/) (CPU mode) gives comparable results between the two: quantization has minimal effect on accuracy, and eval time is similar to or better than sbert run with batch_size=1 (bert.cpp doesn't support batching).

See [benchmarks](benchmarks) for more info.
### all-MiniLM-L6-v2
| Data Type | STSBenchmark | eval time (s) | EmotionClassification | eval time (s) |
|-----------|-----------|------------|-----------|------------|
| f16 | 0.8201 | 7.52 | 0.4085 | 12.25 |
| f32 | 0.8201 | 8.22 | 0.4082 | 13.65 |
| q4_0 | 0.8175 | 6.87 | 0.3911 | 11.22 |
| q4_1 | 0.8214 | 13.26 | 0.4015 | 21.37 |
| sbert | 0.8203 | 2.85 | 0.4085 | 7.28 |
| sbert-batchless | 0.8203 | 12.48 | 0.4085 | 15.27 |


### all-MiniLM-L12-v2
| Data Type | STSBenchmark | eval time (s) | EmotionClassification | eval time (s) |
|-----------|-----------|------------|-----------|------------|
| f16 | 0.8306 | 14.66 | 0.4119 | 23.20 |
| f32 | 0.8306 | 16.18 | 0.4117 | 25.79 |
| q4_0 | 0.8310 | 13.31 | 0.4183 | 21.54 |
| q4_1 | 0.8202 | 25.48 | 0.4010 | 41.75 |
| sbert | 0.8309 | 4.98 | 0.4117 | 10.45 |
| sbert-batchless | 0.8309 | 22.22 | 0.4117 | 26.53 |

### bert-base-uncased
bert-base-uncased is not a very good sentence-embedding model, but it is included to show that bert.cpp correctly runs models that are not from SentenceTransformers. Technically, any Hugging Face model with the architecture `BertModel` or `BertForMaskedLM` should work; a quick way to check a model's declared architecture is sketched after the table below.
| Data Type | STSBenchmark | eval time (s) | EmotionClassification | eval time (s) |
|-----------|-----------|------------|-----------|------------|
| f16 | 0.4739 | 37.68 | 0.3361 | 61.54 |
| f32 | 0.4738 | 57.90 | 0.3361 | 91.37 |
| q4_0 | 0.4940 | 39.21 | 0.3375 | 65.11 |
| q4_1 | 0.4681 | 85.11 | 0.3268 | 144.11 |
| sbert | 0.4729 | 16.71 | 0.3527 | 30.03 |
| sbert-batchless | 0.4729 | 67.12 | 0.3526 | 77.83 |
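
To check which architecture a given Hugging Face model declares before trying to convert it, you can inspect its config with the `transformers` library; a minimal sketch (the model id below is just an example):

```python
from transformers import AutoConfig

# Any Hugging Face model id works here; bert-base-uncased is just an example.
config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.architectures)  # e.g. ['BertForMaskedLM']
```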

41 changes: 41 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,41 @@
Use `run_mteb.py` to run the MTEB embeddings benchmark for each model. The script starts the C++ server for each model size, so make sure you have all 4 sizes in your models directory. It also runs the benchmarks with the SentenceTransformers library to get baseline results.

The ggml version doesn't support batching, so it is at a disadvantage compared to sbert, where all computations are done in batches of 64 input sentences. But if batching is not possible in your application (e.g. the inputs come from a user one at a time), the batchless performance is more relevant. sbert-batchless runs the benchmark with the SentenceTransformers library and `batch_size=1`, roughly as in the sketch below.
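
For illustration, a baseline like sbert-batchless could be set up roughly as follows. This is only a sketch of the idea, not the actual `run_mteb.py`; the wrapper class, model id, and output folder are assumptions for the example.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

class BatchlessModel:
    """Wraps a SentenceTransformer and forces batch_size=1 on every encode call."""
    def __init__(self, name: str):
        self.model = SentenceTransformer(name)

    def encode(self, sentences, **kwargs):
        kwargs["batch_size"] = 1  # override whatever batch size the evaluator requests
        return self.model.encode(sentences, **kwargs)

# Example model id and output folder; adjust to your setup.
evaluation = MTEB(tasks=["STSBenchmark", "EmotionClassification"])
evaluation.run(BatchlessModel("all-MiniLM-L6-v2"),
               output_folder="results/all-MiniLM-L6-v2_sbert-batchless")
```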

Note that the sbert results here were measured on CPU; sbert also supports GPU inference, which would be much faster.

Use `print_tables.py` to format the results like the following tables.
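
A typical invocation, assuming the script is run from the `benchmarks` directory so that the relative `results` path resolves (the output file name is just an example):

```sh
python3 print_tables.py > tables.md
```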

### all-MiniLM-L6-v2
| Data Type | STSBenchmark | eval time (s) | EmotionClassification | eval time (s) |
|-----------|-----------|------------|-----------|------------|
| f16 | 0.8201 | 7.52 | 0.4085 | 12.25 |
| f32 | 0.8201 | 8.22 | 0.4082 | 13.65 |
| q4_0 | 0.8175 | 6.87 | 0.3911 | 11.22 |
| q4_1 | 0.8214 | 13.26 | 0.4015 | 21.37 |
| sbert | 0.8203 | 2.85 | 0.4085 | 7.28 |
| sbert-batchless | 0.8203 | 12.48 | 0.4085 | 15.27 |


### all-MiniLM-L12-v2
| Data Type | STSBenchmark | eval time (s) | EmotionClassification | eval time (s) |
|-----------|-----------|------------|-----------|------------|
| f16 | 0.8306 | 14.66 | 0.4119 | 23.20 |
| f32 | 0.8306 | 16.18 | 0.4117 | 25.79 |
| q4_0 | 0.8310 | 13.31 | 0.4183 | 21.54 |
| q4_1 | 0.8202 | 25.48 | 0.4010 | 41.75 |
| sbert | 0.8309 | 4.98 | 0.4117 | 10.45 |
| sbert-batchless | 0.8309 | 22.22 | 0.4117 | 26.53 |


### bert-base-uncased
For bert-base-uncased, the pooling and normalization differ from the ones used in the actual model. I think that's why ggml scores better than sbert on STSBenchmark and worse on EmotionClassification.
| Data Type | STSBenchmark | eval time (s) | EmotionClassification | eval time (s) |
|-----------|-----------|------------|-----------|------------|
| f16 | 0.4739 | 37.68 | 0.3361 | 61.54 |
| f32 | 0.4738 | 57.90 | 0.3361 | 91.37 |
| q4_0 | 0.4940 | 39.21 | 0.3375 | 65.11 |
| q4_1 | 0.4681 | 85.11 | 0.3268 | 144.11 |
| sbert | 0.4729 | 16.71 | 0.3527 | 30.03 |
| sbert-batchless | 0.4729 | 67.12 | 0.3526 | 77.83 |

62 changes: 62 additions & 0 deletions benchmarks/print_tables.py
@@ -0,0 +1,62 @@
import os
import json

RESULTS_DIR = "results"
BENCHMARKS = ["STSBenchmark", "EmotionClassification"]
DATA_TYPES = ["f16", "f32", "q4_0", "q4_1", "sbert", "sbert-batchless"]

# Dictionary to store the results, keyed by model name
results_dict = {}

# Loop over all the result directories and extract the model names
# (directory names look like "<model>_<data_type>")
models = set()
for dir_name in os.listdir(RESULTS_DIR):
    m = dir_name.split("_")[0]
    models.add(m)

def extract_results(test_data):
    # Use the Spearman cosine-similarity score if present (STS tasks),
    # otherwise fall back to the task's main_score (classification tasks).
    res = {"time": test_data["evaluation_time"]}
    if "cos_sim" in test_data and "spearman" in test_data["cos_sim"]:
        res['score'] = test_data["cos_sim"]["spearman"]
    elif "main_score" in test_data:
        res['score'] = test_data["main_score"]
    else:
        print(f"can't extract results {test_data}")
    return res

for model in models:
    model_results = {}
    for data_type in DATA_TYPES:
        dir_name = f"{RESULTS_DIR}/{model}_{data_type}"
        if not os.path.isdir(dir_name):
            print(f"{dir_name} doesn't exist!")
            continue
        data_type_results = {}
        for benchmark in BENCHMARKS:
            results_path = os.path.join(dir_name, f"{benchmark}.json")
            with open(results_path, "r") as f:
                results = json.load(f)

            data_type_results[benchmark] = extract_results(results['test'])

        model_results[data_type] = data_type_results
    results_dict[model] = model_results

# Print the results as an .md table for each model
for model, model_results in results_dict.items():
    print(f"### {model}")
    print("| Data Type | ", end="")
    for benchmark in BENCHMARKS:
        print(f"{benchmark} | eval time (s) | ", end="")
    print()
    print("|-----------|", end="")
    for _ in BENCHMARKS:
        print("-----------|------------|", end="")
    print()
    for data_type in DATA_TYPES:
        if data_type not in model_results:
            continue  # results directory for this data type was missing
        print(f"| {data_type} | ", end="")
        for benchmark in BENCHMARKS:
            results = model_results[data_type][benchmark]
            print(f"{results['score']:.4f} | {results['time']:.2f} | ", end="")
        print()
    print("\n")
2 changes: 2 additions & 0 deletions benchmarks/requirements.txt
@@ -0,0 +1,2 @@
mteb
sentence_transformers
@@ -0,0 +1,13 @@
{
"dataset_revision": "4f58c6b202a23cf9a4da393831edf4f9183cad37",
"mteb_dataset_name": "EmotionClassification",
"mteb_version": "1.0.2",
"test": {
"accuracy": 0.4119499999999999,
"accuracy_stderr": 0.025105228539091216,
"evaluation_time": 23.2,
"f1": 0.36981414412336655,
"f1_stderr": 0.02094871267575925,
"main_score": 0.4119499999999999
}
}
20 changes: 20 additions & 0 deletions benchmarks/results/all-MiniLM-L12-v2_f16/STSBenchmark.json
@@ -0,0 +1,20 @@
{
"dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
"mteb_dataset_name": "STSBenchmark",
"mteb_version": "1.0.2",
"test": {
"cos_sim": {
"pearson": 0.8374641693018909,
"spearman": 0.8305896485864188
},
"euclidean": {
"pearson": 0.8350326075472255,
"spearman": 0.8305896485864188
},
"evaluation_time": 14.66,
"manhattan": {
"pearson": 0.8351482035115159,
"spearman": 0.8308811375478211
}
}
}
@@ -0,0 +1,13 @@
{
"dataset_revision": "4f58c6b202a23cf9a4da393831edf4f9183cad37",
"mteb_dataset_name": "EmotionClassification",
"mteb_version": "1.0.2",
"test": {
"accuracy": 0.41174999999999995,
"accuracy_stderr": 0.02517364693484041,
"evaluation_time": 25.79,
"f1": 0.36964632574873646,
"f1_stderr": 0.02101215083642815,
"main_score": 0.41174999999999995
}
}
20 changes: 20 additions & 0 deletions benchmarks/results/all-MiniLM-L12-v2_f32/STSBenchmark.json
@@ -0,0 +1,20 @@
{
"dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
"mteb_dataset_name": "STSBenchmark",
"mteb_version": "1.0.2",
"test": {
"cos_sim": {
"pearson": 0.837465240168285,
"spearman": 0.8305951440128178
},
"euclidean": {
"pearson": 0.835033461743598,
"spearman": 0.8305951440128178
},
"evaluation_time": 16.18,
"manhattan": {
"pearson": 0.8351470693555814,
"spearman": 0.8308846560867743
}
}
}
@@ -0,0 +1,13 @@
{
"dataset_revision": "4f58c6b202a23cf9a4da393831edf4f9183cad37",
"mteb_dataset_name": "EmotionClassification",
"mteb_version": "1.0.2",
"test": {
"accuracy": 0.4183,
"accuracy_stderr": 0.021613884426451443,
"evaluation_time": 21.54,
"f1": 0.37624466895950653,
"f1_stderr": 0.01743903163262402,
"main_score": 0.4183
}
}
20 changes: 20 additions & 0 deletions benchmarks/results/all-MiniLM-L12-v2_q4_0/STSBenchmark.json
@@ -0,0 +1,20 @@
{
"dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
"mteb_dataset_name": "STSBenchmark",
"mteb_version": "1.0.2",
"test": {
"cos_sim": {
"pearson": 0.8365276911292119,
"spearman": 0.8309588798492489
},
"euclidean": {
"pearson": 0.8372279220677411,
"spearman": 0.8309588798492489
},
"evaluation_time": 13.31,
"manhattan": {
"pearson": 0.8368693263995872,
"spearman": 0.8306785947771824
}
}
}
@@ -0,0 +1,13 @@
{
"dataset_revision": "4f58c6b202a23cf9a4da393831edf4f9183cad37",
"mteb_dataset_name": "EmotionClassification",
"mteb_version": "1.0.2",
"test": {
"accuracy": 0.40095000000000003,
"accuracy_stderr": 0.02566266743734953,
"evaluation_time": 41.75,
"f1": 0.3626628620864726,
"f1_stderr": 0.018959571169492463,
"main_score": 0.40095000000000003
}
}
20 changes: 20 additions & 0 deletions benchmarks/results/all-MiniLM-L12-v2_q4_1/STSBenchmark.json
@@ -0,0 +1,20 @@
{
"dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
"mteb_dataset_name": "STSBenchmark",
"mteb_version": "1.0.2",
"test": {
"cos_sim": {
"pearson": 0.8300376055771063,
"spearman": 0.8202182350295162
},
"euclidean": {
"pearson": 0.8281548958602518,
"spearman": 0.8202182350295162
},
"evaluation_time": 25.48,
"manhattan": {
"pearson": 0.8272951345188557,
"spearman": 0.819294554414274
}
}
}
@@ -0,0 +1,13 @@
{
"dataset_revision": "4f58c6b202a23cf9a4da393831edf4f9183cad37",
"mteb_dataset_name": "EmotionClassification",
"mteb_version": "1.0.2",
"test": {
"accuracy": 0.4117,
"accuracy_stderr": 0.025096015620014265,
"evaluation_time": 26.53,
"f1": 0.3696192637393597,
"f1_stderr": 0.020941989472486138,
"main_score": 0.4117
}
}
@@ -0,0 +1,20 @@
{
"dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
"mteb_dataset_name": "STSBenchmark",
"mteb_version": "1.0.2",
"test": {
"cos_sim": {
"pearson": 0.837594560292421,
"spearman": 0.8308938533093635
},
"euclidean": {
"pearson": 0.8355879778009024,
"spearman": 0.8308938533093635
},
"evaluation_time": 22.22,
"manhattan": {
"pearson": 0.8356896375814314,
"spearman": 0.8311516183577004
}
}
}