Peter's LLM Leaderboard

Evaluating the capabilities of large language models is difficult. There are already many public leaderboards that do this work, but they are often prone to malicious manipulation, and some evaluation benchmarks are not suitable for real application scenarios. So I decided to create my own benchmark and use it to evaluate my favourite models.

Leaderboard

| Model | Total | Knowledge | Coding | Censorship | Instruction | Math | Extraction | Reasoning | Summarizing | Writing |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf | 55 | 6 | 8 | 5 | 6 | 5 | 7 | 8 | 6 | 4 |
| miqu-1-70b-iq2_xs.gguf | 54 | 7 | 8 | 6 | 6 | 3 | 6 | 8 | 6 | 4 |
| Smaug-34B-v0.1_Q4_K_M.gguf | 53 | 7 | 8 | 6 | 5 | 4 | 6 | 7 | 6 | 4 |
| Starling-LM-7B-beta-Q8_0.gguf | 52 | 6 | 8 | 6 | 6 | 5 | 7 | 5 | 6 | 3 |
| openchat-3.5-0106.Q8_0.gguf | 52 | 7 | 8 | 6 | 6 | 5 | 7 | 4 | 6 | 3 |
| senku-70b-iq2_xxs.gguf | 51 | 6 | 8 | 6 | 7 | 5 | 6 | 4 | 6 | 3 |
| Hermes-2-Pro-Mistral-7B.Q8_0.gguf | 48 | 6 | 8 | 4 | 5 | 5 | 6 | 6 | 6 | 2 |
| Nous-Hermes-2-Mistral-7B-DPO.Q8_0.gguf | 46 | 6 | 8 | 5 | 4 | 4 | 6 | 4 | 6 | 3 |
| nous-capybara-34b.Q4_K_M.gguf | 46 | 6 | 6 | 3 | 6 | 3 | 7 | 5 | 6 | 4 |
| gemma-7b-it.Q8_0.gguf | 44 | 6 | 7 | 6 | 5 | 4 | 5 | 2 | 6 | 3 |
| gemma-2b-it.Q8_0.gguf | 36 | 3 | 7 | 6 | 3 | 2 | 2 | 4 | 6 | 3 |
| phi-2.Q8_0.gguf | 26 | 6 | 5 | 5 | 3 | 3 | 1 | 2 | 1 | 0 |
| qwen1_5-1_8b-chat-q8_0.gguf | 25 | 3 | 5 | 3 | 2 | 1 | 5 | 2 | 2 | 2 |

Note:

  • Due to the limitations of my GPU (24G VRAM), I can only run quantized models, so the performance should be lower than that of the original models.
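
The Total column is simply the sum of the nine category scores. A minimal sketch of that aggregation, using the Nous-Hermes-2-Mixtral-8x7B-DPO row from the table above:

```python
# Sum per-category scores into the leaderboard "Total" column.
# Example scores are taken from the Nous-Hermes-2-Mixtral-8x7B-DPO row above.
scores = {
    "Knowledge": 6, "Coding": 8, "Censorship": 5,
    "Instruction": 6, "Math": 5, "Extraction": 7,
    "Reasoning": 8, "Summarizing": 6, "Writing": 4,
}
total = sum(scores.values())
print(total)  # 55
```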

Detailed Results

Evaluation Questions

I collected 61 test questions from the Internet; they include:

Download Models

Model Info

| Model | Size | Required VRAM | Required GPUs |
| --- | --- | --- | --- |
| miqu-1-70b-iq2_xs.gguf | 19G | 23.8G | >= RTX-3090 |
| senku-70b-iq2_xxs.gguf | 20G | 22G | >= RTX-3090 |
| Smaug-34B-v0.1_Q4_K_M.gguf | 20G | 23.8G | >= RTX-3090 |
| nous-capybara-34b.Q4_K_M.gguf | 20G | 23.8G | >= RTX-3090 |
| Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf | 28.4G | >24G | >= RTX-3090 |
| openchat-3.5-0106.Q8_0.gguf | 7.7G | 9.4G | >= RTX-3070 |
| Starling-LM-7B-beta-Q8_0.gguf | 7.7G | 9.4G | >= RTX-3070 |
| gemma-7b-it.Q8_0.gguf | 9.1G | 15G | >= RTX-3080 |
| gemma-2b-it.Q8_0.gguf | | | >= RTX-3070 |
| phi-2.Q8_0.gguf | | | >= RTX-3070 |
| qwen1_5-1_8b-chat-q8_0.gguf | | | >= RTX-3070 |

Evaluation Platform

  • GeForce RTX 4090 (24G VRAM)
  • Intel I9-14900K
  • 64G RAM
  • Ubuntu 22.04
  • Python 3.10
  • llama-cpp-python

Run in a local environment

1. Install Dependencies

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install -r requirements.txt

2. Download Models

Download the models from Hugging Face and put the .gguf files in the models folder.
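
If you prefer to script the download, here is a minimal sketch using huggingface_hub; the repo_id and filename below are placeholders, so substitute the actual repository and file for each model you want:

```python
# Sketch: fetch a GGUF file into the local models/ folder with huggingface_hub.
# repo_id and filename are placeholders; use the real repository for your model.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="some-user/some-model-GGUF",  # hypothetical repo id
    filename="some-model.Q4_K_M.gguf",    # hypothetical file name
    local_dir="models",
)
```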

3. Create model config file

Create a model config file in the models folder; here is an example:

{
  "name": "gemma-2b",
  "chatFormat": "gemma",
  "modelPath": "gemma-2b-it.Q8_0.gguf",
  "context": 8192
}
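
As a rough illustration of how a config like this maps onto llama-cpp-python (a sketch that assumes the fields feed straight into the Llama constructor; it is not the repository's actual loading code, and the config path is hypothetical):

```python
# Sketch: load a model config like the JSON above and start a llama-cpp-python model.
# This mirrors the config fields (name, chatFormat, modelPath, context); it is an
# assumption about how they are used, not the repository's actual code.
import json
from pathlib import Path

from llama_cpp import Llama

config = json.loads(Path("models/gemma-2b.json").read_text())  # hypothetical config path

llm = Llama(
    model_path=str(Path("models") / config["modelPath"]),
    chat_format=config["chatFormat"],  # e.g. "gemma"
    n_ctx=config["context"],           # e.g. 8192
    n_gpu_layers=-1,                   # offload all layers to the GPU
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is 12 * 7?"}]
)
print(reply["choices"][0]["message"]["content"])
```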

4. Evaluation

python evaluate.py -m models/gemma-7b-it.json
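
To evaluate every model in one go, you could loop over the config files. This is a convenience sketch, not part of the repository, and it assumes every .json file in models/ is a model config:

```python
# Sketch: run evaluate.py once per model config found in models/.
# Assumes every .json file in models/ is a model config like the example above.
import subprocess
from pathlib import Path

for cfg in sorted(Path("models").glob("*.json")):
    subprocess.run(["python", "evaluate.py", "-m", str(cfg)], check=True)
```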