Peter's LLM Leaderboard

Evaluating the capabilities of large language models is difficult. There are already many public leaderboards that do this work, but they are often prone to malicious manipulation, and some evaluation benchmarks are not suitable for real application scenarios. So I decided to create my own benchmark and use it to evaluate my favourite models.

Leaderboard

| Model | Total | Knowledge | Coding | Censorship | Instruction | Math | Extraction | Reasoning | Summarizing | Writing |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf | 55 | 6 | 8 | 5 | 6 | 5 | 7 | 8 | 6 | 4 |
| miqu-1-70b-iq2_xs.gguf | 54 | 7 | 8 | 6 | 6 | 3 | 6 | 8 | 6 | 4 |
| Smaug-34B-v0.1_Q4_K_M.gguf | 53 | 7 | 8 | 6 | 5 | 4 | 6 | 7 | 6 | 4 |
| Starling-LM-7B-beta-Q8_0.gguf | 52 | 6 | 8 | 6 | 6 | 5 | 7 | 5 | 6 | 3 |
| openchat-3.5-0106.Q8_0.gguf | 52 | 7 | 8 | 6 | 6 | 5 | 7 | 4 | 6 | 3 |
| senku-70b-iq2_xxs.gguf | 51 | 6 | 8 | 6 | 7 | 5 | 6 | 4 | 6 | 3 |
| Hermes-2-Pro-Mistral-7B.Q8_0.gguf | 48 | 6 | 8 | 4 | 5 | 5 | 6 | 6 | 6 | 2 |
| Nous-Hermes-2-Mistral-7B-DPO.Q8_0.gguf | 46 | 6 | 8 | 5 | 4 | 4 | 6 | 4 | 6 | 3 |
| nous-capybara-34b.Q4_K_M.gguf | 46 | 6 | 6 | 3 | 6 | 3 | 7 | 5 | 6 | 4 |
| gemma-7b-it.Q8_0.gguf | 44 | 6 | 7 | 6 | 5 | 4 | 5 | 2 | 6 | 3 |
| gemma-2b-it.Q8_0.gguf | 36 | 3 | 7 | 6 | 3 | 2 | 2 | 4 | 6 | 3 |
| phi-2.Q8_0.gguf | 26 | 6 | 5 | 5 | 3 | 3 | 1 | 2 | 1 | 0 |
| qwen1_5-1_8b-chat-q8_0.gguf | 25 | 3 | 5 | 3 | 2 | 1 | 5 | 2 | 2 | 2 |

Note:

  • Due to the limitations of my GPU (24G VRAM), I can only run quantized models, so the performance should be lower than that of the original models.
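
The Total column is simply the sum of the nine category scores. A minimal sketch of that aggregation, using the Nous-Hermes-2-Mixtral-8x7B-DPO row from the table above:

```python
# Sum per-category scores into the leaderboard "Total" column.
# Example scores are taken from the Nous-Hermes-2-Mixtral-8x7B-DPO row above.
scores = {
    "Knowledge": 6, "Coding": 8, "Censorship": 5,
    "Instruction": 6, "Math": 5, "Extraction": 7,
    "Reasoning": 8, "Summarizing": 6, "Writing": 4,
}
total = sum(scores.values())
print(total)  # 55
```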

Detailed Results

Evaluation Questions

I collected 61 test questions from the Internet; they include:

Download Models

Model Info

| Model | Size | Required VRAM | Required GPUs |
| --- | --- | --- | --- |
| miqu-1-70b-iq2_xs.gguf | 19G | 23.8G | >= RTX-3090 |
| senku-70b-iq2_xxs.gguf | 20G | 22G | >= RTX-3090 |
| Smaug-34B-v0.1_Q4_K_M.gguf | 20G | 23.8G | >= RTX-3090 |
| nous-capybara-34b.Q4_K_M.gguf | 20G | 23.8G | >= RTX-3090 |
| Nous-Hermes-2-Mixtral-8x7B-DPO.Q4_K_M.gguf | 28.4G | >24G | >= RTX-3090 |
| openchat-3.5-0106.Q8_0.gguf | 7.7G | 9.4G | >= RTX-3070 |
| Starling-LM-7B-beta-Q8_0.gguf | 7.7G | 9.4G | >= RTX-3070 |
| gemma-7b-it.Q8_0.gguf | 9.1G | 15G | >= RTX-3080 |
| gemma-2b-it.Q8_0.gguf | | | >= RTX-3070 |
| phi-2.Q8_0.gguf | | | >= RTX-3070 |
| qwen1_5-1_8b-chat-q8_0.gguf | | | >= RTX-3070 |

Evaluation Platform

  • GeForce RTX 4090 (24G VRAM)
  • Intel I9-14900K
  • 64G RAM
  • Ubuntu 22.04
  • Python 3.10
  • llama-cpp-python

Run in a local environment

1. Install Dependencies

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install -r requirements.txt

2. Download Models

Download the models from Hugging Face and put the .gguf files in the models folder.
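
If you prefer to script the download, here is a minimal sketch using huggingface_hub; the repo_id and filename below are placeholders, so substitute the actual repository and file for each model you want:

```python
# Sketch: fetch a GGUF file into the local models/ folder with huggingface_hub.
# repo_id and filename are placeholders; use the real repository for your model.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="some-user/some-model-GGUF",  # hypothetical repo id
    filename="some-model.Q4_K_M.gguf",    # hypothetical file name
    local_dir="models",
)
```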

3. Create model config file

Create a model config file in the models folder; here is an example:

{
  "name": "gemma-2b",
  "chatFormat": "gemma",
  "modelPath": "gemma-2b-it.Q8_0.gguf",
  "context": 8192
}
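
As a rough illustration of how a config like this maps onto llama-cpp-python (a sketch that assumes the fields feed straight into the Llama constructor; it is not the repository's actual loading code, and the config path is hypothetical):

```python
# Sketch: load a model config like the JSON above and start a llama-cpp-python model.
# This mirrors the config fields (name, chatFormat, modelPath, context); it is an
# assumption about how they are used, not the repository's actual code.
import json
from pathlib import Path

from llama_cpp import Llama

config = json.loads(Path("models/gemma-2b.json").read_text())  # hypothetical config path

llm = Llama(
    model_path=str(Path("models") / config["modelPath"]),
    chat_format=config["chatFormat"],  # e.g. "gemma"
    n_ctx=config["context"],           # e.g. 8192
    n_gpu_layers=-1,                   # offload all layers to the GPU
)

reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is 12 * 7?"}]
)
print(reply["choices"][0]["message"]["content"])
```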

4. Evaluation

python evaluate.py -m models/gemma-7b-it.json
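
To evaluate every model in one go, you could loop over the config files. This is a convenience sketch, not part of the repository, and it assumes every .json file in models/ is a model config:

```python
# Sketch: run evaluate.py once per model config found in models/.
# Assumes every .json file in models/ is a model config like the example above.
import subprocess
from pathlib import Path

for cfg in sorted(Path("models").glob("*.json")):
    subprocess.run(["python", "evaluate.py", "-m", str(cfg)], check=True)
```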