A comprehensive review of code-domain benchmarks for LLM research.
-
[2025-04-21] Featured Benchmarks:
SWE-PolyBench: a multi-language benchmark for repository-level evaluation of coding agents, from AWS AI Labs
-
[2025-04-18] We added GitHub stars for each benchmark.
-
[2025-04-13] We added Code Security & Robustness benchmarks.
-
[2025-04-06] We added Code Hallucination benchmarks.
-
[2025-03-29] We have crawled all articles related to code benchmarks from the past five years.
-
[2025-03-17] We added Code Version (version-specific code generation) benchmarks.
-
[2025-03-16] A thorough review of code-domain benchmarks for LLM research has been released.
- Code Completion & Code Generation
- Code Efficiency
- CodeFix & Bug-Fix
- Code Reasoning & Understanding
- Code Hallucination
- Data Science
- Text2SQL
- MultiModal Code Tasks
- Code Security & Robustness
- Code Translation
- Code Version
- Multi-Dimension & Others
- Industry Code Generation
Details of Code Completion & Code Generation Benchmarks :: click to expand ::
- HumanEval: code completion, scored with execution-based pass@k (see the sketch after this list)
- MBPP: text-to-code generation
- APPS: a benchmark for code generation from natural language specifications
- CodeContests: complex programming task
- MultiPL-E: extends the HumanEval and MBPP benchmarks to 18 languages
- MCoNaLa: code generation from multiple natural languages
- LCC: long code context code completion
- CodeClarQA: pairs of natural language descriptions and code, with synthetic clarification questions and answers
- EvalPlus: extends the HumanEval and MBPP benchmarks
- CrossCodeEval: diverse and multilingual benchmark for cross-file code completion
- ODEX: open-domain, execution-based natural language to code generation
- HumanEval-X: multilingual code generation benchmark
- ML-Bench: repo-level ML task solving benchmark using real-world code
- RepoBench: repo-level code auto-completion
- CatCoder: a framework for repo-level code generation in statically typed languages using code and type context
- StudentEval: a benchmark of student-written prompts for code generation evaluation
- DevEval: repo-level code generation
- CoderEval: pragmatic code generation
- ConCodeEval: benchmark for assessing LLMs' understanding of code constraints in domain-specific languages like JSON and YAML
- CodeScope: Multilingual Multitask Multidimensional Benchmark (execution-based)
- OOP: object-oriented programming evaluation benchmark for Python programs
- L2CEval: multilingual, multi-task NL-to-code benchmark including semantic parsing, math reasoning and Python programming.
- HumanExtension: auxiliary-function-based code generation benchmark
- LLM4Decompile: benchmark for evaluating binary-to-C decompilation on real-world open-source binaries
- PYCOMMITS: multi-round Python code editing benchmark from real commit histories
- CodeAgentBench: repo-level code generation benchmark with tool-integrated agents for real-world tasks
- SAFIM: syntax-aware code completion benchmark focusing on code blocks and conditional expressions
- BigCodeBench: code generation with diverse function calls; provides Complete and Instruct splits
- EvoCodeBench: evolving Python code generation benchmark from real GitHub commits
- DynaCode: a dynamic complexity-aware code benchmark
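Most of the completion and generation benchmarks above (HumanEval, MBPP, EvalPlus, MultiPL-E, and others) score models by executing sampled completions against unit tests and reporting pass@k. A minimal sketch of the unbiased pass@k estimator from the HumanEval paper, assuming you already have, for each problem, the number of samples drawn (n) and the number that passed (c); the per-problem execution itself (sandboxing, timeouts) is harness-specific and omitted here:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them passed."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: per-problem (n, c) counts for a hypothetical 3-problem run
results = [(20, 5), (20, 0), (20, 18)]
score = sum(pass_at_k(n, c, k=10) for n, c in results) / len(results)
print(f"pass@10 = {score:.3f}")
```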
Details of Code Efficiency Benchmarks :: click to expand ::

Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
---|---|---|---|---|
EvalPerf | Evaluating Language Models for Efficient Code Generation | COLM 2024 | Github | Dataset, Website |
EffiBench | EffiBench: Benchmarking the Efficiency of Automatically Generated Code | NeurIPS 2024 | Github | |
Mercury | Mercury: A Code Efficiency Benchmark for Code Large Language Models | NeurIPS 2024 | Github | Dataset |
ECCO | ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? | EMNLP 2024 | Github | Dataset |
PIE | Learning Performance-Improving Code Edits | ICLR 2024 | Github | Website |
ENAMEL | How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark | ICLR 2025 | Github | Dataset |
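These efficiency benchmarks compare generated code against reference solutions on execution time (and sometimes memory) rather than on correctness alone. Below is an illustrative sketch of a runtime-speedup check, not tied to any particular harness above; the candidate/reference functions and input are hypothetical:

```python
import time
from statistics import median

def time_fn(fn, arg, repeats: int = 5) -> float:
    """Median wall-clock runtime of fn(arg) over several repeats."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        times.append(time.perf_counter() - start)
    return median(times)

def reference(nums):   # O(n^2) baseline solution
    return [sum(nums[: i + 1]) for i in range(len(nums))]

def candidate(nums):   # hypothetical model-generated prefix-sum variant
    out, total = [], 0
    for x in nums:
        total += x
        out.append(total)
    return out

data = list(range(2_000))
assert candidate(data) == reference(data)   # correctness gate first
speedup = time_fn(reference, data) / time_fn(candidate, data)
print(f"speedup over reference: {speedup:.1f}x")
```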
Details of Code Reasoning & Understanding Benchmarks :: click to expand ::

Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
---|---|---|---|---|
GenCodeSearchNet | GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding | EMNLP 2023 | Github | Dataset |
CRUXEval | CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution | Arxiv 2024/01 | Github | LeaderBoard |
Poor-CodeSumEval | How Effectively Do Code Language Models Understand Poor-Readability Code? | ASE 2024 | Github | Dataset |
CodeScope | CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation | ACL 2024 | Github | LeaderBoard, Dataset |
CodeJudge-Eval | CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding? | COLING 2025 | Github | |
CodeMMLU | CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs | ICLR 2025 | Github | Dataset, Website, LeaderBoard |
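CRUXEval-style reasoning tasks ask a model to predict a function's output for a given input (or vice versa) and score the prediction by executing the actual function. A minimal sketch of such an output-prediction check, assuming the model's answer arrives as a Python literal; the example function and prediction are hypothetical:

```python
import ast

def check_output_prediction(fn_source: str, fn_name: str, arg, predicted: str) -> bool:
    """Execute the function on arg and compare with the model's predicted literal."""
    namespace: dict = {}
    exec(fn_source, namespace)            # trusted benchmark code only
    actual = namespace[fn_name](arg)
    return actual == ast.literal_eval(predicted)

fn_source = """
def f(xs):
    return sorted(set(xs))[:2]
"""
# Hypothetical model prediction for f([3, 1, 3, 2])
print(check_output_prediction(fn_source, "f", [3, 1, 3, 2], "[1, 2]"))   # True
```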
Details of Code Hallucination Benchmarks :: click to expand ::

Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
---|---|---|---|---|
HALLUCODE | Exploring and Evaluating Hallucinations in LLM-Powered Code Generation | Arxiv 2024/04 | | |
CodeHalu | CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification | AAAI 2025 | Github | Dataset |
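Execution-based hallucination checks (the approach named in CodeHalu's title) run generated code and inspect the failure mode: a call to a nonexistent module, attribute, or name is a strong signal of a hallucinated API. A minimal, illustrative sketch under that assumption, not the benchmark's actual verifier; the generated snippet is hypothetical:

```python
HALLUCINATION_ERRORS = (ImportError, AttributeError, NameError)

def classify_failure(code: str) -> str:
    """Run a generated snippet and label the failure mode, if any."""
    try:
        exec(code, {})                    # sandboxing omitted for brevity
    except HALLUCINATION_ERRORS as e:
        return f"possible hallucinated API: {type(e).__name__}: {e}"
    except Exception as e:
        return f"other runtime error: {type(e).__name__}: {e}"
    return "ran without error"

# Hypothetical generated code calling a method that does not exist
print(classify_failure("import json\nprint(json.parse('{}'))"))
```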
Details of Data Science Benchmarks :: click to expand ::

Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
---|---|---|---|---|
DS-1000 | DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation | ICML 2023 | Github | Dataset, HomePage |
ARCADE | Natural Language to Code Generation in Interactive Data Science Notebooks | ACL 2023 | Github | Dataset |
DA-Code | DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models | EMNLP 2024 | Github | Dataset, Website |
MatPlotBench | MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization | ACL 2024 Findings | Github | Dataset |
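Data-science benchmarks of this kind typically execute a generated snippet inside a prepared problem context (e.g., a pandas DataFrame already in scope) and then run test assertions over the resulting variables. The sketch below illustrates that pattern with a hypothetical problem, solution, and test; it is not any specific benchmark's harness:

```python
import pandas as pd

def run_and_check(context_code: str, generated_code: str, test_code: str) -> bool:
    """Build the problem context, run the generated snippet, then run the tests."""
    ns: dict = {"pd": pd}
    exec(context_code, ns)      # set up the inputs the prompt promised the model
    exec(generated_code, ns)    # model-generated solution
    try:
        exec(test_code, ns)     # assertions over the produced variables
        return True
    except AssertionError:
        return False

context = "df = pd.DataFrame({'city': ['A', 'B', 'A'], 'sales': [3, 5, 4]})"
generated = "result = df.groupby('city')['sales'].sum().to_dict()"
tests = "assert result == {'A': 7, 'B': 5}"
print(run_and_check(context, generated, tests))   # True
```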
Details of Industry Code Generation Benchmarks :: click to expand ::

Benchmark | Paper | Date | Github | Dataset & Website & LeaderBoard |
---|---|---|---|---|
VerilogEval | VerilogEval: Evaluating Large Language Models for Verilog Code Generation | ICCAD 2023 | Github | Dataset |
VGen | Benchmarking Large Language Models for Automated Verilog RTL Code Generation | DATE 2023 | Github | Dataset |
RTLLM | RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model | ASPDAC 2024 | Github | Dataset |
LLM4PLC | LLM4PLC: Harnessing Large Language Models for Verifiable Programming of PLCs in Industrial Control Systems | ICSE 2024 | Github | Website |
Agents4PLC | Agents4PLC: Automating Closed-loop PLC Code Generation and Verification in Industrial Control Systems using LLM-based Agents | Arxiv 2024/10 | Github | Dataset |
 | A Multi-Agent Framework for Extensible Structured Text Generation in PLCs | Arxiv 2024/12 | | |
MetRex | MetRex: A Benchmark for Verilog Code Metric Reasoning Using LLMs | ASPDAC 2025 | Github | Dataset |
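The hardware-oriented benchmarks above (VerilogEval, VGen, RTLLM) generally check generated RTL by compiling it together with a reference testbench in a simulator and inspecting the result. A rough sketch of such a wrapper, assuming Icarus Verilog (iverilog/vvp) is installed; the file names and the "ALL TESTS PASSED" pass-marker convention are hypothetical, not taken from any of these harnesses:

```python
import subprocess
import tempfile
from pathlib import Path

def simulate_verilog(design_src: str, testbench_src: str) -> bool:
    """Compile a generated design with its testbench and run the simulation."""
    with tempfile.TemporaryDirectory() as tmpdir:
        tmp = Path(tmpdir)
        (tmp / "design.v").write_text(design_src)
        (tmp / "tb.v").write_text(testbench_src)
        compiled = subprocess.run(
            ["iverilog", "-o", str(tmp / "sim.out"), str(tmp / "design.v"), str(tmp / "tb.v")],
            capture_output=True, text=True,
        )
        if compiled.returncode != 0:        # syntax/elaboration failure
            return False
        run = subprocess.run(["vvp", str(tmp / "sim.out")], capture_output=True, text=True)
        # Hypothetical convention: the testbench $display-s this marker on success
        return "ALL TESTS PASSED" in run.stdout

# usage: simulate_verilog(generated_rtl, reference_testbench)
```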