šŸ‘Øā€šŸ’» Awesome Code Benchmark


A comprehensive review of code-domain benchmarks for LLM research.


News

  • šŸ”„šŸ”„ [2025-04-21] Featured Benchmarks:

    šŸ”„ SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents from AWS AI Labs

  • [2025-04-18] We added GitHub stars for each benchmark.

  • [2025-04-13] We added Code Security & Robustness benchmarks.

  • [2025-04-06] We added Code Hallucination benchmarks.

  • [2025-03-29] We have crawled all articles related to code benchmarks from the past five years.

  • [2025-03-17] We added Code Version (version-specific code generation) benchmarks.

  • [2025-03-16] A thorough review of code-domain benchmarks for LLM research has been released.


Table of Contents

  • Code Completion & Code Generation
  • Code Efficiency
  • CodeFix & Bug-Fix
  • Code Reasoning & Understanding
  • Code Hallucination
  • Data Science
  • Text2SQL
  • MultiModal Code Tasks
  • Code Security & Robustness
  • Code Translation
  • Code Version
  • Multi & Other Dimension
  • Industry Code Generation

šŸš€ Top Code Benchmark

Code Completion & Code Generation

Benchmark Paper Date Github Dataset & Website & LeaderBoard
HumanEval Evaluating Large Language Models Trained on Code Arxiv 2021/07 Github Stars šŸ¤—Dataset
MBPP Program Synthesis with Large Language Models Arxiv 2021/08 Github Stars šŸ¤—Dataset
APPS Measuring Coding Challenge Competence With APPS NeurIPS 2021 Github Stars šŸ¤—Dataset
CodeContests Competition-Level Code Generation with AlphaCode Science 2022 Github Stars Dataset
MultiPL-E MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation TSE 2023 Github Stars šŸ¤—Dataset
MCoNaLa MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages EACL 2023 Findings Github Stars šŸ¤—Dataset
LCC LongCoder: A Long-Range Pre-trained Language Model for Code Completion ICML 2023 Github Dataset
CodeClarQA Python Code Generation by Asking Clarification Questions ACL 2023 Github Stars Dataset
EvalPlus Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation NeurIPS 2023 Github Stars šŸ¤—Dataset šŸ“ŠLeaderBoard
CrossCodeEval CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion NeurIPS 2023 Github Stars Dataset
ODEX Execution-Based Evaluation for Open-Domain Code Generation EMNLP 2023 Findings Github Stars Dataset
HumanEval-X CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X SIGKDD 2023 Github Stars Dataset
ML-Bench ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code Arxiv 2023/11 Github Stars šŸ¤—Dataset 🌐Website
RepoBench RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems ICLR 2024 Github Stars šŸ¤—Dataset
CatCoder Enhancing Repository-Level Code Generation with Integrated Contextual Information Arxiv 2024/06
StudentEval StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code ACL 2024 Findings GithubStars šŸ¤—Dataset
DevEval DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories ACL 2024 Github Stars šŸ¤—Dataset
CoderEval CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models ICSE 2024 Github Stars
ConCodeEval ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages Arxiv 2024/07
CodeScope CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation ACL 2024 GithubStars šŸ“ŠLeaderBoard šŸ¤—Dataset
OOP OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models ACL 2024 Findings Github Stars šŸ¤—Dataset
L2CEval L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models TACL 2024
HumanExtension Exploring Language Model's Code Generation Ability with Auxiliary Functions NAACL 2024 Findings Github Stars šŸ¤—Dataset
LLM4Decompile LLM4Decompile: Decompiling Binary Code with Large Language Models EMNLP 2024 GithubStars šŸ¤—Dataset
PYCOMMITS Coeditor: Leveraging Contextual Changes for Multi-round Code Auto-editing ICLR 2024 Github Stars Dataset
CodeAgentBench CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges ACL 2024
SAFIM Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks ICML 2024 GithubStars šŸ¤—Dataset
BigCodeBench BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions ICLR 2025 Github Stars šŸ¤—Dataset šŸ“ŠLeaderBoard
EvoCodeBench EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories NeurIPS 2024 Github Stars šŸ¤—Dataset
DynaCode DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation Arxiv 2025/03
Details of Code Completion & Code Generation Benchmarks :: click to expand ::
  • HumanEval: code completion
  • MBPP: text -> code; code generation
  • APPS: a benchmark for code generation from natural language specifications
  • CodeContests: complex programming task
  • MultiPL-E: extends the HumanEval and MBPP benchmarks to 18 languages
  • MCoNaLa: code generation from multiple natural languages
  • LCC: long code context code completion
  • CodeClarQA: pairs of natural language descriptions and code with synthetic clarification questions and answers
  • EvalPlus: extends the HumanEval and MBPP benchmarks with many additional test cases for rigorous pass@k evaluation (see the sketch after this list)
  • CrossCodeEval: diverse and multilingual benchmark for cross-file code completion
  • ODEX: open-domain, execution-based natural language to code generation
  • HumanEval-X: multilingual ability of code generation
  • ML-Bench: repo-level ML task solving benchmark using real-world code
  • RepoBench: repo-level code auto-completion
  • CatCoder: a framework for repo-level code generation in statically typed languages using code and type context
  • StudentEval: a benchmark of student-written prompts for code generation evaluation
  • DevEval: repo-level code generation
  • CoderEval: pragmatic code generation
  • ConCodeEval: benchmark for assessing LLMs' understanding of code constraints in domain-specific languages like JSON and YAML
  • CodeScope: Multilingual Multitask Multidimensional Benchmark (execution-based)
  • OOP: object-oriented programming evaluation benchmark of python programs
  • L2CEval: multilingual, multi-task NL-to-code benchmark including semantic parsing, math reasoning and Python programming.
  • HumanExtension: auxiliary-function-based code generation benchmark
  • LLM4Decompile: benchmark for evaluating binary-to-C decompilation on real-world open-source binaries
  • PYCOMMITS: multi-round Python code editing benchmark from real commit histories
  • CodeAgentBench: repo-level code generation benchmark with tool-integrated agents for real-world tasks
  • SAFIM: syntax-aware code completion benchmark focusing on code blocks and conditional expressions
  • BigCodeBench: code generation with diverse function calls and complex instructions, evaluated on Complete and Instruct splits
  • EvoCodeBench: evolving Python code generation benchmark from real GitHub commits
  • DynaCode: a dynamic complexity-aware code benchmark
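
Most of the completion and generation benchmarks above (HumanEval, MBPP, EvalPlus, and their multilingual variants) are scored by executing each generated sample against unit tests and reporting pass@k. Below is a minimal sketch of the unbiased pass@k estimator introduced with HumanEval; it is not any benchmark's official harness, and a real evaluation must sandbox the untrusted generated code before running the tests.

```python
# Minimal sketch of the unbiased pass@k estimator used by HumanEval/MBPP-style
# benchmarks: given n generated samples for a task, of which c pass the unit
# tests, estimate the probability that at least one of k samples is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task (n samples, c correct)."""
    if n - c < k:          # every size-k subset contains a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for a task, 37 of which pass the tests
print(pass_at_k(n=200, c=37, k=1))   # 0.185
print(pass_at_k(n=200, c=37, k=10))  # ~0.88
```

The per-task values are then averaged over the benchmark to give the reported pass@k score.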

Code Efficiency

Benchmark Paper Date Github Dataset & Website & LeaderBoard
EvalPerf Evaluating Language Models for Efficient Code Generation COLM 2024 Github Stars šŸ¤—Dataset 🌐Website
EffiBench EffiBench: Benchmarking the Efficiency of Automatically Generated Code NeurIPS 2024 Github Stars
Mercury Mercury: A Code Efficiency Benchmark for Code Large Language Models NeurIPS 2024 Github Stars šŸ¤—Dataset
ECCO ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? EMNLP 2024 Github Stars šŸ¤—Dataset
PIE Learning Performance-Improving Code Edits ICLR 2024 Github Stars 🌐Website
ENAMEL How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark ICLR 2025 Github Stars šŸ¤—Dataset
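
Code-efficiency benchmarks generally take a functionally correct solution and score how fast it runs relative to a reference implementation on stress inputs. The sketch below is only illustrative (function and parameter names are placeholders, not any benchmark's API); the benchmarks above additionally control for hardware, warm-up, timeouts, and memory.

```python
# Illustrative sketch (not an official harness) of runtime comparison in the
# style of code-efficiency benchmarks: time a reference and a candidate
# implementation on the same inputs and report their ratio.
import timeit

def runtime_ratio(reference, candidate, inputs, repeats: int = 5) -> float:
    ref = min(timeit.repeat(lambda: [reference(x) for x in inputs],
                            number=1, repeat=repeats))
    cand = min(timeit.repeat(lambda: [candidate(x) for x in inputs],
                             number=1, repeat=repeats))
    return ref / cand  # > 1.0 means the candidate runs faster than the reference

# Toy example: two implementations of "sum of squares below n"
slow = lambda n: sum(i * i for i in range(n))
fast = lambda n: (n - 1) * n * (2 * n - 1) // 6
print(runtime_ratio(slow, fast, inputs=[10_000] * 20))
```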

CodeFix & Bug-Fix

Benchmark Paper Date Github Dataset & Website & LeaderBoard
Buggy-HumanEval&Buggy-FixEval Large Language Models of Code Fail at Completing Code with Potential Bugs NeurIPS 2023 GithubStars Dataset
SWT-Bench SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents NeurIPS 2024 Github Stars 🌐Website
HumanEvalPack OctoPack: Instruction Tuning Code Large Language Models ICLR 2024 GithubStars šŸ¤—Dataset
SWE-bench SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024 Github Stars 🌐Website
GitBug-Java GitBug-Java: A Reproducible Benchmark of Recent Java Bugs MSR 2024 GithubStars šŸ¤—Dataset 🌐Website
GitBug-Actions GitBug-Actions: Building Reproducible Bug-Fix Benchmarks with GitHub Actions ICSE 2024 Demo Github Stars ā–¶ļøVideo
RepoBugs When Large Language Models Confront Repository-Level Automatic Program Repair: How Well They Done? ICSE 2024 Industry Track
RepoFixEval RepoFixEval: A Repository-Level Program Repair Benchmark From Issue Discovering to Bug Fixing OpenReview 2024 Link
DebugBench DebugBench: Evaluating Debugging Capability of Large Language Models ACL 2024 Github Stars šŸ¤—Dataset
Multi-Bug Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging EMNLP 2024 Findings Github Stars
Coffee-Gym Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code EMNLP 2024 šŸ¤—Dataset
INTERVENOR INTERVENOR: Prompt the Coding Ability of Large Language Models with the Interactive Chain of Repairing ACL 2024 Findings Github Stars
StatType-SO ZS4C: Zero-Shot Synthesis of Compilable Code for Incomplete Code Snippets using LLMs TOSEM 2024
LiveCodeBench LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code ICLR 2025 Github Stars šŸ¤—Dataset 🌐Website šŸ“ŠLeaderBoard
SWE-bench Multimodal SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? ICLR 2025 Github Stars šŸ¤—Dataset 🌐Website
FeedbackEval FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks GithubStars
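
Several of the repair benchmarks above (SWE-bench, SWT-Bench, GitBug-Java) judge a model-proposed patch by applying it to the repository at a pinned commit and re-running the project's tests. The sketch below captures only that outer loop; the repository path, commit, patch file, and test command are hypothetical, and the official harnesses run each instance in its own container with pinned dependencies.

```python
# Hypothetical sketch of the outer evaluation loop used by issue-/bug-fixing
# benchmarks: reset the repo to the buggy commit, apply the model's patch,
# then rerun the designated tests. Paths and the test command are assumptions.
import subprocess

def patch_resolves(repo_dir: str, base_commit: str, patch_file: str,
                   test_cmd: list[str]) -> bool:
    def run(cmd):
        return subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True)

    run(["git", "checkout", "-f", base_commit])      # buggy snapshot
    # patch_file is resolved relative to repo_dir because of cwd above
    if run(["git", "apply", patch_file]).returncode != 0:
        return False                                 # patch does not apply
    return run(test_cmd).returncode == 0             # tests decide pass/fail

# Example (hypothetical instance):
# patch_resolves("repos/astropy", "abc123", "preds/instance_1.diff",
#                ["python", "-m", "pytest", "tests/test_units.py", "-q"])
```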

Code Reasoning & Understanding

Benchmark Paper Date Github Dataset & Website & LeaderBoard
GenCodeSearchNet GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding EMNLP 2023 GithubStars šŸ¤—Dataset
CRUXEval CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution Arxiv 2024/01 Github Stars šŸ“ŠLeaderBoard
Poor-CodeSumEval How Effectively Do Code Language Models Understand Poor-Readability Code? ASE 2024 Github Stars šŸ¤—Dataset
CodeScope CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation ACL 2024 GithubStars šŸ“ŠLeaderBoard šŸ¤—Dataset
CodeJudge-Eval CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding? COLING 2025 Github Stars
CodeMMLU CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs ICLR 2025 Github Stars šŸ¤—Dataset 🌐Website šŸ“ŠLeaderBoard

Code Hallucination

Benchmark Paper Date Github Dataset & Website & LeaderBoard
HALLUCODE Exploring and Evaluating Hallucinations in LLM-Powered Code Generation Arxiv 2024/04
CodeHalu CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification AAAI 2025 GithubStars šŸ¤—Dataset

Data Science

Benchmark Paper Date Github Dataset & Website & LeaderBoard
DS-1000 DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation ICML 2023 GithubStars šŸ¤—Dataset 🌐HomePage
ARCADE Natural Language to Code Generation in Interactive Data Science Notebooks ACL 2023 Github Stars Dataset
DA-Code DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models EMNLP 2024 GithubStars šŸ¤—Dataset 🌐Website
MatPlotBench MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization ACL 2024 Findings GithubStars šŸ¤—Dataset

Text2SQL

Benchmark Paper Date Github Dataset & Website & LeaderBoard
Spider Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task EMNLP 2018 GithubStars 🌐Website
SParC SParC: Cross-Domain Semantic Parsing in Context ACL 2019 Github Stars 🌐Website
CoSQL CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases EMNLP 2019 Github Stars 🌐Website
Spider-DK Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization EMNLP 2021 Github Stars
Spider-Syn Towards Robustness of Text-to-SQL Models against Synonym Substitution ACL 2021 Github Stars
Spider-Realistic Structure-Grounded Pretraining for Text-to-SQL NAACL 2021 Dataset
BIRD Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs NeurIPS 2023 Github Stars 🌐Website
Dr.Spider Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness ICLR 2023 GithubStars
BookSQL BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain NAACL 2024 Github Stars Dataset
Archer Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning EACL 2024 🌐Website
SecureSQL SecureSQL: Evaluating Data Leakage of Large Language Models as Natural Language Interfaces to Databases EMNLP 2024 Findings Github Stars Dataset
Spider 2.0 Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows ICLR 2025 Github Stars 🌐Website
SNAILS SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference PACMMOD 2025 GithubStars
SQL2Text Semantic Captioning: Benchmark Dataset and Graph-Aware Few-Shot In-Context Learning for SQL2Text COLING 2025 GithubStars Dataset
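
Most of these text-to-SQL datasets are ultimately judged by execution accuracy: the predicted and gold queries are run against the task's database and their result sets compared. The sketch below is a bare-bones version of that idea for a SQLite database; it omits the refinements of the official Spider/BIRD evaluators (ordering rules, value normalization, timeouts), and the example path and queries are hypothetical.

```python
# Bare-bones sketch of execution-accuracy scoring for text-to-SQL, assuming the
# benchmark database ships as a SQLite file (as Spider-style datasets do).
import sqlite3

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    with sqlite3.connect(db_path) as conn:
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False                      # predicted query fails to execute
        gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as order-insensitive multisets of rows
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)

# Example (hypothetical database and queries):
# execution_match("spider/database/concert_singer/concert_singer.sqlite",
#                 "SELECT count(*) FROM singer", "SELECT count(*) FROM singer")
```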

MultiModal Code Tasks

Benchmark Paper Date Github Dataset & Website & LeaderBoard
MMCode MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems EMNLP 2024 GithubStars šŸ¤—Dataset
Drawing Pandas Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code ArXiv 2024/12 GithubStars šŸ¤—Dataset
Web2Code Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs NeurIPS 2024 GithubStars šŸ¤—Dataset 🌐Website
VGBench VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation EMNLP 2024 Github Stars šŸ¤—Dataset
SVGEditBench SVGEditBench: A Benchmark Dataset for Quantitative Assessment of LLM's SVG Editing Capabilities CVPR2024 workshop Github Stars šŸ¤—Dataset
Plot2Code Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots Arxiv 2024/05 GithubStars šŸ¤—Dataset
HumanEval-V HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks ArXiv 2024/10 GithubStars 🌐Website šŸ“ŠLeaderBoard šŸ¤—Dataset
WebSight-Test WAFFLE: Multi-Modal Model for Automated Front-End Development Arxiv 2024/10 GithubStars šŸ¤—Dataset
Sketch2Code Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping Arxiv 2024/10 GithubStars 🌐Website
Interaction2Code Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping Arxiv 2024/11 GithubStars šŸ¤—Dataset šŸ“ŠLeaderBoard
ScratchEval ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges Arxiv 2024/11 GithubStars šŸ¤—Dataset
MRWeb MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code from UI Designs Arxiv 2024/12 GithubStars šŸ¤—Dataset
Image2Struct Image2Struct: Benchmarking Structure Extraction for Vision-Language Models NeurIPS 2024 GithubStars 🌐Website šŸ¤—Dataset
BigDocs-Bench BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks ICLR 2025 šŸ¤—Dataset 🌐Website
WebCode2M WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs WWW 2025 Github 🌐Website šŸ¤—Dataset
Design2Code Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering NAACL 2025 GithubStars šŸ¤—Dataset
DiagramGenBenchmark From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing CVPR 2025 GithubStars 🌐Website šŸ¤—Dataset
ChartMimic ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation ICLR 2025 GithubStars 🌐Website šŸ¤—Dataset
SVG-Bench StarVector: Generating Scalable Vector Graphics Code from Images and Text CVPR 2025 Github Stars 🌐Website šŸ¤—Dataset
LLM4SVG Empowering LLMs to Understand and Generate Complex Vector Graphics CVPR 2025 GithubStars 🌐Website
ChartCoder ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation Arxiv 2025/01 Github Stars šŸ¤—Dataset
Code-Vision Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities Arxiv 2025/02
Flame-React-Eval Advancing vision-language models in front-end development via data synthesis Arxiv 2025/03 Github šŸ¤—Dataset

Code Security & Robustness

Benchmark Paper Date Github Dataset & Website & LeaderBoard
COCO COCO: Testing Code Generation Systems via Concretized Instructions Arxiv 2023/08 Github Stars
ReCode ReCode: Robustness Evaluation of Code Generation Models ACL 2023 Github Stars Dataset
RedCode RedCode: Risky Code Execution and Generation Benchmark for Code Agents NeurIPS 2024 Github Stars 🌐Website šŸ“ŠLeaderBoard
CodeWMBench CodeWMBench: An Automated Benchmark for Code Watermarking Evaluation ACM-TURC 2024 Github Stars
RMCBench RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code ASE 2024 Github Stars šŸ¤—Dataset
PyP4LLMSec Benchmarking the Security Aspect of Large Language Model-Based Code Generation ICSE 2024 Github Stars Dataset
CWE-Bench-Java IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities Arxiv 2024/05 Github
CyberSecEval 3 CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models Arxiv 2024/08 Github Dataset
CS-Eval CS-Eval: A Comprehensive Large Language Model Benchmark for CyberSecurity Arxiv 2024/11 Github Stars šŸ¤—Dataset
SecBench SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity Arxiv 2024/12   Dataset 🌐Website

Code Translation

Benchmark Paper Date Github Dataset & Website & LeaderBoard
TransCoder Unsupervised Translation of Programming Languages NeurIPS 2020 Github(deprecated) Github(new) Stars Dataset
AVATAR AVATAR: A Parallel Corpus for Java-Python Program Translation ACL Findings 2023 Github Stars Dataset
G-TransEval On the Evaluation of Neural Code Translation: Taxonomy and Benchmark ASE 2023 Github Stars šŸ¤—Dataset
CodeTransOcean CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation EMNLP 2023 Github Stars šŸ¤—Dataset
xCodeEval XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval ACL 2024 GithubStars šŸ¤—Dataset
PolyHumanEval Unraveling the Potential of Large Language Models in Code Translation: How Far Are We? APSEC 2024 GithubStars šŸ¤—Dataset
RustRepoTrans Repository-level Code Translation Benchmark Targeting Rust Arxiv 2024/11 Github Stars šŸ¤—Dataset
ClassEval-T Escalating LLM-based Code Translation Benchmarking into the Class-level Era Arxiv 2024/11 GithubStars šŸ¤—Dataset
TRANSREPO-BENCH Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation Arxiv 2025/01 Github Stars šŸ¤—Dataset
LongTrans Enhancing LLMs in Long Code Translation through Instrumentation and Program State Alignment Arxiv 2025/04

Code Version

Benchmark Paper Date Github Dataset & Website & LeaderBoard
CodeUpdateEval Automatically Recommend Code Updates: Are We There Yet? TOSEM 2024 Github Stars šŸ¤—Dataset
JavaVersionGenBench On the Generalizability of Deep Learning-based Code Completion Across Programming Language Versions ICPC 2024 GithubStars šŸ¤—Dataset
VersiCode VersiCode: Towards Version-controllable Code Generation Arxiv 2024/10 Github Stars 🌐Website šŸ¤—Dataset
GitChameleon GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models Arxiv 2024/11 Github Stars šŸ¤—Dataset
LLM-Deprecated-APl LLMs Meet Library Evolution: Evaluating Deprecated API Usage in LLM-based Code Completion ICSE 2025 Github Stars šŸ¤—Dataset
LibEvolutionEval LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation NAACL 2025
CodeUpdateArena CodeUpdateArena: Benchmarking Knowledge Editing on API Updates Arxiv 2025/02 Github Stars šŸ¤—Dataset
RustEvo2 RustEvo2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation Arxiv 2025/03 Github Stars šŸ¤—Dataset

Multi & Other Dimension

Benchmark Paper Date Github Dataset & Website & LeaderBoard
Stack-Repo RepoFusion: Training Code Models to Understand Your Repository Arxiv 2023/06 GithubStars šŸ¤—Dataset
MultiNL-H Improving Natural Language Capability of Code Large Language Model Arxiv 2024/01 GithubStars
HumanEvalPack OctoPack: Instruction Tuning Code Large Language Models ICLR 2024 GithubStars šŸ¤—Dataset
CodeBenchGen CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks Arxiv 2024/04 GithubStars Dataset
X-HumanEval-X Exploring Multi-Lingual Bias of Large Code Models in Code Generation Arxiv 2024/04
RACE Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models Arxiv 2024/07 GithubStars šŸ“ŠLeaderBoard
RealWorld-Bench What's Wrong with Your Code Generated by Large Language Models? An Extensive Study Arxiv 2024/07
APPS+ StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback ACL 2024 GithubStars Dataset
InfiBench InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models NeurIPS 2024 GithubStars 🌐Website
RobustAPI Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation AAAI 2024 GithubStars šŸ¤—Dataset
EvoEval Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM COLM 2024 Github Stars
CodeScope CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation ACL 2024 GithubStars šŸ“ŠLeaderBoard šŸ¤—Dataset
AssertionBench AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation NAACL 2025 GithubStars
REval Evaluating Large Language Models with Runtime Behavior of Program Execution ICSE 2025 GithubStars šŸ“ŠLeaderBoard
LiveCodeBench LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code ICLR 2025 Github Stars šŸ¤—Dataset 🌐Website šŸ“ŠLeaderBoard
SWE-PolyBench SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents Arxiv 2025/04 Github Stars 🌐Website šŸ¤—Dataset

Industry Code Generation

Benchmark Paper Date Github Dataset & Website & LeaderBoard
VerilogEval VerilogEval: Evaluating Large Language Models for Verilog Code Generation ICCAD 2023 GithubStars šŸ¤—Dataset
VGen Benchmarking Large Language Models for Automated Verilog RTL Code Generation DATE 2023 GithubStars šŸ¤—Dataset
RTLLM RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model ASPDAC 2024 GithubStars šŸ¤—Dataset
LLM4PLC LLM4PLC: Harnessing Large Language Models for Verifiable Programming of PLCs in Industrial Control Systems ICSE 2024 GithubStars 🌐Website
Agents4PLC Agents4PLC: Automating Closed-loop PLC Code Generation and Verification in Industrial Control Systems using LLM-based Agents Arxiv 2024/10 GithubStars šŸ¤—Dataset
A Multi-Agent Framework for Extensible Structured Text Generation in PLCs Arxiv 2024/12
MetRex MetRex: A Benchmark for Verilog Code Metric Reasoning Using LLMs ASPDAC 2025 GithubStars šŸ¤—Dataset