
reference performance of AI models on coding tasks #278

Open
monperrus opened this issue Mar 21, 2025 · 6 comments

Comments

monperrus (Contributor) commented Mar 21, 2025:

We copy-paste notable coding leaderboard snapshots here, each with its source.

This will be useful to justify the choice of models in our future papers. (A structured way to record such snapshots is sketched below.)
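A minimal sketch of how each snapshot could be recorded in a structured form (the field names, URL, and model names are our own placeholders, purely illustrative):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class LeaderboardSnapshot:
    """One leaderboard snapshot with enough provenance to cite later."""
    leaderboard: str       # e.g. "LMarena"
    task: str              # e.g. "coding"
    date: str              # snapshot date, ISO 8601
    source_url: str        # where the snapshot was taken
    top_models: list[str]  # best-performing models on that date

# Hypothetical entry; the URL and model names are placeholders.
snap = LeaderboardSnapshot(
    leaderboard="LMarena",
    task="coding",
    date="2025-03-21",
    source_url="https://example.org/leaderboard",
    top_models=["model-a", "model-b"],
)
print(json.dumps(asdict(snap), indent=2))
```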

monperrus (Contributor Author):

LMarena coding leaderboard, March 21, 2025

[image: leaderboard snapshot]

monperrus (Contributor Author):

LiveCodeBench code generation leaderboard, March 21, 2025

[image: leaderboard snapshot]

monperrus (Contributor Author):

RepairBench leaderboard, March 21, 2025

[image: leaderboard snapshot]

monperrus (Contributor Author):

CodeArena: A Collective Evaluation Platform for LLM Code Generation
https://arxiv.org/pdf/2503.01295

tldr: This work introduces CodeArena, an online evaluation framework tailored for LLM code generation, which dynamically recalibrates individual model scores based on the holistic performance of all participating models, mitigating score biases caused by widespread benchmark leakage.

[image: CodeArena leaderboard snapshot]
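A rough sketch of the recalibration idea (our own reading of the abstract, not the paper's exact formula): weight each problem by how hard it is for the current pool of models, so problems that every model solves, which may have leaked into training data, contribute little to any model's score, and the weights shift as new models join.

```python
# Sketch of pool-relative score recalibration, assuming a binary
# pass/fail matrix results[model][problem]. This illustrates the idea
# described in the CodeArena abstract (arXiv:2503.01295); the paper's
# actual scoring rule may differ.

def recalibrated_scores(results: dict[str, dict[str, bool]]) -> dict[str, float]:
    problems = list(next(iter(results.values())))
    n_models = len(results)

    # Fraction of participating models that solve each problem.
    solve_rate = {p: sum(results[m][p] for m in results) / n_models
                  for p in problems}

    # Problems everyone solves (possibly leaked) get near-zero weight;
    # rarely solved problems dominate the recalibrated score.
    weight = {p: 1.0 - solve_rate[p] for p in problems}
    total = sum(weight.values()) or 1.0

    return {m: sum(weight[p] for p in problems if results[m][p]) / total
            for m in results}

# Toy data: p1 is solved by everyone, so it barely affects the ranking.
results = {
    "model-a": {"p1": True, "p2": True,  "p3": False},
    "model-b": {"p1": True, "p2": False, "p3": False},
    "model-c": {"p1": True, "p2": True,  "p3": True},
}
print(recalibrated_scores(results))  # model-c > model-a > model-b
```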

monperrus (Contributor Author):

BigCode leaderboard (HumanEval and MultiPL-E benchmarks), March 25, 2025

[image: leaderboard snapshot]
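For context on the numbers in that snapshot: HumanEval (and its MultiPL-E translations) are scored with pass@k, and the standard unbiased estimator from the Codex paper (Chen et al., 2021) is easy to reproduce. The sample counts below are illustrative, not taken from the leaderboard.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples, drawn without replacement from n
    generations of which c are correct, passes the unit tests."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: guaranteed pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustrative values: 200 generations per problem, 40 correct.
print(pass_at_k(200, 40, 1))   # 0.2
print(pass_at_k(200, 40, 10))  # ~0.90
```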

