Leaderboards - a kaizuberbuehler Collection

Leaderboards

updated Sep 14, 2025

Running

230

BigCodeBench Leaderboard

🥇

Explore and analyze code completion benchmarks
Running

1.47k

UGI Leaderboard

📢

Uncensored General Intelligence Leaderboard
Configuration error

4.71k

LMArena Leaderboard

🏆

View the LMArena model performance leaderboard
Running on CPU Upgrade

7.03k

MTEB Leaderboard

🥇

Embedding Leaderboard
Running on CPU Upgrade

13.8k

Open LLM Leaderboard

🏆

Track, rank and evaluate open LLMs and chatbots
Running

1.49k

Big Code Models Leaderboard

📈

Explore and submit evaluations for code generation models
Running

95

OpenCompass LLM Leaderboard

🚀

Display a web page
Running on CPU Upgrade

Featured

1.22k

Open ASR Leaderboard

🏆

Explore speech model benchmarks and submit evaluation requests
Running

Featured

574

Image Arena Leaderboard

📊

Image Generation and Image Editing Arena & Leaderboard
Running

Featured

438

LLM Performance Leaderboard

🐨

View LLM performance rankings on an interactive leaderboard
WebArena: A Realistic Web Environment for Building Autonomous Agents

Paper • 2307.13854 • Published Jul 25, 2023 • 26
Running

91

Zebra Logic Bench

🦓

Display and explore a leaderboard for model evaluations
Runtime error

21

LiveBench

🥇

Running

89

imgsys.org

📊

imgsys.org: an arena for text-guided image generation
Running

53

ZeroEval Leaderboard

📊

Embed ZeroEval for evaluation
SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

Paper • 2408.14354 • Published Aug 26, 2024 • 41
Running

64

Hallucination Evaluation Leaderboard

⚡

Redirect to leaderboard page
Running on CPU Upgrade

189

LLM Hallucination Leaderboard

🚀

View and filter LLM hallucination leaderboard
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Paper • 2404.07972 • Published Apr 11, 2024 • 51
A3: Android Agent Arena for Mobile GUI Agents

Paper • 2501.01149 • Published Jan 2, 2025 • 22
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Paper • 2412.14161 • Published Dec 18, 2024 • 51
Running

462

TTS Spaces Arena

🤗

Blind vote on HF TTS models!
MiniMax-01: Scaling Foundation Models with Lightning Attention

Paper • 2501.08313 • Published Jan 14, 2025 • 300
Running

25

BrowserGym Leaderboard

🏆

Tracks the performance of LLMs, VLMs, and agents on web navigation tasks
Running

87

DABstep Leaderboard

🕺

DABstep Reasoning Benchmark Leaderboard
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles

Paper • 2502.01081 • Published Feb 3, 2025 • 13
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance

Paper • 2502.08127 • Published Feb 12, 2025 • 58
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

Paper • 2502.09696 • Published Feb 13, 2025 • 43
Running on CPU Upgrade

444

Agent Leaderboard

💬

Ranking of LLMs for agentic tasks
TextArena

Paper • 2504.11442 • Published Apr 15, 2025 • 30