---
title: Leaderboard of Leaderboards
emoji: 🔥
colorFrom: pink
colorTo: yellow
sdk: static
pinned: false
license: mit
short_description: Real-time rankings of the most trusted leaderboards
---

# Leaderboard of Leaderboards

Every AI benchmark leaderboard on HuggingFace, ranked in one place and updated in real time.

## What This Is

The AI evaluation landscape has exploded. Hundreds of leaderboards now exist across language modeling, coding, multimodal reasoning, agent behavior, speech synthesis, safety, and more. Knowing which benchmarks actually matter — which ones the global research community trusts, cites, and returns to — used to require crawling dozens of pages manually.

Leaderboard of Leaderboards is a meta-leaderboard. It ranks the leaderboards themselves, scored by real HuggingFace community signals: live trending scores and cumulative likes. The result is a continuously updated map of credibility across the AI evaluation ecosystem.

## Why It Matters

Most discussions about AI performance ask which model is best. This project asks a different and more fundamental question: which evaluation standards does the community actually trust? A benchmark that thousands of researchers visit and endorse carries a different weight than one that exists in isolation. By surfacing that signal, Leaderboard of Leaderboards helps researchers, engineers, and organizations make better decisions about how to measure intelligence — not just which models claim to possess it.

## What You Can Explore

Sort by trending to see which benchmarks are capturing attention right now. Sort by likes to see which have earned lasting credibility over time. Filter by domain to focus on the evaluation criteria most relevant to your work, whether that is general language ability, code generation, retrieval, multimodal perception, or AI safety. Use the search to locate a specific leaderboard instantly.

Each entry shows the leaderboard's rank within this collection alongside its real-time global rank across all HuggingFace Spaces. That second number tells you not just how a benchmark compares to other leaderboards, but where it stands in the entire HuggingFace ecosystem.
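As a rough sketch of how a global rank like that could be derived from public data, the snippet below lists Spaces ordered by likes through the `huggingface_hub` client and returns the position of a given Space. The Space id, the `limit`, and the assumption that the listing endpoint accepts `sort="likes"` are illustrative; this is not this project's actual implementation.

```python
# Sketch: approximate a Space's global rank by likes across HuggingFace Spaces.
# Assumes huggingface_hub is installed; the Space id below is illustrative.
from huggingface_hub import HfApi

def global_rank_by_likes(space_id: str, limit: int = 5000):
    """Return the 1-based position of `space_id` among the most-liked Spaces,
    or None if it falls outside the first `limit` results."""
    api = HfApi()
    # Results arrive already ordered by likes (descending), so the
    # enumeration index is the rank within the fetched window.
    for rank, space in enumerate(
        api.list_spaces(sort="likes", direction=-1, limit=limit), start=1
    ):
        if space.id == space_id:
            return rank
    return None

print(global_rank_by_likes("open-llm-leaderboard/open_llm_leaderboard"))
```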

## Notable Leaderboards in This Collection

Open LLM Leaderboard by HuggingFace evaluates open-source language models on IFEval, BBH, MATH, GPQA, MUSR, and MMLU-Pro. It is the most widely cited open benchmark for general language model capability.

Chatbot Arena by lmarena-ai ranks models using Elo ratings derived from over one million blind pairwise human preference votes. Because it relies on human judgment rather than automated metrics, it is considered among the most ecologically valid benchmarks in existence.

MTEB covers 58 datasets across 112 languages and 8 task types, making it the definitive ranking for text embedding and retrieval models.

BigCodeBench tests models on 1,140 real-world programming tasks spanning 139 libraries, measuring practical software engineering capability rather than toy coding puzzles.

FINAL Bench, short for Frontier Intelligence Nexus for AGI-Level Verification, evaluates models across 100 tasks covering 15 domains using the TICOS assessment framework. It is designed to probe the boundaries of what current AI systems can and cannot do, targeting capabilities associated with general intelligence rather than narrow task performance. It reached the global top 5 in HuggingFace dataset rankings within weeks of release.

Smol AI WorldCup is a tournament-format benchmark specifically designed for sub-8B parameter models. Rather than evaluating models on static test sets, it stages head-to-head competitions scored automatically using FINAL Bench criteria, making it one of the few evaluation frameworks built for the emerging class of efficient small language models.

ALL Bench brings together evaluation results across multiple benchmark frameworks into a unified leaderboard, offering a cross-benchmark perspective on model performance that reduces the risk of benchmark-specific overfitting distorting rankings.

## How Rankings Are Determined

Data is pulled live from the HuggingFace Hub API. Trending rankings reflect the platform's real-time trending score, which weights recent visit velocity and engagement. Like rankings reflect total community endorsements accumulated over the lifetime of each Space. Neither ranking is curated or editorially influenced. The order you see is the order the global AI community has produced through its collective behavior.
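For readers who want to reproduce a similar ordering themselves, here is a minimal sketch against the public Hub API via `huggingface_hub`. It assumes `SpaceInfo` exposes `likes` and, on recent client versions, `trending_score`; the search term and sort key are assumptions rather than this Space's exact query.

```python
# Minimal sketch: rank leaderboard Spaces by community signals from the Hub API.
# Assumes a recent huggingface_hub release; `trending_score` may be absent on
# older versions, in which case the fallback value of 0 is used below.
from huggingface_hub import HfApi

api = HfApi()

# Spaces whose metadata mentions "leaderboard", most-liked first.
spaces = list(api.list_spaces(search="leaderboard", sort="likes", direction=-1, limit=100))

by_likes = sorted(spaces, key=lambda s: s.likes or 0, reverse=True)
by_trending = sorted(spaces, key=lambda s: getattr(s, "trending_score", 0) or 0, reverse=True)

print("Top 10 by cumulative likes:")
for rank, space in enumerate(by_likes[:10], start=1):
    print(f"{rank:2d}. {space.id}  likes={space.likes}")

print("\nTop 10 by trending score:")
for rank, space in enumerate(by_trending[:10], start=1):
    print(f"{rank:2d}. {space.id}")
```

Note that the trending score itself is computed server-side by the Hub; a client like this only reads it, which mirrors the hands-off, uncurated approach described above.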

## Who This Is For

AI researchers deciding which benchmarks to include in their evaluation suite. Engineers comparing models before deployment and needing an authoritative reference point. Journalists and analysts tracking which evaluation standards are shaping the field. Organizations building AI products that need to ground their performance claims in community-validated criteria. And anyone who believes that how we measure intelligence matters as much as the intelligence we build.