Benchmarks I Like
Last updated May 10, 2025
AI evaluation is a maze. Some tests look rigorous but collapse under data-leak scrutiny; some leaderboards reward emoji spam instead of real skill. To keep myself—and anyone who visits—focused on signal, I curate four kinds of resources here:
Real-world impact benchmarks: Tasks that translate directly into economic output or long-horizon autonomy. They tell us how close we are to agentic systems that earn or save money without hand-holding.
Frontier capability benchmarks: Fresh, contamination-resistant challenges. They stretch today’s best models and flag genuine capability jumps, even when headline scores on older exams have plateaued.
Human-preference leaderboards: Live, blinded evaluations where human voters pick their favorite output. They can surface subjective qualities that automated metrics can’t capture, but they’re also vulnerable to crowd bias and gaming.
Aggregators and dashboards: Composite indexes and one-stop scoreboards that pull many tests into one view, useful for quick model comparisons—provided you understand the tests behind the numbers.
For each entry you’ll see what it is, why I value it, and watch-outs so you can judge results in context. I update the list whenever a new benchmark proves its worth or an old favorite shows cracks.
Real-world impact benchmarks
SWE-Lancer
(Link goes to paper.)
Description: Corpus of real Upwork software-engineering gigs—from $50 bug fixes to $30K feature builds—framed as autonomous tasks for agents.
Why I like it: Directly measures economic value: can an AI earn freelance income without human help?
Caveats: Evaluation still happens offline; no live leaderboard yet, so reproducibility depends on private re-runs.
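The headline metric, as I understand the paper, is dollar-denominated: sum the payouts of the gigs whose submitted fix passes that task’s end-to-end tests. Here is a minimal sketch of that bookkeeping; the Task structure and function names are my own illustration, not SWE-Lancer’s actual harness.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One freelance gig: its dollar value and whether the agent's patch passed."""
    payout_usd: float
    passed_tests: bool  # did the submission pass the gig's end-to-end tests?

def total_earnings(tasks: list[Task]) -> float:
    """Headline 'dollars earned' metric: sum payouts of tasks the agent solved."""
    return sum(t.payout_usd for t in tasks if t.passed_tests)

# Three gigs worth $50, $500, and $30,000; only the first two solved.
tasks = [Task(50, True), Task(500, True), Task(30_000, False)]
print(f"Earned ${total_earnings(tasks):,.0f} of ${sum(t.payout_usd for t in tasks):,.0f}")
```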
METR Time-Horizon Evaluation
Description: Tracks how long a coding task (measured by the time it takes a skilled human) an agent can complete before failing, updated quarterly; a toy version of the headline calculation is sketched below.
Why I like it: Offers a single, intuitive number—task length—that rises as autonomy improves.
Caveats: Currently limited to software projects; extrapolating the same “steps-to-failure” curve to other domains is still speculative.
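To make the “single intuitive number” concrete, here is a toy version of the idea under my own assumptions: bucket tasks by how long they take humans and report the longest bucket the agent still clears at least half the time. METR’s published methodology fits a logistic success curve rather than using buckets, so treat this purely as a sketch of the metric’s shape.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    human_minutes: float  # how long the task takes a skilled human
    success: bool         # did the agent finish it autonomously?

def time_horizon_50(attempts: list[Attempt], bucket_edges: list[float]) -> float:
    """Toy 50% time horizon: the longest task-length bucket in which the agent
    still succeeds on at least half of its attempts. (METR fits a logistic
    curve to get its published number; this only shows the intuition.)"""
    horizon = 0.0
    for lo, hi in zip(bucket_edges, bucket_edges[1:]):
        bucket = [a for a in attempts if lo <= a.human_minutes < hi]
        if bucket and sum(a.success for a in bucket) / len(bucket) >= 0.5:
            horizon = hi
    return horizon

attempts = [Attempt(2, True), Attempt(8, True), Attempt(30, False),
            Attempt(45, False), Attempt(120, False), Attempt(240, False)]
print(time_horizon_50(attempts, [0, 4, 16, 60, 480]))  # -> 16
```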
Vending-Bench
Description: Simulated vending-machine business that forces an agent to plan inventory, prices, and restocking over thousands of steps.
Why I like it: Captures long-horizon coherence and profit-and-loss thinking—skills most toy tasks ignore.
Caveats: Environment abstraction hides messy real-world variables (e.g., logistics, theft, hardware faults), so results may overestimate deploy-time robustness.
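To show the shape of what gets scored (this is my own toy abstraction, not Vending-Bench’s actual environment, API, or parameters): at each step the agent sees its cash, stock, and price, decides how much to reorder and what to charge, and the simulator applies demand and costs. The score is long-run net worth, so one bad pricing loop early on compounds over thousands of steps.

```python
import random

def toy_vending_sim(policy, steps: int = 2000, seed: int = 0) -> float:
    """Run a toy vending business and return final net worth (cash + stock value)."""
    rng = random.Random(seed)
    cash, inventory, price = 500.0, 0, 2.0      # starting balance, units on hand, $/unit
    unit_cost, daily_fee = 1.0, 2.0             # wholesale cost and fixed operating fee
    for _ in range(steps):
        order_qty, price = policy(cash, inventory, price)
        order_qty = min(order_qty, int(cash // unit_cost))   # can't spend cash you don't have
        cash -= order_qty * unit_cost
        inventory += order_qty
        demand = max(0, int(rng.gauss(10 - 2 * price, 2)))   # price-sensitive, noisy demand
        sold = min(demand, inventory)
        inventory -= sold
        cash += sold * price - daily_fee                     # revenue minus the fixed fee
    return cash + inventory * unit_cost

# A deliberately dumb policy: restock to ~30 units, never change the price.
print(toy_vending_sim(lambda cash, inv, price: (max(0, 30 - inv), 2.0)))
```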
Vectara Hallucination Leaderboard
Description: Tracks hallucination rates for dozens of LLMs by having each model summarize a fixed set of short source documents, then scoring every summary with Vectara’s hallucination-detection model (HHEM) to flag claims the source doesn’t support.
Why I like it: Zeroes in on a single, business-critical failure mode—making things up—and reports a clean percentage score, so I can spot trustworthy candidates for retrieval-augmented workflows at a glance.
Caveats: Uses a fixed document set and an automated judge; models tuned against that judge or document style can overfit, and a summary marked “consistent” can still contain subtler factual slips or outdated information.
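A toy version of the grading loop, to make the “clean percentage” concrete. The real leaderboard uses Vectara’s HHEM model as the judge; `is_consistent` below is a hypothetical stand-in, and the naive string-matching judge in the example is far cruder than HHEM.

```python
def hallucination_rate(pairs, is_consistent) -> float:
    """pairs: (source_document, model_summary) tuples; returns percent flagged."""
    flagged = sum(1 for doc, summary in pairs if not is_consistent(doc, summary))
    return 100.0 * flagged / len(pairs)

# Trivially strict judge: every sentence of the summary must appear verbatim in the source.
naive_judge = lambda doc, summary: all(
    s.strip() in doc for s in summary.split(".") if s.strip()
)
pairs = [("The plant opened in 1998 in Ohio.", "The plant opened in 1998 in Ohio."),
         ("The plant opened in 1998 in Ohio.", "The plant opened in 1998 in Texas.")]
print(hallucination_rate(pairs, naive_judge))  # -> 50.0
```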
Frontier capability benchmarks
FrontierMath
(Link goes to Epoch AI Benchmarking Hub. Here’s an overview.)
Description: Epoch AI’s private set of fresh, expert-written math problems, ranging from hard Olympiad-style questions to research-level challenges, that have never appeared online, preventing training-data leakage.
Why I like it: Provides a contamination-free yardstick for pure mathematical reasoning progress.
Caveats: Math prowess matters, but many enterprise use-cases hinge more on language, multimodal, and agentic skills, so real-world relevance is indirect.
GPQA
(Link goes to Epoch AI Benchmarking Hub. Here’s the original paper.)
Description: Graduate-level multiple-choice science questions designed to thwart pure retrieval and test reasoning.
Why I like it: Mirrors the kind of conceptual problem-solving scientists face—good proxy for real-world research tasks.
Caveats: Nearly saturated: top models already beat human experts and exceed 80%, so the successor benchmark SuperGPQA is worth watching next.
Humanity’s Last Exam
(Link goes to Scale’s leaderboard page for the Exam. Here’s the original paper.)
Description: 2,500 expert-written questions across dozens of disciplines, pitched as the “final boss” of closed-ended academic benchmarks.
Why I like it: Raises the bar after GPT-4 steam-rolled legacy exams, offering an early warning system for the next capability jump.
Caveats: Breadth trumps depth; many items resemble Olympiad puzzles rather than day-to-day tasks, so wins here won’t always translate to economic impact.
Human preference leaderboards
Chatbot Arena
Description: Crowdsourced leaderboard where anonymous human voters pick a winner in blind, head-to-head chat battles, producing a global rank for each LLM.
Why I like it: Human judgment—rather than multiple-choice tests—promises a refreshing, real-conversation signal that complements automated benchmarks.
Caveats: Recent critiques show the arena can be gamed: labs enter dozens of private model variants and report only the best scores, and optimizing for emojis, flattery (“sycophancy”), and other superficial traits skews voter preferences.
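Both this leaderboard and MC-Bench below turn blind pairwise votes into a single rating. As a rough illustration (not Arena’s exact method; LMSYS has since moved to a Bradley-Terry-style statistical fit), here is the classic Elo update that gives these boards their “Elo-style” flavor.

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """One head-to-head vote: the winner takes rating points from the loser."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)   # smaller gain if the winner was already favored
    return r_winner + delta, r_loser - delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
for winner, loser in [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
print(ratings)  # model_a ends slightly ahead after winning two of three votes
```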
Minecraft Benchmark (MC-Bench)
Description: Pits frontier AIs against the same build prompt in a sandboxed Minecraft server; humans vote blind on which blocky creation looks better, yielding an Elo-style leaderboard.
Why I like it: Tests many AI skills at once, including creativity, spatial reasoning, and code generation. Also, judging a pineapple statue is visual and intuitive, so even non-experts can spot real capability differences without the undue influence of psychological hacks like flattery.
Caveats: Aesthetic preferences, voting demographics, and user knowledge can skew scores (I sometimes don’t even know what the thing they’re generating is supposed to look like). Also, success in a stylized game world may not generalize to physical-world constraints or enterprise workflows.
Aggregators and dashboards
Artificial Analysis
Description: Independent dashboard that compares leading LLMs on a composite intelligence index plus speed and price.
Why I like it: One glance shows which models deliver the best mix of smarts, latency, and cost.
Caveats: The intelligence index is based on underlying benchmarks that can have flaws (e.g., training-data contamination), which then propagate into the computed score.
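To make that caveat concrete, here is a hypothetical composite index: a weighted average of benchmark scores (the weights and component names are invented for illustration, not Artificial Analysis’s actual formula or benchmark mix). Because every component feeds the average, contamination in any one of them inflates the headline number.

```python
def composite_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of benchmark scores (illustrative formula only)."""
    return sum(scores[name] * w for name, w in weights.items()) / sum(weights.values())

weights = {"math": 0.3, "coding": 0.4, "knowledge": 0.3}
clean = {"math": 60.0, "coding": 55.0, "knowledge": 70.0}
leaky = {**clean, "math": 85.0}   # same model, but the math test leaked into training data
print(composite_index(clean, weights), composite_index(leaky, weights))  # 61.0 vs 68.5
```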
Epoch AI Benchmarking Hub
Description: The nonprofit Epoch AI curates a public hub that tracks its own contamination-resistant tests (e.g., FrontierMath) alongside independently run benchmarks such as GPQA Diamond, with interactive charts and data downloads.
Why I like it: Trusted provider, great charting, and easy data export.
Caveats: Covers only a handful of benchmarks, and updates can lag model releases.
Scale AI SEAL Leaderboards
Description: Scale’s SEAL (Safety, Evaluations, and Alignment Lab) leaderboards rank frontier AIs on hidden, expert-graded datasets designed to resist prompt leakage and training-data contamination.
Why I like it: Human evaluators plus undisclosed test sets make the scores much harder to game.
Caveats: Datasets and weighting rubric remain proprietary, and participation is opt-in, so results cover only models that submit to Scale’s evaluation pipeline.