Benchmarks I Like

Last updated May 10, 2025

AI evaluation is a maze. Some tests look rigorous but collapse under data-leak scrutiny; some leaderboards reward emoji spam instead of real skill. To keep myself—and anyone who visits—focused on signal, I curate four kinds of resources here:

  • Real-world impact benchmarks: Tasks that translate directly into economic output or long-horizon autonomy. They tell us how close we are to agentic systems that earn or save money without hand-holding.

  • Frontier capability benchmarks: Fresh, contamination-resistant challenges. They stretch today’s best models and flag genuine capability jumps, even when headline scores on older exams have plateaued.

  • Human-preference leaderboards: Live, blinded evaluations where human voters pick their favorite output. They can surface subjective qualities that automated metrics can’t capture, but they’re also vulnerable to crowd bias and gaming.

  • Aggregators and dashboards: Composite indexes and one-stop scoreboards that pull many tests into one view, useful for quick model comparisons—provided you understand the tests behind the numbers.

For each entry you’ll see what it is, why I value it, and watch-outs so you can judge results in context. I update the list whenever a new benchmark proves its worth or an old favorite shows cracks.


Real-world impact benchmarks

SWE-Lancer

(Link goes to paper.)

  • Description: Corpus of real Upwork software-engineering gigs—from $50 bug fixes to $30K feature builds—framed as autonomous tasks for agents.

  • Why I like it: Directly measures economic value: can an AI earn freelance income without human help?

  • Caveats: Evaluation still happens offline; no live leaderboard yet, so reproducibility depends on private re-runs.

METR Time-Horizon Evaluation

  • Description: Measures the longest software tasks, in terms of how long they take a human expert, that an agent can complete with 50% reliability, updated quarterly; a toy version of the underlying fit is sketched after this entry.

  • Why I like it: Offers a single, intuitive number—task length—that rises as autonomy improves.

  • Caveats: Currently limited to software tasks; extrapolating the same time-horizon curve to other domains is still speculative.
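
To make that number concrete, here is a minimal sketch of how a time horizon can be estimated: fit success probability against task length and read off the length at which the agent succeeds half the time. The data, the logistic form, and the fitting choices below are invented for illustration; METR’s published methodology is the authoritative version.

```python
# Toy illustration (not METR's code): fit success probability against log task
# length and read off the length at which the agent succeeds 50% of the time.
import numpy as np
from scipy.optimize import curve_fit

# (human minutes the task takes an expert, whether the agent succeeded) -- fake data
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
agent_succeeded = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0], dtype=float)

def success_curve(log_t, steepness, log_horizon):
    """Logistic curve: success probability falls as tasks get longer."""
    return 1.0 / (1.0 + np.exp(steepness * (log_t - log_horizon)))

params, _ = curve_fit(success_curve, np.log(task_minutes), agent_succeeded,
                      p0=[1.0, np.log(30)], maxfev=10_000)
steepness, log_horizon = params
print(f"50% time horizon ≈ {np.exp(log_horizon):.0f} human-minutes")
```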

Vending-Bench

  • Description: Simulated vending-machine business that forces an agent to plan inventory, prices, and restocking over thousands of steps (a toy version of the idea follows this entry).

  • Why I like it: Captures long-horizon coherence and profit-and-loss thinking—skills most toy tasks ignore.

  • Caveats: Environment abstraction hides messy real-world variables (e.g., logistics, theft, hardware faults), so results may overestimate deploy-time robustness.
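
A toy loop in the same spirit is sketched below; it is not the actual Vending-Bench environment, and the demand model, prices, and fees are invented. The point is that the agent has to keep cash, inventory, and pricing coherent across thousands of small decisions, where one bad habit compounds into bankruptcy.

```python
# Toy stand-in for a Vending-Bench-style loop (numbers and rules are invented).
import random

random.seed(0)
cash, stock = 500.0, 0               # starting balance and inventory
UNIT_COST, DAILY_FEE = 1.00, 2.00    # wholesale price per item, daily operating fee

def naive_policy(cash, stock):
    """Placeholder decision rule; in the real benchmark an LLM agent decides."""
    price = 2.50
    reorder = 40 if stock < 10 else 0
    return price, reorder

for day in range(2000):
    price, reorder = naive_policy(cash, stock)
    cash -= reorder * UNIT_COST + DAILY_FEE                 # pay for stock and fees
    stock += reorder
    demand = max(0, int(random.gauss(30 - 5 * price, 5)))   # higher price, lower demand
    sold = min(stock, demand)
    cash += sold * price
    stock -= sold
    if cash < 0:                                            # the long-horizon failure mode
        print(f"Went bankrupt on day {day}")
        break
else:
    print(f"Solvent after 2000 days with ${cash:,.2f} in cash")
```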

Vectara Hallucination Leaderboard

  • Description: Tracks hallucination rates for dozens of LLMs by having each model summarize a fixed set of source documents and scoring the summaries with Vectara’s HHEM factual-consistency model, which flags claims the source doesn’t support; the rate calculation is sketched after this entry.

  • Why I like it: Zeroes in on a single, business-critical failure mode—making things up—and reports a clean percentage score, so I can spot trustworthy candidates for retrieval-augmented workflows at a glance.

  • Caveats: Uses a fixed document set and a single automated judge; models tuned to that corpus or to the judge’s preferences can overfit, and a summary scored as consistent can still contain subtler factual slips or outdated information.
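
The percentage itself is just an aggregation step; below is a minimal sketch of that step with a stand-in judge and fabricated numbers (the real pipeline uses Vectara’s HHEM classifier over its own document set).

```python
# Illustrative only: turn per-summary consistency scores into a hallucination rate.
THRESHOLD = 0.5  # assumed cut-off between "supported" and "hallucinated"

def hallucination_rate(consistency_scores):
    """Share of summaries judged unsupported by their source document."""
    flagged = sum(1 for s in consistency_scores if s < THRESHOLD)
    return flagged / len(consistency_scores)

# one score per (source document, model summary) pair -- fabricated numbers
scores_by_model = {
    "model-a": [0.91, 0.84, 0.42, 0.77, 0.95],
    "model-b": [0.61, 0.38, 0.49, 0.88, 0.72],
}
for model, scores in scores_by_model.items():
    print(f"{model}: {hallucination_rate(scores):.0%} hallucination rate")
```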

Frontier capability benchmarks

FrontierMath

(Link goes to Epoch AI Benchmarking Hub. Here’s an overview.)

  • Description: Epoch AI’s privately held set of original math problems, written by expert mathematicians and ranging from tough competition questions to research-level challenges; none have appeared online, which prevents training-data leakage.

  • Why I like it: Provides a contamination-free yardstick for pure mathematical reasoning progress.

  • Caveats: Math prowess matters, but many enterprise use-cases hinge more on language, multimodal, and agentic skills, so real-world relevance is indirect.

GPQA

(Link goes to Epoch AI Benchmarking Hub. Here’s the original paper.)

  • Description: Graduate-level multiple-choice science questions designed to thwart pure retrieval and test reasoning.

  • Why I like it: Mirrors the kind of conceptual problem-solving scientists face—good proxy for real-world research tasks.

  • Caveats: Nearly saturated: top models already outscore human experts and exceed 80%, so the successor SuperGPQA is worth watching next.

Humanity’s Last Exam

(Link goes to Scale’s leaderboard page for the Exam. Here’s the original paper.)

  • Description: 2,500 expert-written questions across dozens of disciplines, pitched as the “final boss” of closed-ended academic benchmarks.

  • Why I like it: Raises the bar after GPT-4 steam-rolled legacy exams, offering an early warning system for the next capability jump.

  • Caveats: Breadth trumps depth; many items resemble Olympiad puzzles rather than day-to-day tasks, so wins here won’t always translate to economic impact.

Human preference leaderboards

Chatbot Arena

  • Description: Crowdsourced leaderboard where anonymous human voters pick a winner in blind, head-to-head chat battles, producing a global rank for each LLM (the ranking mechanics are sketched after this entry).

  • Why I like it: Human judgment—rather than multiple-choice tests—promises a refreshing, real-conversation signal that complements automated benchmarks.

  • Caveats: Recent critiques show the arena can be gamed: labs enter dozens of private model variants and report only the best scores, and tuning outputs for emojis, flattery (“sycophancy”), and other superficial traits skews voter preferences.
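
For anyone wondering how blind pairwise votes become a single ranking, here is a minimal Elo-style sketch. It is illustrative only: the Arena’s published methodology fits a Bradley-Terry model over the full vote history rather than running an online update, and the K-factor and starting rating below are assumptions.

```python
# Illustrative only: an online Elo-style update turning pairwise votes into a ranking.
from collections import defaultdict

K = 32  # assumed update step size

def expected_score(r_a, r_b):
    """Probability that the model rated r_a beats the model rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings, winner, loser):
    """Apply one head-to-head human vote to both models' ratings."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]
for winner, loser in votes:
    record_vote(ratings, winner, loser)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```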

Minecraft Benchmark (MC-Bench)

  • Description: Pits frontier AIs against the same build prompt in a sandboxed Minecraft server; humans vote blind on which blocky creation looks better, yielding an Elo-style leaderboard.

  • Why I like it: Tests many AI skills, including creativity, spatial reasoning, and code generation. Also, judging a pineapple statue is visual and intuitive, so even non-experts can spot real capability differences without being swayed by psychological hacks like flattery.

  • Caveats: Aesthetic preferences, voting demographics, and user knowledge can skew scores (I sometimes can’t even tell what a build is supposed to be). Also, success in a stylized game world may not generalize to physical-world constraints or enterprise workflows.

Aggregators and dashboards

Artificial Analysis

  • Description: Independent dashboard that compares leading LLMs on a composite intelligence index plus speed and price; a toy version of such an index is sketched after this entry.

  • Why I like it: One glance shows which models deliver the best mix of smarts, latency, and cost.

  • Caveats: The intelligence index is based on underlying benchmarks that can have flaws (e.g., training-data contamination), which then propagate into the computed score.
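
As a sanity check on what a composite number can and cannot tell you, here is a toy index; the benchmark names, scores, and weights are made up and are not Artificial Analysis’s methodology.

```python
# Toy composite index (fabricated benchmarks, scores, and weights). Note how any
# bias in one benchmark flows straight into every model's composite score.
BENCHMARKS = ("math", "reasoning", "coding")
WEIGHTS = {"math": 0.3, "reasoning": 0.4, "coding": 0.3}  # assumed weights, sum to 1

scores = {  # fabricated per-benchmark accuracies on a 0-1 scale
    "model-a": {"math": 0.92, "reasoning": 0.71, "coding": 0.65},
    "model-b": {"math": 0.80, "reasoning": 0.78, "coding": 0.70},
}

def composite_index(model_scores):
    """Weighted average of per-benchmark scores."""
    return sum(WEIGHTS[b] * model_scores[b] for b in BENCHMARKS)

for model, per_benchmark in scores.items():
    print(f"{model}: {composite_index(per_benchmark):.3f}")
# If the "math" numbers are inflated by training-data contamination, every
# composite inherits that inflation in proportion to math's weight (0.3 here).
```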

Epoch AI Benchmarking Hub

  • Description: The nonprofit Epoch AI curates a public hub that tracks its own contamination-resistant tests (e.g., FrontierMath) alongside independently run benchmarks such as GPQA Diamond, with interactive charts and data downloads.

  • Why I like it: Trusted provider, great charting, and easy data export.

  • Caveats: Covers only a handful of benchmarks, and updates can lag model releases.

Scale AI SEAL Leaderboards

  • Description: Scale’s SEAL (Safety, Evaluations, and Alignment Lab) leaderboards rank frontier AIs on hidden, expert-graded datasets designed to resist prompt-leakage and training-data contamination.

  • Why I like it: Human evaluators plus undisclosed test sets make the scores much harder to game.

  • Caveats: Datasets and weighting rubric remain proprietary, and participation is opt-in, so results cover only models that submit to Scale’s evaluation pipeline.
