Benchmarks I Like
Last updated June 22, 2025
AI evaluation is a maze. Some tests look rigorous but collapse under data-leak scrutiny; some leaderboards reward emoji spam instead of real skill. To keep myself—and anyone who visits—focused on signal, I curate some resources here:
Real-world impact benchmarks: Tasks that translate directly into economic output or long-horizon autonomy. They tell us how close we are to agentic systems that earn or save money without hand-holding.
Frontier capability benchmarks: Fresh, contamination-resistant challenges. They stretch today’s best models and flag genuine capability jumps, even when headline scores on older exams have plateaued.
Human-preference leaderboards: Live, blinded evaluations where human voters pick their favorite output. They can surface subjective qualities that automated metrics can’t capture, but they’re also vulnerable to crowd bias and gaming.
Aggregators and dashboards: Composite indexes and one-stop scoreboards that pull many tests into one view, useful for quick model comparisons—provided you understand the tests behind the numbers.
Usage benchmarks: Live-deployment metrics—like token volumes and PR-merge rates—that show which models and agents people actually choose and trust in production.
Other benchmarks: Interesting ways of evaluating model capabilities beyond standard benchmarks, leaderboards, aggregators, and dashboards.
For each entry you’ll see what it is, why I value it, and watch-outs so you can judge results in context. I update the list whenever a new benchmark proves its worth or an old favorite shows cracks.
Real-world impact benchmarks
CRMArena-Pro
Description: Salesforce’s CRMArena-Pro stress-tests LLM agents on 19 expert-validated tasks that mirror daily CRM work—customer service, sales, and CPQ (configure, price, quote)—across both B2B and B2C orgs. The benchmark injects multi-turn dialogue, sandboxed Salesforce data, and “don’t-leak-this” confidentiality checks.
Why I like it: It measures skills that drive revenue—routing leads, fixing quote errors, mining sales calls—so scores translate to dollars saved or earned.
Caveats: The sandbox is synthetic; real Salesforce orgs hold messier data and idiosyncratic automations.
HealthBench
Description: OpenAI’s HealthBench tests LLMs on 5,000 realistic health conversations scored against 48,562 physician-written rubric criteria developed in partnership with 262 physicians. It spans themes like emergency care, global health, and clinical communication, and evaluates five core behaviors: accuracy, completeness, context awareness, communication, and instruction following.
Why I like it: It reflects real-world use. Instead of multiple choice, it uses open-ended tasks grounded in expert judgment, which addresses a criticism of less realistic multiple choice medical benchmarks. The “Consensus” and “Hard” subsets isolate safety-critical errors and stress-test model limits.
Caveats: Most examples are synthetic. Scoring relies on model graders, which, while benchmarked against physicians, aren’t infallible. And it scores single-turn outputs—real workflows often need multi-step exchanges.
SWE-Lancer
Description: Corpus of real Upwork software-engineering gigs—from $50 bug fixes to $30K feature builds—framed as autonomous tasks for agents.
Why I like it: Directly measures economic value: can an AI earn freelance income without human help?
Caveats: Evaluation still happens offline; no live leaderboard yet, so reproducibility depends on private re-runs.
METR Time-Horizon Evaluation
Description: Tracks the length of tasks, measured by how long they take skilled humans, that an agent can complete with roughly 50% reliability (a toy version of the fit is sketched below), updated quarterly.
Why I like it: Offers a single, intuitive number—task length—that rises as autonomy improves.
Caveats: Currently limited to software projects; extrapolating the same “steps-to-failure” curve to other domains is still speculative.
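For intuition, here is a minimal sketch of how a 50% time horizon could be fit from per-task results. The data, the logistic form, and the plain gradient-descent fit are all illustrative; METR’s actual methodology differs in the details.

```python
import numpy as np

# Illustrative per-task results: (time the task takes a skilled human, in minutes,
# and whether the agent succeeded). All numbers are made up.
results = [(2, 1), (5, 1), (8, 1), (15, 1), (30, 1), (30, 0),
           (60, 1), (60, 0), (120, 0), (240, 0), (480, 0)]

x = np.log([t for t, _ in results])           # log task length
y = np.array([s for _, s in results], float)  # 1 = success, 0 = failure

# Fit P(success) = sigmoid(a + b * log(minutes)) by plain gradient descent.
a, b = 0.0, 0.0
for _ in range(20000):
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))
    a -= 0.1 * np.mean(p - y)
    b -= 0.1 * np.mean((p - y) * x)

# The 50% time horizon is the task length where P(success) = 0.5,
# i.e. where a + b * log(t) = 0.
print(f"50% time horizon ~ {np.exp(-a / b):.0f} human-minutes")
```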
Vending-Bench
Description: Simulated vending-machine business that forces an agent to plan inventory, prices, and restocking over thousands of steps.
Why I like it: Captures long-horizon coherence and profit-and-loss thinking—skills most toy tasks ignore.
Caveats: Environment abstraction hides messy real-world variables (e.g. logistics, theft, hardware faults), so results may overestimate deploy-time robustness.
Vectara Hallucination Leaderboard
Description: Tracks answer-level hallucination rates for dozens of LLMs by posing factual questions, forcing each model to cite sources, and auto-checking those citations against ground-truth passages.
Why I like it: Zeroes in on a single, business-critical failure mode—making things up—and reports a clean percentage score, so I can spot trustworthy candidates for retrieval-augmented workflows at a glance.
Caveats: Uses a fixed question set and static grounding corpus; models fine-tuned on that data or citation format can overfit, and a “correct citation” doesn’t always capture subtler factual slips or outdated information.
Frontier capability benchmarks
FrontierMath
Description: Epoch AI’s private, held-out set of original math problems, ranging from Olympiad-level to research-level, that have never appeared online, preventing training-data leakage.
Why I like it: Provides a contamination-free yardstick for pure mathematical reasoning progress.
Caveats: Math prowess matters, but many enterprise use-cases hinge more on language, multimodal, and agentic skills, so real-world relevance is indirect.
GPQA
Description: Graduate-level multiple-choice science questions designed to thwart pure retrieval and test reasoning.
Why I like it: Mirrors the kind of conceptual problem-solving scientists face—good proxy for real-world research tasks.
Caveats: Nearly saturated: top models already beat human experts and score above 80%, so the successor benchmark, SuperGPQA, is worth watching next.
Humanity’s Last Exam
Description: 2,500 expert-written questions across dozens of disciplines, pitched as the “final boss” of closed-ended academic benchmarks.
Why I like it: Raises the bar after GPT-4 steamrolled legacy exams, offering an early warning system for the next capability jump.
Caveats: Breadth trumps depth; many items resemble Olympiad puzzles rather than day-to-day tasks, so wins here won’t always translate to economic impact.
Human preference leaderboards
Chatbot Arena
Description: Crowdsourced leaderboard where anonymous human voters pick a winner in blind, head-to-head chat battles, producing an Elo-style global ranking for each LLM (the rating math is sketched below).
Why I like it: Human judgment—rather than multiple-choice tests—promises a refreshing, real-conversation signal that complements automated benchmarks.
Caveats: Recent critiques show the arena can be gamed: labs enter dozens of private model variants and report only the best scores, and optimizing for emojis, flattery (“sycophancy”), and other superficial traits skews voter preferences.
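For reference, the rating that arenas like this (and MC-Bench and SciArena below) build from pairwise votes boils down to a simple Elo-style update rule. I believe Chatbot Arena now computes its scores with a related Bradley-Terry fit rather than online Elo, but the intuition is the same; the model names and votes below are made up.

```python
from collections import defaultdict

K = 32  # standard Elo step size

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e)
    ratings[loser] -= K * (1 - e)

# Hypothetical blind votes: (winner, loser) pairs from head-to-head battles.
votes = [("model-a", "model-b"), ("model-a", "model-c"),
         ("model-b", "model-c"), ("model-a", "model-b"),
         ("model-c", "model-b")]

ratings = defaultdict(lambda: 1000.0)  # everyone starts at the same rating
for winner, loser in votes:
    update(ratings, winner, loser)

for model, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {r:.0f}")
```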
Minecraft Benchmark (MC-Bench)
Description: Pits frontier AIs against the same build prompt in a sandboxed Minecraft server; humans vote blind on which blocky creation looks better, yielding an Elo-style leaderboard.
Why I like it: Tests many AI skills, including creativity, spatial reasoning, and code generation. Also, judging a pineapple statue is visual and intuitive, so even non-experts can spot real capability differences without the undue influence of psychological tricks like flattery.
Caveats: Aesthetic preferences, voting demographics, and user knowledge can skew scores (I sometimes don’t even know what the thing they’re generating is supposed to look like). Also, success in a stylized game world may not generalize to physical-world constraints or enterprise workflows.
SciArena
Description: Open, crowdsourced leaderboard for evaluating foundation models on scientific literature tasks. Researchers submit scientific questions, compare blind, literature-grounded responses from two models, and vote for the better one. Results update an Elo-style leaderboard.
Why I like it: This brings Chatbot Arena’s human-preference paradigm into science, augmenting static science benchmarks like GPQA. SciArena launched with a rigorously controlled evaluation: 102 expert annotators—each with peer-reviewed publications—evaluated model answers in blind, head-to-head comparisons. Their judgments showed strong agreement and high self-consistency, giving real weight to results. Votes are now open to others.
Caveats: SciArena focuses on standard foundation models, not agentic tools like OpenAI’s Deep Research. It also uses fixed retrieval and prompting pipelines, which can influence model performance—future iterations may vary these components for a fuller picture.
Aggregators and dashboards
Artificial Analysis
Description: Independent dashboard that compares leading LLMs on a composite intelligence index plus speed and price.
Why I like it: One glance shows which models deliver the best mix of smarts, latency, and cost.
Caveats: The intelligence index is based on underlying benchmarks that can have flaws (e.g., training-data contamination), which then propagate into the computed score (illustrated below).
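I don’t know Artificial Analysis’s actual methodology or weights, but a composite index is typically some weighted average of normalized benchmark scores, which is exactly why a flaw in any input benchmark flows straight into the headline number. A toy illustration with made-up scores:

```python
# Hypothetical raw benchmark scores (0-100) and weights; all values are made up.
scores = {
    "math":      {"model-a": 72, "model-b": 64},
    "coding":    {"model-a": 55, "model-b": 61},
    "reasoning": {"model-a": 80, "model-b": 77},
}
weights = {"math": 0.3, "coding": 0.4, "reasoning": 0.3}

def composite(model: str) -> float:
    """Weighted average of per-benchmark scores; inherits any flaw in its inputs."""
    return sum(weights[b] * scores[b][model] for b in scores)

for m in ("model-a", "model-b"):
    print(f"{m}: {composite(m):.1f}")
```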
Epoch AI Benchmarking Hub
Description: The nonprofit Epoch AI curates a public hub that tracks its own contamination-resistant tests (e.g., FrontierMath) alongside independently run benchmarks such as GPQA Diamond, with interactive charts and data downloads.
Why I like it: Trusted provider, great charting, and easy data export.
Caveats: Covers only a handful of benchmarks, and updates can lag model releases.
Scale AI SEAL Leaderboards
Description: Scale’s SEAL (Systematic Evaluation & Alignment) leaderboards rank frontier AIs on hidden, expert-graded datasets designed to resist prompt-leakage and training-data contamination.
Why I like it: Human evaluators plus undisclosed test sets make the scores much harder to game.
Caveats: Datasets and weighting rubric remain proprietary, and participation is opt-in, so results cover only models that submit to Scale’s evaluation pipeline.
Usage benchmarks
OpenRouter Rankings
Description: OpenRouter’s leaderboard aggregates raw token traffic across hundreds of connected models, sliceable by task category (programming, marketing, finance, etc.). The list updates in near-real time and lets you see which models get the most use.
Why I like it: Dollars follow tokens. High-volume models are the ones developers stake real workloads on, so the chart offers a market-revealed preference signal that complements capability scores.
Caveats: Reflects only usage that flows through OpenRouter’s API, so it undercounts direct-to-vendor traffic and closed-source enterprise deployments. Price can also impact usage, so if vendors offer discounted or even free tokens, it can skew the rankings.
PRArena
Description: A continuously updated dashboard that tallies every pull request an AI coding agent opens on public GitHub repositories, then tracks whether maintainers merge or reject it. The table and time-series charts surface total PRs, merged PRs, and per-agent success rates (a rough version of that calculation is sketched below).
Why I like it: It captures real engineering tasks and a hard-to-game outcome (merge). If an agent wins review and merges code, that’s proof of production-grade value—no synthetic tasks, no hidden test set.
Caveats: Only covers public repos; enterprise usage stays invisible. Merge odds also reflect repo culture and human reviewer bias, so success rates aren’t agent-only signal.
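I don’t know exactly how PRArena gathers its data, but GitHub’s public search API is enough to reproduce the basic merge-rate arithmetic for any agent account you care about. The account names below are placeholders, and unauthenticated search requests are heavily rate-limited.

```python
import requests

# Placeholder GitHub accounts for coding agents; substitute the real ones you track.
AGENTS = ["example-coding-agent", "another-coding-agent"]

def count(query: str) -> int:
    """Return the total number of hits for a GitHub issue/PR search query."""
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": 1},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["total_count"]

for agent in AGENTS:
    total = count(f"is:pr author:{agent}")
    merged = count(f"is:pr author:{agent} is:merged")
    rate = merged / total if total else 0.0
    print(f"{agent}: {merged}/{total} PRs merged ({rate:.1%})")
```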
Other benchmarks
Expert Model Comparison
Description: Experts (such as scientists and mathematicians) evaluate blinded model outputs on the same prompt, ranking which they prefer.
Why I like it: You get trustworthy judgment that avoids issues like sycophancy bias seen in non-expert human preference leaderboards like Chatbot Arena.
Caveats: Expensive, slow, and hard to scale. Depends heavily on the quality and calibration of your expert pool.
Private Benchmarks
Description: Custom evaluation sets tailored to your business or use case. You feed models the same proprietary tasks—code snippets, emails, workflows, whatever—and compare performance across models or versions.
Why I like it: You get signal on what actually matters to you. No leaderboard score can substitute for performance on your own tasks.
Caveats: Creating and maintaining a representative benchmark takes work (a minimal harness like the sketch below is one way to start).
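Here is a minimal sketch of such a harness: a few tasks, a crude substring grader, and a stand-in model function. Everything is a placeholder; in practice you would version the task set, use a grader that matches what “good” means for your workflow, and call your actual provider.

```python
from typing import Callable

# Placeholder proprietary-style tasks. In practice these live in a versioned file
# and mirror your real workload.
TASKS = [
    {"prompt": "Summarize ticket #1234 in one sentence.", "must_contain": "refund"},
    {"prompt": "Draft a polite reply declining the feature request.", "must_contain": "thank"},
]

def run_eval(ask_model: Callable[[str], str], tasks: list[dict]) -> float:
    """Return the fraction of tasks passed under a simple substring check."""
    passed = sum(
        task["must_contain"].lower() in ask_model(task["prompt"]).lower()
        for task in tasks
    )
    return passed / len(tasks)

# Stand-in model: replace with a call to whichever API or local model you use.
def dummy_model(prompt: str) -> str:
    return "Thank you for reaching out; a refund has been issued."

print(f"pass rate: {run_eval(dummy_model, TASKS):.1%}")
```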
Split-Testing
Description: Run A/B tests on model-generated outputs in production—e.g., email subject lines, ad copy, onboarding flows—and measure actual outcomes like click-throughs or conversions.
Why I like it: Ground truth doesn't lie. If Model A converts better than Model B, that’s all you need to know.
Caveats: Only works in live systems. You need volume, instrumentation, and statistical discipline (a basic significance check is sketched below). Can also be confounded by non-model factors (timing, segment, etc.).
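“Statistical discipline” mostly means not crowning a winner on noise. Here is a minimal sketch of a two-proportion z-test on conversion counts, with made-up numbers; real experiments also need pre-planned sample sizes and guardrails against peeking.

```python
from math import sqrt, erf

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
    return p_a, p_b, z, p_value

# Made-up results: Model A's subject lines vs. Model B's.
p_a, p_b, z, p = two_proportion_z(conv_a=230, n_a=5000, conv_b=192, n_b=5000)
print(f"A: {p_a:.2%}  B: {p_b:.2%}  z={z:.2f}  p={p:.3f}")
```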