Benchmarks I Like
Last updated June 22, 2025
AI evaluation is a maze. Some tests look rigorous but collapse under data-leak scrutiny; some leaderboards reward emoji spam instead of real skill. To keep myself—and anyone who visits—focused on signal, I curate some resources here:
Real-world impact benchmarks: Tasks that translate directly into economic output or long-horizon autonomy. They tell us how close we are to agentic systems that earn or save money without hand-holding.
Frontier capability benchmarks: Fresh, contamination-resistant challenges. They stretch today’s best models and flag genuine capability jumps, even when headline scores on older exams have plateaued.
Human-preference leaderboards: Live, blinded evaluations where human voters pick their favorite output. They can surface subjective qualities that automated metrics can’t capture, but they’re also vulnerable to crowd bias and gaming.
Aggregators and dashboards: Composite indexes and one-stop scoreboards that pull many tests into one view, useful for quick model comparisons—provided you understand the tests behind the numbers.
Usage benchmarks: Live-deployment metrics—like token volumes and PR-merge rates—that show which models and agents people actually choose and trust in production.
Other benchmarks: Interesting ways of evaluating model capabilities beyond standard benchmarks, leaderboards, aggregators, and dashboards.
For each entry you’ll see what it is, why I value it, and watch-outs so you can judge results in context. I update the list whenever a new benchmark proves its worth or an old favorite shows cracks.
Real-world impact benchmarks
CRMArena-Pro
Description: Salesforce’s CRMArena-Pro stress-tests LLM agents on 19 expert-validated tasks that mirror daily CRM work—customer service, sales, and CPQ (configure, price, quote)—across both B2B and B2C orgs. The benchmark injects multi-turn dialogue, sandboxed Salesforce data, and “don’t-leak-this” confidentiality checks.
Why I like it: It measures skills that drive revenue—routing leads, fixing quote errors, mining sales calls—so scores translate to dollars saved or earned.
Caveats: The sandbox is synthetic; real Salesforce orgs hold messier data and idiosyncratic automations.
HealthBench
Description: OpenAI’s HealthBench tests LLMs on 5,000 realistic health conversations scored against 48,562 physician-written rubric criteria developed in partnership with 262 physicians. It spans themes like emergency care, global health, and clinical communication, and evaluates five core behaviors: accuracy, completeness, context awareness, communication, and instruction following.
Why I like it: It reflects real-world use. Instead of multiple choice, it uses open-ended tasks grounded in expert judgment, which addresses a criticism of less realistic multiple choice medical benchmarks. The “Consensus” and “Hard” subsets isolate safety-critical errors and stress-test model limits.
Caveats: Most examples are synthetic. Scoring relies on model graders, which, while benchmarked against physicians, aren’t infallible. And it scores single-turn outputs—real workflows often need multi-step exchanges.
SWE-Lancer
Description: Corpus of real Upwork software-engineering gigs—from $50 bug fixes to $30K feature builds—framed as autonomous tasks for agents.
Why I like it: Directly measures economic value: can an AI earn freelance income without human help?
Caveats: Evaluation still happens offline; no live leaderboard yet, so reproducibility depends on private re-runs.
METR Time-Horizon Evaluation
Description: Tracks the length of tasks, measured by how long they take skilled humans, that an agent can complete with roughly 50% reliability (a toy version of the fit is sketched below), updated quarterly.
Why I like it: Offers a single, intuitive number—task length—that rises as autonomy improves.
Caveats: Currently limited to software projects; extrapolating the same “steps-to-failure” curve to other domains is still speculative.
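For intuition, here is a minimal sketch of how a 50% time horizon could be fit from per-task results. The data, the logistic form, and the plain gradient-descent fit are all illustrative; METR’s actual methodology differs in the details.

```python
import numpy as np

# Illustrative per-task results: (time the task takes a skilled human, in minutes,
# and whether the agent succeeded). All numbers are made up.
results = [(2, 1), (5, 1), (8, 1), (15, 1), (30, 1), (30, 0),
           (60, 1), (60, 0), (120, 0), (240, 0), (480, 0)]

x = np.log([t for t, _ in results])           # log task length
y = np.array([s for _, s in results], float)  # 1 = success, 0 = failure

# Fit P(success) = sigmoid(a + b * log(minutes)) by plain gradient descent.
a, b = 0.0, 0.0
for _ in range(20000):
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))
    a -= 0.1 * np.mean(p - y)
    b -= 0.1 * np.mean((p - y) * x)

# The 50% time horizon is the task length where P(success) = 0.5,
# i.e. where a + b * log(t) = 0.
print(f"50% time horizon ~ {np.exp(-a / b):.0f} human-minutes")
```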
Vending-Bench
Description: Simulated vending-machine business that forces an agent to plan inventory, prices, and restocking over thousands of steps.
Why I like it: Captures long-horizon coherence and profit-and-loss thinking—skills most toy tasks ignore.
Caveats: Environment abstraction hides messy real-world variables (e.g. logistics, theft, hardware faults), so results may overestimate deploy-time robustness.
Vectara Hallucination Leaderboard
Description: Tracks answer-level hallucination rates for dozens of LLMs by posing factual questions, forcing each model to cite sources, and auto-checking those citations against ground-truth passages.
Why I like it: Zeroes in on a single, business-critical failure mode—making things up—and reports a clean percentage score, so I can spot trustworthy candidates for retrieval-augmented workflows at a glance.
Caveats: Uses a fixed question set and static grounding corpus; models fine-tuned on that data or citation format can overfit, and a “correct citation” doesn’t always capture subtler factual slips or outdated information.
Frontier capability benchmarks
FrontierMath
Description: Epoch AI’s private, held-out set of original math problems, ranging from Olympiad-level to research-level, that have never appeared online, preventing training-data leakage.
Why I like it: Provides a contamination-free yardstick for pure mathematical reasoning progress.
Caveats: Math prowess matters, but many enterprise use-cases hinge more on language, multimodal, and agentic skills, so real-world relevance is indirect.
GPQA
Description: Graduate-level multiple-choice science questions designed to thwart pure retrieval and test reasoning.
Why I like it: Mirrors the kind of conceptual problem-solving scientists face—good proxy for real-world research tasks.
Caveats: Nearly saturated: top models already beat human experts and score above 80%, so the successor benchmark, SuperGPQA, is worth watching next.
Humanity’s Last Exam
Description: 2,500 expert-written questions across dozens of disciplines, pitched as the “final boss” of closed-ended academic benchmarks.
Why I like it: Raises the bar after GPT-4 steamrolled legacy exams, offering an early warning system for the next capability jump.
Caveats: Breadth trumps depth; many items resemble Olympiad puzzles rather than day-to-day tasks, so wins here won’t always translate to economic impact.
Human preference leaderboards
Chatbot Arena
Description: Crowdsourced leaderboard where anonymous human voters pick a winner in blind, head-to-head chat battles, producing an Elo-style global ranking for each LLM (the rating math is sketched below).
Why I like it: Human judgment—rather than multiple-choice tests—promises a refreshing, real-conversation signal that complements automated benchmarks.
Caveats: Recent critiques show the arena can be gamed: labs enter dozens of private model variants and report only the best scores, and optimizing for emojis, flattery (“sycophancy”), and other superficial traits skews voter preferences.
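For reference, the rating that arenas like this (and MC-Bench and SciArena below) build from pairwise votes boils down to a simple Elo-style update rule. I believe Chatbot Arena now computes its scores with a related Bradley-Terry fit rather than online Elo, but the intuition is the same; the model names and votes below are made up.

```python
from collections import defaultdict

K = 32  # standard Elo step size

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e)
    ratings[loser] -= K * (1 - e)

# Hypothetical blind votes: (winner, loser) pairs from head-to-head battles.
votes = [("model-a", "model-b"), ("model-a", "model-c"),
         ("model-b", "model-c"), ("model-a", "model-b"),
         ("model-c", "model-b")]

ratings = defaultdict(lambda: 1000.0)  # everyone starts at the same rating
for winner, loser in votes:
    update(ratings, winner, loser)

for model, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {r:.0f}")
```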
Minecraft Benchmark (MC-Bench)
Description: Pits frontier AIs against the same build prompt in a sandboxed Minecraft server; humans vote blind on which blocky creation looks better, yielding an Elo-style leaderboard.
Why I like it: Tests many AI skills, including creativity, spatial reasoning, and code generation. Also, judging a pineapple statue is visual and intuitive, so even non-experts can spot real capability differences without the undue influence of psychological tricks like flattery.
Caveats: Aesthetic preferences, voting demographics, and user knowledge can skew scores (I sometimes don’t even know what the thing they’re generating is supposed to look like). Also, success in a stylized game world may not generalize to physical-world constraints or enterprise workflows.
SciArena
Description: Open, crowdsourced leaderboard for evaluating foundation models on scientific literature tasks. Researchers submit scientific questions, compare blind, literature-grounded responses from two models, and vote for the better one. Results update an Elo-style leaderboard.
Why I like it: This brings Chatbot Arena’s human-preference paradigm into science, augmenting static science benchmarks like GPQA. SciArena launched with a rigorously controlled evaluation: 102 expert annotators—each with peer-reviewed publications—evaluated model answers in blind, head-to-head comparisons. Their judgments showed strong agreement and high self-consistency, giving real weight to results. Votes are now open to others.
Caveats: SciArena focuses on standard foundation models, not agentic tools like OpenAI’s Deep Research. It also uses fixed retrieval and prompting pipelines, which can influence model performance—future iterations may vary these components for a fuller picture.
Aggregators and dashboards
Artificial Analysis
Description: Independent dashboard that compares leading LLMs on a composite intelligence index plus speed and price.
Why I like it: One glance shows which models deliver the best mix of smarts, latency, and cost.
Caveats: The intelligence index is based on underlying benchmarks that can have flaws (e.g., training-data contamination), which then propagate into the computed score (illustrated below).
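I don’t know Artificial Analysis’s actual methodology or weights, but a composite index is typically some weighted average of normalized benchmark scores, which is exactly why a flaw in any input benchmark flows straight into the headline number. A toy illustration with made-up scores:

```python
# Hypothetical raw benchmark scores (0-100) and weights; all values are made up.
scores = {
    "math":      {"model-a": 72, "model-b": 64},
    "coding":    {"model-a": 55, "model-b": 61},
    "reasoning": {"model-a": 80, "model-b": 77},
}
weights = {"math": 0.3, "coding": 0.4, "reasoning": 0.3}

def composite(model: str) -> float:
    """Weighted average of per-benchmark scores; inherits any flaw in its inputs."""
    return sum(weights[b] * scores[b][model] for b in scores)

for m in ("model-a", "model-b"):
    print(f"{m}: {composite(m):.1f}")
```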
Epoch AI Benchmarking Hub
Description: The nonprofit Epoch AI curates a public hub that tracks its own contamination-resistant tests (e.g., FrontierMath) alongside independently run benchmarks such as GPQA Diamond, with interactive charts and data downloads.
Why I like it: Trusted provider, great charting, and easy data export.
Caveats: Covers only a handful of benchmarks, and updates can lag model releases.
Scale AI SEAL Leaderboards
Description: Scale’s SEAL (Systematic Evaluation & Alignment) leaderboards rank frontier AIs on hidden, expert-graded datasets designed to resist prompt-leakage and training-data contamination.
Why I like it: Human evaluators plus undisclosed test sets make the scores much harder to game.
Caveats: Datasets and weighting rubric remain proprietary, and participation is opt-in, so results cover only models that submit to Scale’s evaluation pipeline.
Usage benchmarks
OpenRouter Rankings
Description: OpenRouter’s leaderboard aggregates raw token traffic across hundreds of connected models, sliceable by task category (programming, marketing, finance, etc.). The list updates in near-real time and lets you see which models get the most use.
Why I like it: Dollars follow tokens. High-volume models are the ones developers stake real workloads on, so the chart offers a market-revealed preference signal that complements capability scores.
Caveats: Reflects only usage that flows through OpenRouter’s API, so it undercounts direct-to-vendor traffic and closed-source enterprise deployments. Price can also impact usage, so if vendors offer discounted or even free tokens, it can skew the rankings.
PRArena
Description: A continuously updated dashboard that tallies every pull request an AI coding agent opens on public GitHub repositories, then tracks whether maintainers merge or reject it. The table and time-series charts surface total PRs, merged PRs, and per-agent success rates (a rough version of that calculation is sketched below).
Why I like it: It captures real engineering tasks and a hard-to-game outcome (merge). If an agent wins review and merges code, that’s proof of production-grade value—no synthetic tasks, no hidden test set.
Caveats: Only covers public repos; enterprise usage stays invisible. Merge odds also reflect repo culture and human reviewer bias, so success rates aren’t agent-only signal.
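I don’t know exactly how PRArena gathers its data, but GitHub’s public search API is enough to reproduce the basic merge-rate arithmetic for any agent account you care about. The account names below are placeholders, and unauthenticated search requests are heavily rate-limited.

```python
import requests

# Placeholder GitHub accounts for coding agents; substitute the real ones you track.
AGENTS = ["example-coding-agent", "another-coding-agent"]

def count(query: str) -> int:
    """Return the total number of hits for a GitHub issue/PR search query."""
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": 1},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["total_count"]

for agent in AGENTS:
    total = count(f"is:pr author:{agent}")
    merged = count(f"is:pr author:{agent} is:merged")
    rate = merged / total if total else 0.0
    print(f"{agent}: {merged}/{total} PRs merged ({rate:.1%})")
```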
Other benchmarks
Expert Model Comparison
Description: Experts (such as scientists and mathematicians) evaluate blinded model outputs on the same prompt, ranking which they prefer.
Why I like it: You get trustworthy judgment that avoids issues like sycophancy bias seen in non-expert human preference leaderboards like Chatbot Arena.
Caveats: Expensive, slow, and hard to scale. Depends heavily on the quality and calibration of your expert pool.
Private Benchmarks
Description: Custom evaluation sets tailored to your business or use case. You feed models the same proprietary tasks—code snippets, emails, workflows, whatever—and compare performance across models or versions.
Why I like it: You get signal on what actually matters to you. No leaderboard score can substitute for performance on your own tasks.
Caveats: Creating and maintaining a representative benchmark takes work (a minimal harness like the sketch below is one way to start).
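Here is a minimal sketch of such a harness: a few tasks, a crude substring grader, and a stand-in model function. Everything is a placeholder; in practice you would version the task set, use a grader that matches what “good” means for your workflow, and call your actual provider.

```python
from typing import Callable

# Placeholder proprietary-style tasks. In practice these live in a versioned file
# and mirror your real workload.
TASKS = [
    {"prompt": "Summarize ticket #1234 in one sentence.", "must_contain": "refund"},
    {"prompt": "Draft a polite reply declining the feature request.", "must_contain": "thank"},
]

def run_eval(ask_model: Callable[[str], str], tasks: list[dict]) -> float:
    """Return the fraction of tasks passed under a simple substring check."""
    passed = sum(
        task["must_contain"].lower() in ask_model(task["prompt"]).lower()
        for task in tasks
    )
    return passed / len(tasks)

# Stand-in model: replace with a call to whichever API or local model you use.
def dummy_model(prompt: str) -> str:
    return "Thank you for reaching out; a refund has been issued."

print(f"pass rate: {run_eval(dummy_model, TASKS):.1%}")
```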
Split-Testing
Description: Run A/B tests on model-generated outputs in production—e.g., email subject lines, ad copy, onboarding flows—and measure actual outcomes like click-throughs or conversions.
Why I like it: Ground truth doesn't lie. If Model A converts better than Model B, that’s all you need to know.
Caveats: Only works in live systems. You need volume, instrumentation, and statistical discipline (a basic significance check is sketched below). Can also be confounded by non-model factors (timing, segment, etc.).
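“Statistical discipline” mostly means not crowning a winner on noise. Here is a minimal sketch of a two-proportion z-test on conversion counts, with made-up numbers; real experiments also need pre-planned sample sizes and guardrails against peeking.

```python
from math import sqrt, erf

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
    return p_a, p_b, z, p_value

# Made-up results: Model A's subject lines vs. Model B's.
p_a, p_b, z, p = two_proportion_z(conv_a=230, n_a=5000, conv_b=192, n_b=5000)
print(f"A: {p_a:.2%}  B: {p_b:.2%}  z={z:.2f}  p={p:.3f}")
```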