Better Ways to Benchmark AI Today
A practical guide for when leaderboards don’t tell the full story
Given the pace of new model releases and the controversy over how to measure them, people have been asking me: “What benchmarks should I trust now?”
Fair question. Traditional benchmarks are saturated or compromised. Even newer ones like Chatbot Arena—which I’ve long loved—show signs of weakness. And this isn’t just a nerdy technical problem. It affects every business trying to choose AI models or products.
So let’s look at why we benchmark, challenges we face, and what feels to me (admittedly unscientifically) like an emerging consensus on what’s now best.
Why We Benchmark and What’s Gone Wrong
We benchmark AI models to compare them and gauge overall industry progress. Benchmarks typically consist of shared questions or tasks we can give any model, with the assumption that the model hasn’t seen the answers during training.
But today, we’re finding this approach challenged on multiple fronts, including:
Saturation: Some benchmarks, like MMLU, are nearly maxed out. Most frontier models now score 85–90%, leaving little room to tell them apart—or to show meaningful progress.
Contamination: Many benchmarks have leaked into training data. Some models have been caught regurgitating answers word for word. If a test set is no longer “unseen,” its score doesn’t mean much.
Narrowness: Benchmarks often test trivia, logic puzzles, or synthetic tasks—not the messy, practical problems people hire AI to solve. A model might ace a benchmark and still fail at real-world coding, analysis, or support tasks.
Chatbot Arena was meant to solve these problems, garnering much praise. It replaced static tests with head-to-head comparisons: two models, one prompt, and a human vote for the better answer. More real-world, harder to saturate, and less prone to contamination.
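Under the hood, arena-style leaderboards typically turn those pairwise votes into a ranking by fitting an Elo or Bradley–Terry style rating to the vote history. The snippet below is a minimal online Elo update over made-up votes, a sketch of the general idea rather than Chatbot Arena’s actual pipeline.

```python
from collections import defaultdict

K = 32  # update step size; real leaderboards tune this or fit ratings offline instead

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rate(votes, base: float = 1000.0) -> dict:
    """votes: iterable of (model_a, model_b, winner) where winner is 'a', 'b', or 'tie'."""
    ratings = defaultdict(lambda: base)
    for a, b, winner in votes:
        e_a = expected_score(ratings[a], ratings[b])
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[a] += K * (s_a - e_a)
        ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)

# Made-up vote log: which of two anonymized models the user preferred.
votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "a"),
]
print(sorted(rate(votes).items(), key=lambda kv: -kv[1]))
```

Note that nothing in this math checks answer quality; the rating reflects whatever the voters happen to reward.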
But recent findings cast doubt on it, too. A study by Cohere found that some big labs submit models to far more Arena battles than others, gaining a feedback advantage that lets them optimize for Chatbot Arena performance. Worse, companies can privately test dozens of model checkpoints (the study counted 27 for Llama 4), cherry-pick the best, and hide the rest.
And even when this doesn’t happen, Chatbot Arena rewards whatever human voters prefer, and voters often favor sycophantic, emoji-laden responses over accurate ones.
Your Best Bets Right Now
Despite all this, it does feel like the industry’s approach to benchmarking is maturing. We make mistakes, we learn from them, we improve. In my unscientific opinion—based on experience and on reading the tea leaves (and X posts)—here are your top choices:
1. Private, Unpublished Benchmarks
Top AI labs (and many other companies, and even individuals) now keep internal benchmark datasets that are hidden from the public. OpenAI, for example, uses its internal codebase, which it knows isn’t in any public training data, for benchmarking.
Internal benchmarks avoid the data contamination problem and reflect domain-specific needs. If you’re deploying AI, your own internal test set—with real, representative tasks—is often the best benchmark there is.
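If you want a starting point, a private benchmark can be as simple as a held-out list of real tasks plus a scoring function. The sketch below is purely illustrative: `call_model`, the tasks, and the keyword-based scorer are placeholders for your own client, workload, and grading logic.

```python
# Minimal private-benchmark sketch. Everything here is a placeholder:
# swap call_model for your real client and TASKS for tasks drawn from your own workload.

TASKS = [
    {"prompt": "Summarize support ticket #4821 in one sentence.",
     "expected_keywords": ["refund", "duplicate charge"]},
    {"prompt": "Rewrite this CSV header in snake_case: 'Order ID,Created At'",
     "expected_keywords": ["order_id", "created_at"]},
]

def call_model(model: str, prompt: str) -> str:
    """Stand-in for whatever API or local client you actually use."""
    raise NotImplementedError

def score(output: str, expected_keywords: list[str]) -> float:
    """Crude keyword check; real harnesses use exact match, unit tests, or an LLM grader."""
    hits = sum(kw.lower() in output.lower() for kw in expected_keywords)
    return hits / len(expected_keywords)

def run_benchmark(model: str) -> float:
    """Average score for one model across the private task set."""
    results = [score(call_model(model, t["prompt"]), t["expected_keywords"]) for t in TASKS]
    return sum(results) / len(results)
```

The important part isn’t the scoring trick; it’s that the tasks come from your real workload and never leave your organization.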
2. Independent Evaluators
Scale’s SEAL initiative runs standardized, private benchmark suites across models, free from training leaks. It’s one of the most credible efforts right now for independent benchmarking. Because the prompts stay private, the results are much harder to cherry-pick or over-optimize for.
Other third parties are starting to emerge as well, such as Vals.ai for tasks that mimic industry use cases like those in law. (Know others? Please share in the comments.)
3. Real-World Simulation Benchmarks
We’re seeing increased interest in real-world simulation benchmarks, including some that tie performance to measurable economic value. These kinds of benchmarks—grounded in real tasks—can tell us more about the value AI will offer to most users.
OpenAI’s SWE-Lancer is a great example: it measures how well models complete actual freelance coding tasks pulled from platforms like Upwork—complete with dollar values attached. Performance translates directly to economic utility.
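The scoring idea behind this kind of benchmark is easy to sketch: each task carries a payout, and a model’s result can be reported as dollars earned. The task names and numbers below are invented for illustration and aren’t SWE-Lancer’s actual data.

```python
# Made-up task results; each task carries the payout it was listed for on the platform.
results = [
    {"task": "fix-login-redirect-bug", "payout_usd": 250,  "passed": True},
    {"task": "add-csv-export-feature", "payout_usd": 1000, "passed": False},
    {"task": "migrate-settings-page",  "payout_usd": 500,  "passed": True},
]

earned = sum(r["payout_usd"] for r in results if r["passed"])
available = sum(r["payout_usd"] for r in results)
print(f"Earned ${earned:,} of ${available:,} available ({earned / available:.0%})")
```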
4. Complex, Blinded Human Evaluations
Benchmarks like MC-Bench (Minecraft) have emerged that require models to perform tasks involving multiple elements like instruction following, world knowledge, and spatial reasoning. On MC-Bench, users ask models to build things in Minecraft, then vote on the output without knowing which model made what. This lets models’ creativity, reasoning, and execution speak for themselves.
I’m a bit more concerned about MC-Bench after learning of the issues with Chatbot Arena, which uses a similar blinded head-to-head format. But I’m hopeful the team behind MC-Bench will avoid those issues, and it’s harder to sway voters with emojis and sycophancy when the output is a Minecraft build.
5. Rankings from Real Usage
A simple way to know which models are best is to look at what users actually choose. Tools like Cursor (AI-powered IDE) and OpenRouter (centralized AI endpoint) serve multiple models and compile data on their usage. OpenRouter even breaks this down by category.
Of course, usage isn’t perfect. It’s skewed, for example, by pricing and defaults. But it reflects real preferences, and we can control for factors like pricing to get a clearer picture of model choice; the sketch below shows one rough way to do that.
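As a rough, made-up illustration of what controlling for pricing could look like (not a method OpenRouter or anyone else publishes), you can compare each model’s share of tokens served to its share of spend:

```python
# Invented usage numbers: tokens served (in millions) and price per million tokens.
usage = {
    "model-a": {"tokens_m": 500, "price_per_m": 0.50},   # cheap, high volume
    "model-b": {"tokens_m": 120, "price_per_m": 15.00},  # pricey, lower volume
}

total_tokens = sum(u["tokens_m"] for u in usage.values())
total_spend = sum(u["tokens_m"] * u["price_per_m"] for u in usage.values())

for name, u in usage.items():
    token_share = u["tokens_m"] / total_tokens
    spend_share = u["tokens_m"] * u["price_per_m"] / total_spend
    print(f"{name}: {token_share:.0%} of tokens, {spend_share:.0%} of spend")
```

A model that keeps a large share of spend despite a much higher price is arguably a stronger signal of preference than raw token counts alone.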
6. Aggregators
Services like Artificial Analysis compile many benchmarks into a single dashboard, which makes it easier to compare models at a glance.
I’m generally a fan of Artificial Analysis, but it’s important to note that its “intelligence” metric aggregates results of underlying benchmarks that can suffer from saturation, contamination, and narrowness.
So, while such sites are helpful for high-level comparisons, don’t rely on them exclusively.
A Future Fix, and Going Beyond the Model
While the six approaches above are the best options we have today, they’re still imperfect. What we’re missing is a neutral, industry-backed benchmarking body—something like an Underwriters Laboratories for AI. It would design evaluations that evolve with the technology, enforce testing protocols agreed upon by multiple players, and reduce room for cherry-picking.
Meanwhile, it’s worth remembering something important: AI models aren’t the whole product. Unless you’re deploying models yourself for a specific task, you’re likely using them within a product. And there, design, features, and fine-tuning can matter even more.
So while benchmarks matter, they’re not all that matters. Look at real-world performance. Run your own tests. And when in doubt, trust what works for you, not what ranks highest.