The results aren’t reproducible
Here is a fun experiment. Look up GPT-4o's MMLU-Pro score across different sources:
- According to the original paper? 72.6%
- According to Kaggle Open Benchmarks? 73.0% ±0.8%
- According to LLM Stats? 85.7%
- According to Artificial Analysis? 74.0% (GPT-4o May), 74.8% (GPT-4o Nov), or 77.3% (GPT-4o ChatGPT)
The top models differ by rounding errors
This gets especially absurd when you look at the top performers. On Kaggle's Open Benchmarks leaderboard, the top three models differ by exactly one percentage point (see the quick noise check after the list):
- Claude Opus 4.1 (2025-08-05): 87.9%
- GPT-5 (2025-08-07): 87.1%
- Claude Opus 4 (2025-05-14): 86.9%
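How meaningful is a one-point gap? Here is a back-of-the-envelope check, treating accuracy as a binomial proportion. The question count is an assumption: MMLU-Pro has roughly 12,000 questions, and a leaderboard that evaluates on a subset would have even wider intervals.

```python
import math

# 95% confidence interval for an accuracy measured on n questions.
# Assumption: n ~ 12,000, roughly MMLU-Pro's full size.
N = 12_000

def ci95(score: float, n: int = N) -> tuple[float, float]:
    half = 1.96 * math.sqrt(score * (1 - score) / n)
    return (score - half, score + half)

for name, score in [("Claude Opus 4.1", 0.879),
                    ("GPT-5", 0.871),
                    ("Claude Opus 4", 0.869)]:
    lo, hi = ci95(score)
    print(f"{name}: {lo:.1%} to {hi:.1%}")
# Claude Opus 4.1: 87.3% to 88.5%
# GPT-5:           86.5% to 87.7%
# Claude Opus 4:   86.3% to 87.5%
```

All three intervals overlap, which is consistent with Kaggle's own ±0.8% error bars: statistically, this podium is closer to a three-way tie than a ranking.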
Why does this happen?
Different leaderboards use:
- Different prompting strategies
- Different evaluation protocols (how they parse answers; see the toy example after this list)
- Different testing conditions
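The second point is easy to underestimate, so here is a toy illustration. The model outputs below are invented, and both parsers are simplified stand-ins for real harness code, but the effect is real: the same completions get different scores depending on the parsing rule.

```python
import re

# Four invented completions to a question whose correct answer is B.
completions = [
    "The answer is (B).",
    "Final answer: B",
    "Both (A) and (B) are plausible, but the answer is (B).",
    "It has to be B because of the second clause.",
]

def strict_parse(text: str) -> str | None:
    """Accept only the canonical 'answer is (X)' template."""
    m = re.search(r"answer is \(([A-D])\)", text)
    return m.group(1) if m else None

def lenient_parse(text: str) -> str | None:
    """Take the first standalone A-D letter anywhere in the output."""
    m = re.search(r"\b([A-D])\b", text)
    return m.group(1) if m else None

strict = sum(strict_parse(c) == "B" for c in completions) / len(completions)
lenient = sum(lenient_parse(c) == "B" for c in completions) / len(completions)
print(f"strict: {strict:.0%}, lenient: {lenient:.0%}")  # strict: 50%, lenient: 75%
```

Real harnesses are less crude than this, but across thousands of questions even a one-point parsing discrepancy is enough to reshuffle the podium above.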
What this means for you
If you choose an LLM based on MMLU scores, you're essentially picking based on vibes. The model that's "#1" on one leaderboard is #4 on another. The solution? Test models on your actual use case. With your data. Under your conditions.

Running your own benchmark shifts everything
Want to see how much the rankings change when you test on real tasks instead of MMLU? This analysis ran a simple benchmark using 100 articles from NewsAPI about Australia, Sydney, and Melbourne, asking models to extract author names from the HTML. A straightforward task, but one that requires understanding real-world data structure. The results completely flipped the leaderboard. See the full analysis here.

But here's the kicker: GPT-3.5 Turbo (from 2023) was comparable to every flagship 2025 model, and beat the MMLU-Pro champion Claude Opus 4.1. That's the only number that matters: the one that reflects your actual use case.
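If you want to try something like this yourself, here is a minimal sketch of such a harness. Everything specific in it is an assumption for illustration, not the original analysis's setup: the prompt, the exact-match scoring against NewsAPI's author metadata, and the model list are all placeholders.

```python
import os
import requests
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def fetch_articles(query: str, n: int = 100) -> list[dict]:
    """Pull article metadata (including the 'author' ground truth) from NewsAPI."""
    resp = requests.get(
        "https://newsapi.org/v2/everything",
        params={"q": query, "pageSize": n, "apiKey": os.environ["NEWSAPI_KEY"]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["articles"]

def extract_author(model: str, html: str) -> str:
    """Ask the model to pull the author's name out of raw article HTML."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Extract the author's name from this article HTML. "
                       "Reply with the name only.\n\n" + html[:20000],
        }],
    )
    return reply.choices[0].message.content.strip()

def run_benchmark(model: str, articles: list[dict]) -> float:
    """Score = fraction of articles where the answer matches NewsAPI's author field."""
    scored = [a for a in articles if a.get("author")]
    correct = 0
    for art in scored:
        html = requests.get(art["url"], timeout=30).text
        correct += extract_author(model, html).lower() == art["author"].strip().lower()
    return correct / len(scored)

if __name__ == "__main__":
    articles = fetch_articles("Australia OR Sydney OR Melbourne")
    for model in ("gpt-4o", "gpt-3.5-turbo"):  # placeholder model list
        print(model, run_benchmark(model, articles))
```

Note that even here the scoring rule is a choice: exact match penalizes a model that returns "By Jane Smith" instead of "Jane Smith", which is exactly the evaluation-protocol problem from earlier.

Build your own benchmark in 5 minutes → Start testing for free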