The results aren’t reproducible
Here is a fun experiment. Look up GPT-4o's MMLU-Pro score across different sources:
- According to the original paper? 72.6%
- According to Kaggle Open Benchmarks? 73.0% ±0.8%
- According to LLM Stats? 85.7%
- According to Artificial Analysis? 74.0% (GPT-4o May), 74.8% (GPT-4o Nov), or 77.3% (GPT-4o ChatGPT)
The top models differ by rounding errors
This gets especially absurd when you look at the top performers. On Kaggle's Open Benchmarks leaderboard, the top three models differ by exactly one percentage point (see the quick noise check after the list):
- Claude Opus 4.1 (2025-08-05): 87.9%
- GPT-5 (2025-08-07): 87.1%
- Claude Opus 4 (2025-05-14): 86.9%
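How meaningful is a one-point gap? Here is a back-of-the-envelope check, treating accuracy as a binomial proportion. The question count is an assumption: MMLU-Pro has roughly 12,000 questions, and a leaderboard that evaluates on a subset would have even wider intervals.

```python
import math

# 95% confidence interval for an accuracy measured on n questions.
# Assumption: n ~ 12,000, roughly MMLU-Pro's full size.
N = 12_000

def ci95(score: float, n: int = N) -> tuple[float, float]:
    half = 1.96 * math.sqrt(score * (1 - score) / n)
    return (score - half, score + half)

for name, score in [("Claude Opus 4.1", 0.879),
                    ("GPT-5", 0.871),
                    ("Claude Opus 4", 0.869)]:
    lo, hi = ci95(score)
    print(f"{name}: {lo:.1%} to {hi:.1%}")
# Claude Opus 4.1: 87.3% to 88.5%
# GPT-5:           86.5% to 87.7%
# Claude Opus 4:   86.3% to 87.5%
```

All three intervals overlap, which is consistent with Kaggle's own ±0.8% error bars: statistically, this podium is closer to a three-way tie than a ranking.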
Why does this happen?
Different leaderboards use:
- Different prompting strategies
- Different evaluation protocols (how they parse answers; see the toy example after this list)
- Different testing conditions
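The second point is easy to underestimate, so here is a toy illustration. The model outputs below are invented, and both parsers are simplified stand-ins for real harness code, but the effect is real: the same completions get different scores depending on the parsing rule.

```python
import re

# Four invented completions to a question whose correct answer is B.
completions = [
    "The answer is (B).",
    "Final answer: B",
    "Both (A) and (B) are plausible, but the answer is (B).",
    "It has to be B because of the second clause.",
]

def strict_parse(text: str) -> str | None:
    """Accept only the canonical 'answer is (X)' template."""
    m = re.search(r"answer is \(([A-D])\)", text)
    return m.group(1) if m else None

def lenient_parse(text: str) -> str | None:
    """Take the first standalone A-D letter anywhere in the output."""
    m = re.search(r"\b([A-D])\b", text)
    return m.group(1) if m else None

strict = sum(strict_parse(c) == "B" for c in completions) / len(completions)
lenient = sum(lenient_parse(c) == "B" for c in completions) / len(completions)
print(f"strict: {strict:.0%}, lenient: {lenient:.0%}")  # strict: 50%, lenient: 75%
```

Real harnesses are less crude than this, but across thousands of questions even a one-point parsing discrepancy is enough to reshuffle the podium above.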
What this means for you
If you choose an LLM based on MMLU scores, you're essentially picking based on vibes. The model that's "#1" on one leaderboard is #4 on another. The solution? Test models on your actual use case. With your data. Under your conditions.

Running your own benchmark shifts everything
Want to see how much the rankings change when you test on real tasks instead of MMLU? This analysis ran a simple benchmark using 100 articles from NewsAPI about Australia, Sydney, and Melbourne, asking models to extract author names from the HTML. A straightforward task, but one that requires understanding real-world data structure. The results completely flipped the leaderboard. See the full analysis here.

But here's the kicker: GPT-3.5 Turbo (from 2023) was comparable to every flagship 2025 model, and beat the MMLU-Pro champion Claude Opus 4.1. That's the only number that matters: the one that reflects your actual use case.
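If you want to try something like this yourself, here is a minimal sketch of such a harness. Everything specific in it is an assumption for illustration, not the original analysis's setup: the prompt, the exact-match scoring against NewsAPI's author metadata, and the model list are all placeholders.

```python
import os
import requests
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def fetch_articles(query: str, n: int = 100) -> list[dict]:
    """Pull article metadata (including the 'author' ground truth) from NewsAPI."""
    resp = requests.get(
        "https://newsapi.org/v2/everything",
        params={"q": query, "pageSize": n, "apiKey": os.environ["NEWSAPI_KEY"]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["articles"]

def extract_author(model: str, html: str) -> str:
    """Ask the model to pull the author's name out of raw article HTML."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Extract the author's name from this article HTML. "
                       "Reply with the name only.\n\n" + html[:20000],
        }],
    )
    return reply.choices[0].message.content.strip()

def run_benchmark(model: str, articles: list[dict]) -> float:
    """Score = fraction of articles where the answer matches NewsAPI's author field."""
    scored = [a for a in articles if a.get("author")]
    correct = 0
    for art in scored:
        html = requests.get(art["url"], timeout=30).text
        correct += extract_author(model, html).lower() == art["author"].strip().lower()
    return correct / len(scored)

if __name__ == "__main__":
    articles = fetch_articles("Australia OR Sydney OR Melbourne")
    for model in ("gpt-4o", "gpt-3.5-turbo"):  # placeholder model list
        print(model, run_benchmark(model, articles))
```

Note that even here the scoring rule is a choice: exact match penalizes a model that returns "By Jane Smith" instead of "Jane Smith", which is exactly the evaluation-protocol problem from earlier.

Build your own benchmark in 5 minutes → Start testing for free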