Monday Momentum
Precision Over Generality
The push to make AI benchmarking more relevant
Happy Monday!
The tech world is laser-focused on the latest AI model releases and capabilities, but we're missing a more fundamental crisis: our ability to meaningfully measure AI progress is breaking down. Standard benchmarks that once challenged top models now routinely see scores above 90%. What happens when our measurement tools can no longer keep up with what we've built? And how do we ensure we aren't creating biased benchmarks, or decide what is even worth testing as model capabilities diverge?
We're facing a paradigm shift in how we evaluate AI systems. Traditional benchmarks are becoming obsolete because they're testing the wrong things. The future of AI benchmarking lies in measuring real-world economic impact rather than academic metrics.
The Meta Trend
A profound transformation is occurring in how we evaluate AI capabilities. We're moving from synthetic academic tests to real-world economic impact measurements. The goal can’t just be creating harder tests. We need to rethink what we should be measuring in the first place.
Pattern Recognition
Three key developments illuminate this shift:
Standard benchmarks like MMLU (testing academic subjects) and MATH are becoming saturated, with top models scoring over 90%. But these high scores mask significant gaps in real-world problem-solving capabilities. Scale AI's research shows that 78% of models that ace traditional benchmarks still fail basic deployment checks.
OpenAI's new SWE-Lancer benchmark takes a radically different approach: instead of abstract coding tests, it measures a model's ability to earn real money by completing actual freelance software engineering tasks. The benchmark maps directly to economic value, using real market prices to weight task difficulty.
The EU AI Act's upcoming implementation requires third-party benchmarking for high-risk systems with specific coverage requirements, showing how evaluation is moving from academic concerns to regulatory and real-world ones. Early results suggest over 72% of systems fail initial submissions.
Just as models are diverging in capabilities, testing must diverge to measure more specific and impactful use cases. Most models are now capable of reading, writing, and performing math at a high level. Previous academic measurements will largely become noise, and specific use-case testing is required to differentiate real-world model capabilities.
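To make the economic framing concrete, here is a minimal sketch of what dollar-weighted scoring might look like, in the spirit of benchmarks like SWE-Lancer. The task list, payouts, and `economic_score` function are illustrative assumptions, not the actual benchmark's data or methodology.

```python
# Hypothetical sketch: score a model by the fraction of total market value
# it captures, rather than by the fraction of tasks it passes.
# Payouts and results below are invented for illustration only.

def economic_score(tasks):
    """Return the share of total task value earned by completed tasks."""
    total_value = sum(t["payout_usd"] for t in tasks)
    earned = sum(t["payout_usd"] for t in tasks if t["completed"])
    return earned / total_value if total_value else 0.0

tasks = [
    {"payout_usd": 250,  "completed": True},   # small bug fix
    {"payout_usd": 1000, "completed": False},  # complex feature build
    {"payout_usd": 750,  "completed": True},   # mid-size refactor
]

print(economic_score(tasks))  # 0.5 — half the market value captured
```

Note how this differs from a plain pass rate: the model above completes 2 of 3 tasks (67%) but captures only 50% of the value, because the hardest, highest-paying task is the one it fails.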
The Contrarian Take
The obsession with academic benchmarks has led us astray. The AI community's focus on improving test scores has created a false sense of progress. Real capability isn't about acing multiple-choice tests – it's about creating tangible economic and social value.
These measurement methods have given skeptics ample fuel. It has become much easier to tune a model specifically to perform well on a known benchmark, and when those models reach actual deployment, the holes begin to show.
Nicolay Gerold recently wrote about RAG and search capability on LinkedIn, highlighting that newer models' larger context windows and search capabilities are often constrained less by retrieval than by reasoning over messy, real-world information. The added complexity of putting these models to work is where their benchmark results fail to meet expectations. The real world contains ample nuance, and models must be able to reason through that nuance to be effective.
Practical Implications
For investors and builders in the AI space, this shift has major implications:
- The most valuable AI companies won't be those with the highest benchmark scores, but those that can demonstrate concrete economic impact
- Investment strategies should focus on measurable business value rather than academic metrics
- AI development teams need to reorient their optimization targets from test performance to real-world utility
- Due diligence processes for AI investments must evolve to incorporate economic impact measurements
In motion,
Justin Wright
If traditional benchmarks are becoming obsolete, what's the right way to measure AI progress? How do we create evaluation frameworks that capture not just technical capabilities, but real-world impact?

Moving forward, this list will become smaller and hopefully more impactful.
As a brief disclaimer I sometimes include links to products which may pay me a commission for their purchase. I only recommend products I personally use and believe in. The contents of this newsletter are my viewpoints and are not meant to be taken as investment advice in any capacity. Thanks for reading!