To Eval or Not, That Is the CAIO Question
A CAIO at a mid-size firm reached out after watching the Twitter saga unfold:
"Should we invest in eval infrastructure or not?"
The question surprised me.
I'd completed Shreya and Hamel's comprehensive course on AI evaluation just last month and thought the value question was settled. And yet here we were: the AI builder community split into warring camps, half saying evals are essential, half saying they're a waste.
When people say evals, unit testing is an approximate metaphor, but they mean something more: evaluation systems, the scorecards for models. Do they reason correctly? Follow instructions? Avoid mistakes? Stay consistent?
Think of them as medical checkups for AI.
Early on, you just need to know if the patient is breathing. Later, when millions depend on it, you need the full diagnostic lab to catch small problems before they spiral. So the real question isn't whether evals are good or bad. It's: at what stage do they matter?
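To make "eval" concrete, here's a minimal sketch of what such a scorecard boils down to in code. It's a toy under stated assumptions: the `EvalCase` dataclass, the `run_evals` helper, and the stand-in model are hypothetical illustrations, not any particular framework's API.

```python
# A toy eval harness: run a fixed set of prompts through a model,
# check each answer against an expected result, and report a pass rate.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str  # substring we expect to see in a correct answer

def run_evals(cases: list[EvalCase], ask: Callable[[str], str]) -> float:
    """Return the fraction of cases where the model's answer contains the expected text."""
    passed = sum(
        1 for case in cases
        if case.expected.lower() in ask(case.prompt).lower()
    )
    return passed / len(cases)

if __name__ == "__main__":
    suite = [
        EvalCase("What is 17 * 3?", "51"),
        EvalCase("Name the capital of France.", "Paris"),
    ]
    # Stand-in "model" so the sketch runs end to end; swap in a real API call.
    fake_model = lambda prompt: "Paris." if "France" in prompt else "The answer is 51."
    print(f"pass rate: {run_evals(suite, fake_model):.0%}")
```

Real suites add graders for reasoning, instruction-following, and consistency, but the shape is the same: cases in, scores out.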
This debate reminded me of earlier methodology debates in startups. First it was "lean startup": suddenly every company needed customer development and MVPs. Soon scrappy became crappy. Then "design thinking" became the gospel, with its post-it note rituals and empathy maps. Now it's "evals": rigorous measurement and systematic optimization for every AI product. Each time, the advice sounds sensible until you notice something: the people giving it aren't the people who need it most.
Big companies love methodologies because they have a different problem than startups. When you're Google or Facebook, you have millions of users and need to optimize incrementally without breaking what works. When you're OpenAI shipping GPT-4, systematic evaluation prevents regressions across thousands of use cases. Design thinking makes sense when you have resources to run proper user research. Lean startup helps when you're a large organization trying to act more entrepreneurial.
But young companies don't need design thinking; they need to build something magical for their users. They don't need the lean startup methodology; they need to move fast and break things until something works. And they don't need comprehensive evaluation frameworks; they need to ship and see if users go crazy using their product. Successful big companies systematize what made them successful, then others sell those systems to companies that haven't succeeded yet. It's like trying to teach someone to run by giving them a marathon training plan.
The problem is timing.
When you're finding product-market fit, evaluation can become procrastination. You don't yet know what "good" looks like, so you can't meaningfully measure it. Worse, building evaluation infrastructure pulls engineering resources away from the core loop: ship, learn, iterate. I've seen startups spend weeks building sophisticated A/B testing frameworks before they had a single happy user. They were optimizing for a problem they hadn't solved yet. It's like running DEXA scans on a baby.
Evals are like hiring a VP of Sales: do it too early and it actively hurts you, but it's essential once you're ready for it.
The companies that progress well early aren't the ones with the best measurement; they're the ones with the fastest learning cycles. The "vibe-shipping" everyone mocks is actually the right strategy when you're still figuring out what to build.
But here's where the eval advocates are absolutely right: once you know people want your product, systematic improvement becomes critical. You need to enhance what works without breaking it. You need to catch regressions before users do. You need data to guide optimization decisions. Now regression is the risk. What delighted people yesterday can be broken by a careless change today.
OpenAI needs comprehensive evals because they're serving millions of users across thousands of use cases. A 2% regression in reasoning capability affects every customer. The cost of systematic evaluation is tiny compared to the cost of shipping broken models. At this stage, evals are no longer overhead. They're insurance. Evals become oxygen only after you've lit the fire.
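As a concrete illustration of that insurance, a team might gate releases on eval results: compare a candidate model's score against the current baseline and block the ship if it regresses past a tolerance. This is a hypothetical sketch, not OpenAI's actual process; the 2% threshold just echoes the example above.

```python
# Illustrative release gate: block a candidate model whose eval score
# drops more than a small tolerance below the current baseline.
def should_ship(baseline_score: float, candidate_score: float,
                max_regression: float = 0.02) -> bool:
    """Allow the release only if the candidate hasn't regressed beyond the tolerance."""
    return candidate_score >= baseline_score - max_regression

if __name__ == "__main__":
    baseline, candidate = 0.91, 0.88  # hypothetical pass rates from an eval suite
    if should_ship(baseline, candidate):
        print("eval gate passed: safe to ship")
    else:
        print(f"blocked: candidate regressed {baseline - candidate:.1%} on the eval suite")
```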
The real problem isn't evals themselves, it's the category error. The evaluation industry is selling microscopes to people who need binoculars. Pre-PMF companies need to see the big picture fast, not measure tiny improvements with scientific precision. They need pulse checks, not MRI scans.
Eval platforms burn money trying to convert the wrong customers. Pre-PMF startups get confused about priorities. Engineers optimize metrics instead of building something users love.
The irony is that eval companies would be better off being honest about this timing. Instead of trying to sell to everyone, they should tell pre-PMF companies to wait. Slack famously told small teams to use IRC first. Snowflake told companies without data teams to start simpler. Both became category leaders partly because they avoided bad-fit customers who would have diluted their focus.
So what did I tell my CAIO friend? "Don't even worry about talking to users right now. Build something magical and see if users go crazy using it."
This sounds like heresy in our customer-development-obsessed world. But early-stage AI products are different. The best ones create new behaviors rather than optimizing existing ones. Instagram didn't survey users about photo filters; they built something delightful and watched people share millions of filtered photos. OpenAI didn't A/B test ChatGPT's conversation interface; they shipped it and watched the world change how it thinks about AI.
When you're building something truly new, user interviews often mislead more than they help. Users can't articulate needs they don't know they have. They'll tell you they want a faster horse when you're trying to build a car. Early on, metrics and interviews alike will mislead you. People can't tell you what they'll love until they feel it.
Build something that makes you say "holy shit, this is incredible." If you're not excited by your own product, users won't be either. Ship it, and if users don't immediately start doing unexpected things with it, you probably don't have magic yet. Your job is to build until you surprise yourself.
Once you see that organic usage explosion, that's when you need evals. That's when you need to systematically improve without breaking the magic. But not before.
I'm long on evals, but at the right stage. They're essential infrastructure for scaling AI products, not for finding them. The companies that try to eval their way to product-market fit will optimize themselves into irrelevance. The companies that wait until they have something worth optimizing will build the evaluation systems that actually matter.
Because before product-market fit, evals are motion, not progress.
