If You Can't Measure It, You Can't Ship It - How to Build AI Evals That Actually Work

AI Engineering

The biggest barrier to AI ROI is developers' inability to define what "good" means for their use case.

After a decade building ML infrastructure at Meta , I co-founded Fireworks AI. We now serve 13 trillion tokens daily for thousands of companies like Cursor, Perplexity, Vercel, Uber, and Notion, helping them customize open models for production. Across all these customers, one pattern has become clear: the teams that can build rigorous evaluations are the only ones shipping successful AI systems. Everyone else is just throwing spaghetti at the wall.

Too many teams are shipping AI products entirely based on intuition: manually testing a few examples and hoping for the best. This works early on, but when you venture into specialized domains like contract analysis or insurance formulary navigation, untested edge cases can make applications completely unusable. Models might hallucinate their way to the right answer half the time, but in scrutinized industries, that batting average can end careers.

What separates good evals from bad: I’ll walk through anti-patterns we see repeatedly, like evals that are too easy or that test capabilities instead of requirements. I’ll also cover how to build safety tests that automatically fail when an agent leaks system prompts or attempts destructive queries, alongside functional tests that validate outputs against your requirements..

The Evaluation-Driven Development (EDD) to building AI systems: Code is deterministic, but AI is probabilistic. My EDD approach adapts test-driven development's red-green-refactor cycle for probabilistic systems. I’ll break down how this approach works: developers write a failing evaluation first, make the minimal change to pass it, and refactor while maintaining coverage.

How to translate vague business requirements into measurable assertions: We’ll walk through how to use Eval Protocol, an open-source testing framework that makes AI evaluations work like unit tests, integrating with pytest and CI/CD pipelines. I’ll share examples of how to handle scoring probabilistic outputs, and the tradeoffs between single-turn and multi-turn agent evaluations.

The audience will leave with a practical framework for building evaluation infrastructure that catches problems before they ship and the discipline to never change an AI system without a failing evaluation that demands it.

Co-founder

Benny Chen

Fireworks AI

Benny Chen is a co-founder of global AI inference cloud and infrastructure platform Fireworks AI, which enables teams like Cursor, Uber, Doordash and Shopify to build, tune, and scale highly-optimized generative AI applications. Prior to founding Fireworks, Benny spent a decade building ML infrastructure at Meta.

2026 Talks

If You Can't Measure It, You Can't Ship It - How to Build AI Evals That Actually Work

AI Engineering

The AI Conference for Humans Who Ship