Welcome to Our Data Benchmark, Where Everything's Made Up and the Points Don't Matter

Izzy Miller

AI Engineer | Hex

Analytics & Data Sci

SPIDER, DSBench, and every other analytics benchmark treat data work like a pub quiz: here's a question, here's the answer, did you match? But real analytics is arguing about whether "revenue" means bookings or collections, discovering that Stripe amounts are in cents while your platform stores dollars, and figuring out why the numbers don't tie to last quarter's deck. Current benchmarks can't express any of that; they just check whether you got 47.3. Worse, smarter models keep making the same mistakes. Opus is clearly more intelligent than Sonnet, but it falls into the same traps: it takes the path of least resistance, accepts the first answer, and doesn't ask clarifying questions. We'll show specific examples where industry-standard benchmarks fail (including our own) and share some ideas for evals that test what analysts actually do: learn a messy warehouse over time rather than answer a frozen question on day zero.

Izzy leads AI research at Hex, with a special focus on evaluation and experimentation. In previous lives, he was a botanist, rock climber, circus clown, and various other economically infeasible roles that he's keen to return to once AI takes his current job.

2026 Talks

Welcome to Our Data Benchmark, Where Everything's Made Up and the Points Don't Matter

The AI Conference for Humans Who Ship