Buy Early Bird Tickets for AI Council 2026!

Technical Talks

Rushabh Mehta
Rushabh Mehta
Tech Lead | Meta

Build Trustworthy LLM Apps Powered by Agentic evals

video
Missing value detected...
Video will be populated after the conference

  • Lightning Talks

Agentic AI has certainly secured mindspace of industry. As more & more people continue to play with LLMs, they begin developing an intuition aka vibes around whether LLM is performing the task at hand well or not. While usecases like math problem, classification are easier to verify; generative content creation etc are not. Adding agents to the mix makes things lot more complex. This talk will highlight the particular challenges one may face as they continue their journey towards productionizing their Agentic application: outputs are non deterministic, ground truth is hard to find, and what used to work before no longer does. Yet ""LGTM"" isn't a deployment strategy.

While typical LLM chatbot can be seen as a more sophisticated Google search, we're beginning to expect more from Agents: Don't just give an answer but actually perform the task. Security, bias, privacy etc suddenly become non-negotiable. Only after we handle these complexities, we can unblock real usecases in healthcare, finance, legal etc.

This talk tackles why agent evaluation is fundamentally harder than traditional ML testing: multi-step reasoning chains, tool use side effects & more. How to build evaluation datasets that actually reflect production scenarios, not just cherry-picked examples. We'll cover automated evaluation pipelines using LLM-as-judge patterns, and when you can not avoid human in the loop. The session addresses detecting regressions before users do: setting up continuous evaluation that catches model degradation. Tricky cases when agent aces public evals but fails in production, and how to build evaluations that predict real-world performance.

Rushabh Mehta

Tech Lead

Rushabh Mehta

Meta

Bio Coming Soon