Video streams combine vision, audio, time-series signals, and semantics at a scale and complexity that text alone never presents. At TwelveLabs, we’ve found that tackling this challenge doesn’t start with ever-bigger models; it starts with engineering the right context. In this session, we’ll walk engineers and infrastructure leads through how to build production-grade video AI by systematically designing what information the model receives and how that information is selected, compressed, and isolated. You’ll learn our four pillars of video context engineering (Write, Select, Compress, Isolate), see how our foundation models (Marengo and Pegasus) and our agent product (Jockey) apply them, and review real-world outcomes in media, public-safety, and advertising pipelines.

We’ll also dive into how to measure context effectiveness, including tokens per minute, retrieval hit rates, and versioned context pipelines, and how that insight drives improvements in cost, latency, and trust. If you’re deploying AI video solutions in the wild, you’ll leave with a blueprint for turning raw video into deployable insight: not by model size alone, but by targeted context engineering.
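To make those measurements concrete, here is a minimal sketch of two of the metrics named above, tokens per minute and retrieval hit rate. The data structure, field names, and example values are illustrative assumptions for this abstract, not the TwelveLabs pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class ContextPacket:
    """Hypothetical record of the context assembled for one video."""
    video_id: str
    duration_minutes: float
    token_count: int                    # tokens in the selected/compressed context
    retrieved_segments: list = field(default_factory=list)  # segment ids returned by retrieval
    relevant_segments: set = field(default_factory=set)     # segment ids marked relevant in an eval set

def tokens_per_minute(packet: ContextPacket) -> float:
    """Context cost normalized by video length."""
    return packet.token_count / max(packet.duration_minutes, 1e-6)

def retrieval_hit_rate(packet: ContextPacket, k: int = 10) -> float:
    """Fraction of the top-k retrieved segments that are actually relevant."""
    top_k = packet.retrieved_segments[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for seg in top_k if seg in packet.relevant_segments)
    return hits / len(top_k)

# Example: a 12-minute clip whose compressed context costs 3,600 tokens.
packet = ContextPacket(
    video_id="demo-001",
    duration_minutes=12.0,
    token_count=3600,
    retrieved_segments=["s3", "s7", "s1", "s9"],
    relevant_segments={"s3", "s1", "s4"},
)
print(tokens_per_minute(packet))   # 300.0 tokens per minute of video
print(retrieval_hit_rate(packet))  # 0.5
```

Tracking these numbers per pipeline version is one way to see whether a change to selection or compression is actually buying accuracy per token, rather than just shrinking or inflating the context.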
James Le leads Developer Experience at TwelveLabs, where the company builds video foundation models (Marengo for multimodal embeddings, Pegasus for video-to-text generation) that enable production-grade video understanding across autonomous systems, media workflows, and security applications. His work focuses on translating video AI research capabilities into deployment-ready architectures for organizations building vision-enabled intelligent systems.