2026 Talks
No Dropped Frames: Designing a VLM around a Latency Budget
Video will be populated after the conference
- Inference Systems
Moondream is a vision-language model (VLM) that runs in real time on video streams. This talk covers the model-side work behind it.
I'll start with architecture: upcycling from dense to MoE, and the tradeoffs when you're optimizing for latency rather than just parameter count. Then tokenization: why we built a custom SuperBPE tokenizer and what it bought us. The goal throughout was to avoid modeling decisions that would hurt us at inference time.
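To make the upcycling idea concrete, here is a minimal sketch of sparse upcycling: each expert in the new MoE layer starts as a copy of the dense FFN's weights, so the MoE begins from the dense model's solution rather than from scratch. This is a generic illustration of the technique, not Moondream's actual code; the function name and the optional symmetry-breaking noise are assumptions.

```python
import numpy as np

def upcycle_dense_ffn(w_dense: np.ndarray, num_experts: int,
                      noise_std: float = 0.0) -> np.ndarray:
    """Upcycle a dense FFN weight matrix into `num_experts` expert copies.

    Each expert is initialized as a copy of the dense weights, optionally
    perturbed with small Gaussian noise so the experts can diverge during
    continued training. Returns an array of shape (num_experts, *w_dense.shape).
    """
    rng = np.random.default_rng(0)
    experts = []
    for _ in range(num_experts):
        w = w_dense.copy()
        if noise_std > 0:
            w += rng.normal(0.0, noise_std, size=w.shape)
        experts.append(w)
    return np.stack(experts)

# Example: seed a 4-expert MoE layer from a dense 8x32 projection.
w = np.arange(8 * 32, dtype=np.float64).reshape(8, 32)
experts = upcycle_dense_ffn(w, num_experts=4)
assert experts.shape == (4, 8, 32)
assert np.allclose(experts[0], w)  # each expert starts as an exact copy
```

Note that each expert now sees only a fraction of the tokens (via the router), so per-token FLOPs stay flat while total parameters grow, which is the latency-relevant trade the talk discusses.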
I'll also cover training infrastructure. We wrote custom training engines and RL systems because existing open source projects were pushing us toward design decisions that didn't fit. I'll talk about where we diverged and what we got out of it.
Finally, inference. Real-time VLM serving isn't just a systems problem or a modeling problem; it's both at once. We built a custom inference engine alongside the model, and I'll cover how the two informed each other.
CTO
Vik Korrapati
Moondream AI
Vik Korrapati is the co-founder and CTO of Moondream, where he builds vision language models designed to run efficiently in real-world, latency-sensitive settings. Moondream's open VLMs bring visual understanding to devices and applications where larger models can't, from embedded systems to real-time video streams.
Before Moondream, Vik was a Senior Manager of Software Development at AWS. At Moondream, Vik leads the development of the model architecture, custom training infrastructure, and inference engine. His work focuses on co-designing models and systems so that efficiency isn't an afterthought: it's a first-class constraint from the start.
The AI Conference for Humans Who Ship
While other conferences theorize, AI Council features the engineers shipping tomorrow's breakthroughs today.