No Dropped Frames: Designing a VLM around a Latency Budget

Inference Systems

Moondream is a vision language model that runs in real time on video streams. This talk covers the model-side work behind it.

I'll start with architecture: upcycling from dense to MoE, and the tradeoffs when you're optimizing for latency rather than just parameter count. Then tokenization: why we built a custom SuperBPE tokenizer and what it bought us. The goal throughout was to avoid modeling decisions that would hurt us at inference time.

I'll also cover training infrastructure. We wrote custom training engines and RL systems because existing open-source projects were pushing us toward design decisions that didn't fit. I'll talk about where we diverged and what we got out of it.

Finally, inference. Running a VLM in real time isn't purely a serving problem or purely a modeling problem. We built a custom inference engine alongside the model, and I'll cover how the two informed each other.

CTO

Vik Korrapati

Moondream AI

Vik Korrapati is the co-founder and CTO of Moondream, where he builds vision language models designed to run efficiently in real-world, latency-sensitive settings. Moondream's open VLMs bring visual understanding to devices and applications where larger models can't, from embedded systems to real-time video streams.

Before Moondream, Vik was a Senior Manager of Software Development at AWS. At Moondream, Vik leads the development of the model architecture, custom training infrastructure, and inference engine. His work focuses on co-designing models and systems so that efficiency isn't an afterthought: it's a first-class constraint from the start.

2026 Talks

The AI Conference for Humans Who Ship