Technical Talks

No Dropped Frames: Designing a VLM around a Latency Budget

Vik Korrapati | CTO | M87 Labs (Moondream)

Moondream is a vision language model that runs in real time on video streams. This talk covers the model-side work behind it.

I'll start with architecture: upcycling from dense to MoE, and the tradeoffs when you're optimizing for latency rather than just parameter count. Then tokenization: why we built a custom SuperBPE tokenizer and what it bought us. The goal throughout was to avoid modeling decisions that would hurt us at inference time.

I'll also cover training infrastructure. We wrote custom training engines and RL systems because existing open-source projects were pushing us toward design decisions that didn't fit. I'll talk about where we diverged and what we got out of it.

Finally, inference. Building a real-time VLM isn't purely a serving problem or purely a modeling problem. We built a custom inference engine alongside the model, and I'll cover how the two informed each other.
