Technical Talks
No Dropped Frames: Designing a VLM around a Latency Budget
Missing value detected...
Video will be populated after the conference
- Model Systems
Moondream is a vision language model that runs in real time on video streams. This talk covers the model-side work behind it.
I'll start with architecture: upcycling from dense to MoE, and the tradeoffs when you're optimizing for latency rather than just parameter count. Then tokenization: why we built a custom SuperBPE tokenizer and what it bought us. The goal throughout was to avoid modeling decisions that would hurt us at inference time.
I'll also cover training infrastructure. We wrote custom training engines and RL systems because existing open source projects were pushing us toward design decisions that didn't fit. I'll talk about where we diverged and what we got out of it.
Finally, inference. Real-time VLM isn't just a serving problem or a modeling problem. We built a custom inference engine alongside the model, and I'll cover how the two informed each other.
CTO
Vik Korrapati
M87 Labs (Moondream)
Discover the data-driven foundations powering today's AI breakthroughs. Join leading minds as we explore both cutting-edge AI and the infrastructure behind it by subscribing to our newsletter today!