Large language models are increasingly powerful but remain bottlenecked by memory, both for storing weights and for the KV cache that grows with context length. Reducing this footprint unlocks faster inference and makes deployment practical at scale. This talk traces the evolution of quantization and pruning methods from early network compression to today's frontier techniques, highlighting recurring challenges such as outliers, the quantization gap, and the tension between algorithmic compression and real-world speedups. It also covers recent work on KV cache compression: what everyone building and using AI needs to know about making models fit and run.
Omead Pooladzandi is Co-Founder and Co-Head of Research at PrismML, a Pasadena-based AI lab building ultra-dense intelligence for the edge. He earned his Ph.D. in Electrical and Computer Engineering from UCLA, where his research focused on curvature-informed optimization and generative methods for efficient deep learning. His open-source work centers on PSGD, a preconditioned optimizer framework spanning first-order and second-order curvature-informed methods used to train models at scale. His research has been published at ICML, NeurIPS, ICLR, and CVPR. At PrismML, he co-leads the development of radically compressed architectures, including the lab's open-sourced 1-bit models, which run high-fidelity AI on the edge with less memory than your Spotify cache.