Technical Talks
Trinity: Training a 400B MoE from Scratch Without Losing Your Mind
- Model Systems
Training sparse Mixture-of-Experts models at scale is notoriously unstable. Experts collapse, routers drift, and loss spikes appear out of nowhere. This talk covers how we built Trinity Large, a 400B parameter MoE (13B active), trained on 17 trillion tokens with zero loss spikes.
We'll walk through the decisions that actually mattered: why we replaced standard aux-loss-free balancing with a momentum-based approach (SMEBU), how interleaved local/global attention made context extension surprisingly smooth, and what broke when we first tried running Muon at scale.
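The abstract doesn't spell out SMEBU's update rule, so the snippet below is only a minimal sketch of the general idea as I read it: a bias-based router (in the spirit of aux-loss-free balancing) where each expert's selection bias is nudged by a momentum-smoothed load-imbalance signal rather than by the raw per-step load. The function names, hyperparameters, and exact update form are illustrative assumptions, not the talk's actual method.

```python
import torch

def momentum_bias_update(expert_load, bias, velocity, lr=1e-3, momentum=0.9):
    """Hypothetical balancing step: nudge per-expert router biases toward
    uniform load, smoothing the imbalance signal with momentum.

    expert_load: fraction of tokens routed to each expert this step, shape [E]
    bias:        per-expert logit bias added before top-k selection, shape [E]
    velocity:    running momentum buffer, shape [E]
    """
    target = 1.0 / expert_load.numel()            # uniform share per expert
    imbalance = expert_load - target              # > 0 means the expert is overloaded
    velocity = momentum * velocity + (1 - momentum) * imbalance
    bias = bias - lr * torch.sign(velocity)       # push tokens away from overloaded experts
    return bias, velocity

def route(logits, bias, k=2):
    """Bias affects which experts are *selected*, not how their outputs are mixed."""
    _, expert_ids = torch.topk(logits + bias, k, dim=-1)             # biased selection
    weights = torch.softmax(logits.gather(-1, expert_ids), dim=-1)   # unbiased mixing weights
    return expert_ids, weights
```

The point of smoothing the load signal, in this reading, is that a single noisy batch can't yank the biases around and destabilize routing.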
I'll also cover the less glamorous stuff: our Random Sequential Document Buffer to reduce batch heterogeneity, recovering from B300 GPU faults on brand-new hardware, and the six changes we shipped at once when routing started collapsing mid-run.
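The Random Sequential Document Buffer isn't described beyond its purpose, so here is a hypothetical sketch of one way such a buffer could work: a fixed-size pool that is refilled from the sequential document stream and drained by random draws, so documents landing in the same batch come from a wider window of the stream instead of one contiguous, often homogeneous, slice. All names and parameters below are made up for illustration.

```python
import random
from typing import Iterable, Iterator

def random_sequential_buffer(docs: Iterable[str], buffer_size: int = 1024,
                             seed: int = 0) -> Iterator[str]:
    """Sketch of a shuffle-style buffer: read documents sequentially into a
    bounded pool, emit them by random draw, and drain the remainder at the end."""
    rng = random.Random(seed)
    pool: list[str] = []
    for doc in docs:
        pool.append(doc)
        if len(pool) < buffer_size:
            continue                               # keep filling until the pool is full
        idx = rng.randrange(len(pool))             # pick a random document to emit
        pool[idx], pool[-1] = pool[-1], pool[idx]  # swap it to the end
        yield pool.pop()                           # emit it, keeping the pool bounded
    rng.shuffle(pool)                              # end of stream: drain what's left
    yield from pool
```

Reads stay sequential on the storage side; only the emission order within the pool is randomized, which is what spreads heterogeneous documents across batches.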
Practical lessons for teams training their own MoEs or scaling up sparse architectures.
Lucas Atkins
CTO, Arcee