Training sparse Mixture-of-Experts models at scale is notoriously unstable. Experts collapse, routers drift, and loss spikes appear out of nowhere. This talk covers how we built Trinity Large, a 400B-parameter MoE (13B active) trained on 17 trillion tokens with zero loss spikes.
We'll walk through the decisions that actually mattered: why we replaced standard aux-loss-free balancing with a momentum-based approach (SMEBU), how interleaved local/global attention made context extension surprisingly smooth, and what broke when we first tried running Muon at scale.
I'll also cover the less glamorous stuff: our Random Sequential Document Buffer to reduce batch heterogeneity, recovering from B300 GPU faults on brand-new hardware, and the six changes we shipped at once when routing started collapsing mid-run.
You'll leave with practical lessons for training your own MoEs or scaling up sparse architectures.