Technical Talks

Lucas Atkins
CTO | Arcee.ai

Trinity: Training a 400B MoE from Scratch Without Losing Your Mind

  • Model Systems

Training sparse Mixture-of-Experts models at scale is notoriously unstable. Experts collapse, routers drift, and loss spikes appear out of nowhere. This talk covers how we built Trinity Large, a 400B parameter MoE (13B active), trained on 17 trillion tokens with zero loss spikes.

We'll walk through the decisions that actually mattered: why we replaced standard aux-loss-free balancing with a momentum-based approach (SMEBU), how interleaved local/global attention made context extension surprisingly smooth, and what broke when we first tried running Muon at scale.
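
The abstract only names SMEBU; the actual formulation isn't described here. As a rough illustration of the general idea behind bias-based (aux-loss-free) expert balancing with a momentum-smoothed load estimate, here is a minimal sketch. All function names, shapes, and hyperparameters below are illustrative assumptions, not Arcee's implementation.

```python
import torch

def route_with_bias(logits, expert_bias, k=2):
    """Top-k routing where a per-expert bias steers selection toward
    under-used experts (aux-loss-free balancing). Gate weights still come
    from the unbiased logits, so the bias only affects which experts fire."""
    biased = logits + expert_bias                 # [tokens, n_experts]
    topk = biased.topk(k, dim=-1).indices         # chosen experts per token
    gates = torch.softmax(logits, dim=-1).gather(-1, topk)
    return topk, gates

def update_expert_bias(expert_bias, load_ema, topk, n_experts,
                       momentum=0.99, step_size=1e-3):
    """Hypothetical momentum-smoothed bias update: keep an EMA of each
    expert's observed load and nudge over-loaded experts' biases down and
    under-loaded experts' biases up. (SMEBU's real update rule is not given
    in the abstract; this only sketches the concept.)"""
    counts = torch.bincount(topk.flatten(), minlength=n_experts).float()
    load = counts / counts.sum()                  # observed load share
    load_ema = momentum * load_ema + (1 - momentum) * load
    target = torch.full_like(load_ema, 1.0 / n_experts)
    expert_bias = expert_bias - step_size * torch.sign(load_ema - target)
    return expert_bias, load_ema
```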

I'll also cover the less glamorous stuff: our Random Sequential Document Buffer to reduce batch heterogeneity, recovering from B300 GPU faults on brand-new hardware, and the six changes we shipped at once when routing started collapsing mid-run.
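
The Random Sequential Document Buffer is likewise only named in the abstract. One plausible shape for such a buffer, sketched below purely for illustration: hold a pool of partially consumed documents, read each one sequentially, but randomize which document feeds the next training sequence so consecutive sequences in a batch come from different sources. The class name, pool size, and interface are assumptions, not the talk's actual design.

```python
import random
from collections import deque

class RandomSequentialDocumentBuffer:
    """Illustrative sketch: a pool of partially consumed documents.
    Each training sequence is filled by picking a random document from the
    pool and reading it sequentially, which mixes sources within a batch
    and reduces batch heterogeneity from long runs of a single document."""

    def __init__(self, doc_stream, pool_size=64, seq_len=4096, seed=0):
        self.doc_stream = doc_stream    # iterator yielding token lists
        self.pool = []                  # partially consumed docs (deques)
        self.pool_size = pool_size
        self.seq_len = seq_len
        self.rng = random.Random(seed)

    def _refill(self):
        # Top the pool back up from the document stream.
        while len(self.pool) < self.pool_size:
            try:
                self.pool.append(deque(next(self.doc_stream)))
            except StopIteration:
                break

    def next_sequence(self):
        self._refill()
        if not self.pool:
            return None
        tokens = []
        while len(tokens) < self.seq_len and self.pool:
            i = self.rng.randrange(len(self.pool))   # random document ...
            doc = self.pool[i]
            while doc and len(tokens) < self.seq_len:
                tokens.append(doc.popleft())         # ... read sequentially
            if not doc:                              # document exhausted
                self.pool.pop(i)
                self._refill()
        return tokens
```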

Practical lessons for teams training their own MoEs or scaling up sparse architectures.

Lucas Atkins

CTO

Arcee.ai

Lucas Atkins serves as CTO and Head of Research at Arcee AI, where he spearheaded the proprietary training stack behind Trinity, Arcee's family of open-weight foundation models. A machine learning veteran, Lucas engineered next-generation text-to-speech systems for automakers including Volkswagen, BMW, Ford, and Hyundai, beginning well before the current generative AI surge. He has also led the training of specialized language translation systems for the UAE and partnered with AMD in 2023 and 2024 to optimize enterprise GPUs for large-scale model training. He now draws on that end-to-end experience to shape Arcee's in-house models, aiming for efficient, open-source and open-weight models that set the standard for American-made open intelligence.