Reinforcement learning systems often fail not because rewards are wrong, but because optimization pressure is unbounded. Policies exploit edge cases, drift over time, and converge to brittle strategies that look fine in training but break in deployment, especially in settings with bounded action spaces, safety requirements, resource budgets, and long-term consequences for users.
This talk focuses on controlling optimization directly: practical techniques for training RL agents that remain stable and predictable under hard constraints. Rather than modifying rewards, we explore structural and system-level approaches that shape behavior by construction.
Topics include:
Why reward penalties alone fail to enforce hard constraints under scale and distribution shift
Structural constraint mechanisms such as action masking, feasibility filters, and sandboxed execution (a minimal sketch follows the topic list)
How training inside hard boundaries changes policy behavior and improves long-horizon stability, including across retraining cycles
Detecting constraint violations and failure modes that do not appear in aggregate return metrics
Lessons from applying constrained RL in production-like systems, including failures only discovered after deployment and what ultimately stopped them
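To make the action-masking idea concrete, here is a minimal Python sketch; it is illustrative only, not code from the talk. It assumes a discrete action space, policy logits produced elsewhere, and an environment-supplied feasibility mask, and all function and variable names are hypothetical.

```python
import numpy as np

def mask_logits(logits, feasible):
    # Assign -inf to infeasible actions so softmax gives them zero probability.
    return np.where(feasible, logits, -np.inf)

def sample_feasible_action(logits, feasible, rng):
    # Softmax over feasible actions only, then sample.
    masked = mask_logits(logits, feasible)
    z = masked - masked[feasible].max()   # shift for numerical stability
    probs = np.exp(z)                     # exp(-inf) == 0.0 for infeasible entries
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Hypothetical example: 4 discrete actions, actions 1 and 3 currently infeasible.
rng = np.random.default_rng(0)
logits = np.array([0.2, 3.0, -1.0, 0.5])
feasible = np.array([True, False, True, False])
print(sample_feasible_action(logits, feasible, rng))  # only ever prints 0 or 2
```

The same mask can also be applied inside the training loss so the policy never places probability mass on infeasible actions; this is one way "training inside hard boundaries" can be realized, though the talk's exact formulation may differ.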
The goal is to share concrete algorithmic and system design strategies for deploying reinforcement learning in settings where constraint violations are not merely suboptimal but unacceptable.