Everyone knows bigger models reason better. What’s less obvious is that they often behave worse, especially when tools are involved.
In this talk, a practical application of research on task fidelity, we'll show how a 4B model was fine-tuned to beat a 235B model on real financial analysis tasks – not by adding more reasoning, but by enforcing tool discipline. Using reinforcement learning with the open-source rLLM framework, the model learned to explore schemas, validate outputs, and retry failures instead of hallucinating confident nonsense.
The key surprise: training on simple tool interactions transferred cleanly to much harder, multi-step problems. If you're building LLM systems that touch databases, APIs, or internal tools, this talk focuses on the behaviors that actually matter, and how to teach them without frontier-scale compute.
Bio Coming Soon!