From Responses To Trajectories: Multi-Turn and Multi-Environment Reinforcement Learning - Kashif Rasul & Sergio Paniego Blanco, Hugging Face
Post-training of LLMs with reinforcement learning is increasingly moving beyond static prompt–response pairs and preference optimization methods such as DPO, toward trajectory-based optimization. This talk focuses on the latest advances in multi-turn and multi-environment GRPO training, enabling LLMs to learn from interactive, agent-like experiences, including interacting with simulated environments, using tools, or completing multi-step reasoning tasks.
We highlight how TRL, a PyTorch-native post-training framework, supports these workflows at scale. Multi-turn, multi-environment training can draw on simulated environments (e.g., coding sandboxes, terminals, browsers), such as those provided by OpenEnv, while GRPO can also be applied to datasets for training LLMs on tool use or multi-step reasoning. Attendees will learn design patterns for rollout handling, trajectory batching, and advantage computation, and see how robust multi-turn, multi-environment post-training can improve alignment, reasoning, and generalization in LLMs for agentic applications.
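To make "advantage computation" concrete, here is a minimal sketch of the group-relative advantage idea behind GRPO, extended to multi-turn trajectories. It is not TRL's API: the function names, the `eps` value, the use of population standard deviation, and the masking convention are all assumptions made for illustration. Each trajectory gets one scalar reward, the reward is normalized against the other rollouts of the same prompt, and the resulting advantage is applied only to tokens the model itself generated.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each trajectory reward against its own rollout group.

    Assumption: population standard deviation and a small eps for stability;
    real implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(advantage, completion_mask):
    """Broadcast one trajectory-level advantage over a token sequence.

    completion_mask (hypothetical convention): 1 for tokens generated by the
    model, 0 for tokens injected by the environment or a tool response.
    """
    return [advantage * m for m in completion_mask]


# Four rollouts of the same prompt, one scalar reward per full trajectory.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Hypothetical 5-token trajectory: tokens at index 2 and 3 came from a tool.
mask = [1, 1, 0, 0, 1]
per_token = token_advantages(advantages[0], mask)
```

Masking environment and tool tokens keeps the policy gradient from crediting or penalizing the model for text it did not produce, which is one reason trajectory batching in multi-turn setups has to carry a mask alongside the token ids.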