From Responses To Trajectories: Multi-Turn and Multi-Environ... Kashif Rasul & Sergio Paniego Blanco

PyTorch · Advanced · 🤖 AI Agents & Automation · 1mo ago
From Responses To Trajectories: Multi-Turn and Multi-Environment Reinforcement Learning - Kashif Rasul & Sergio Paniego Blanco, Hugging Face

Post-training of LLMs with reinforcement learning is increasingly moving beyond static prompt–response pairs and preference optimization methods such as DPO, toward trajectory-based optimization. This talk covers recent advances in multi-turn and multi-environment GRPO training, which let LLMs learn from interactive, agent-like experience: interacting with simulated environments, using tools, and completing multi-step reasoning tasks. We highlight how TRL, a PyTorch-native post-training framework, supports these workflows at scale. Multi-turn, multi-environment training can draw on simulated environments (e.g., coding, terminals, browsers), such as those provided by OpenEnv, while GRPO can also be applied to datasets for training LLMs on tool use or multi-step reasoning. Attendees will gain insight into design patterns, rollout handling, trajectory batching, and advantage computation, and will see how robust multi-turn, multi-environment post-training can improve alignment, reasoning, and generalization in LLMs for agentic applications.
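To make the "advantage computation" item concrete, here is a minimal sketch of a group-relative advantage in the style GRPO uses, applied at the trajectory level. It is not taken from the talk or from TRL's code: the function names, the choice to score a multi-turn trajectory by the sum of its per-turn rewards, the use of the population standard deviation, and the `eps` value are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def trajectory_return(turn_rewards):
    """Collapse one multi-turn rollout into a single scalar reward.

    Assumption: the trajectory is scored by the plain sum of its
    per-turn rewards; a real setup may use only a final outcome reward.
    """
    return sum(turn_rewards)


def group_relative_advantages(returns, eps=1e-4):
    """Normalize each return against the other rollouts of the same prompt.

    GRPO-style: advantage = (return - group mean) / (group std + eps).
    Assumption: population std and eps=1e-4 are illustrative choices.
    """
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


# Hypothetical group: four rollouts sampled for the same prompt,
# each with a different number of turns and per-turn rewards.
group = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0],
    [0.0, 0.5, 0.5, 1.0],
    [0.0],
]

returns = [trajectory_return(t) for t in group]
advantages = group_relative_advantages(returns)

for ret, adv in zip(returns, advantages):
    print(f"return={ret:.2f} advantage={adv:+.3f}")
```

Because each advantage is measured against the group mean, rollouts that beat their siblings get a positive signal and the rest a negative one, with no separate value model. In a multi-turn setting that single scalar would typically be broadcast to every model-generated token in the trajectory, which is again an assumption of this sketch.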