Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents
📰 ArXiv cs.AI
arXiv:2604.10674v1 Announce Type: cross
Abstract: Reinforcement learning (RL) has been widely used to train LLM agents for multi-turn interactive tasks, but its sample efficiency is severely limited by sparse rewards and long horizons. On-policy self-distillation (OPSD) alleviates this by providing dense token-level supervision from a privileged teacher that has access to ground-truth answers. However, such fixed privileged information cannot capture the diverse valid strategies in agent tasks, …
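The dense supervision the abstract describes can be made concrete with a small sketch: a per-token KL divergence between a privileged teacher (conditioned on ground-truth answers) and the student policy, averaged over the agent's own tokens. This is only an illustrative reading of "dense token-level supervision", not the paper's implementation; all names, shapes, and the masking scheme below are assumptions.

```python
# Hypothetical sketch of token-level distillation from a privileged teacher.
# Shapes, names, and masking are assumptions, not the paper's method.
import torch
import torch.nn.functional as F

def token_level_distillation_loss(
    student_logits: torch.Tensor,  # (batch, seq_len, vocab) from the student policy
    teacher_logits: torch.Tensor,  # (batch, seq_len, vocab) from the privileged teacher
    mask: torch.Tensor,            # (batch, seq_len); 1 on agent-generated tokens
) -> torch.Tensor:
    """KL(teacher || student) per token, averaged over unmasked positions."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Per-token KL divergence: teacher probabilities weight the log-ratio.
    kl = (teacher_logp.exp() * (teacher_logp - student_logp)).sum(dim=-1)
    return (kl * mask).sum() / mask.sum().clamp(min=1)

# Toy usage with random logits standing in for model outputs.
b, t, v = 2, 8, 100
student = torch.randn(b, t, v, requires_grad=True)
teacher = torch.randn(b, t, v)
loss = token_level_distillation_loss(student, teacher, torch.ones(b, t))
loss.backward()  # gradients flow only into the student logits
```

Unlike a sparse end-of-episode reward, this signal assigns a gradient to every generated token, which is the sample-efficiency advantage OPSD claims over plain RL.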