TIP: Token Importance in On-Policy Distillation
📰 ArXiv cs.AI
arXiv:2604.14084v1 Announce Type: cross Abstract: On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high t
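The abstract is cut off before the second criterion is fully stated, but the first criterion (positions where the student's predictive distribution has high entropy) can be sketched. The following is a minimal illustration, not the paper's method: `token_entropy` and `select_informative` are hypothetical helper names, and the top-fraction selection rule is an assumption for demonstration.

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy of the softmax distribution at each token position.

    logits: array of shape (seq_len, vocab_size).
    Returns: array of shape (seq_len,).
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def select_informative(student_logits: np.ndarray, frac: float = 0.25):
    """Mask the top `frac` of positions by student entropy.

    This selection rule is illustrative only; the paper's actual
    criterion also involves a second low-entropy region that the
    truncated abstract does not fully specify.
    """
    entropy = token_entropy(student_logits)
    k = max(1, int(frac * len(entropy)))
    top_idx = np.argsort(entropy)[-k:]
    mask = np.zeros(len(entropy), dtype=bool)
    mask[top_idx] = True
    return mask, entropy
```

In an OPD loop, such a mask would typically gate the per-token distillation loss so the teacher's supervision is applied only at the selected positions.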