Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
📰 ArXiv cs.AI
arXiv:2604.13010v1 Announce Type: cross Abstract: On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In pra
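The idea the abstract describes — scoring SFT rollouts with the teacher once, caching the per-token log-probabilities, and reusing them in place of a live teacher server — can be sketched in miniature. This is a toy illustration, not the paper's implementation: `TEACHER` and `STUDENT` are hypothetical next-token probability tables standing in for full language-model forward passes, and the per-token `student_logp - teacher_logp` objective is a simplified sampled surrogate for the distillation loss.

```python
import math

# Toy next-token distributions standing in for teacher and student models.
# In practice these would be forward passes of large LMs (assumption).
TEACHER = {"a": 0.7, "b": 0.2, "c": 0.1}
STUDENT = {"a": 0.5, "b": 0.3, "c": 0.2}

def precompute_teacher_logprobs(rollout):
    """One-time offline pass: score every rollout token with the teacher
    and cache the per-token log-probabilities."""
    return [math.log(TEACHER[tok]) for tok in rollout]

def distill_loss(rollout, cached_teacher_logps):
    """Per-token distillation signal on the sampled tokens, computed
    against the cached teacher scores instead of a live teacher."""
    total = 0.0
    for tok, t_logp in zip(rollout, cached_teacher_logps):
        s_logp = math.log(STUDENT[tok])
        total += s_logp - t_logp  # positive where student overshoots teacher
    return total / len(rollout)

rollout = ["a", "b", "a", "c"]
cache = precompute_teacher_logprobs(rollout)  # done once, offline
loss = distill_loss(rollout, cache)           # reused at every training step
```

The caching step replaces the live teacher inference server the abstract mentions: once `cache` is on disk, training only ever reads it back, which is the infrastructure saving the paper targets.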