How Transformers Learn to Plan via Multi-Token Prediction

📰 arXiv cs.AI

arXiv:2604.11912v1 (cross-listed)

Abstract: While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recently emerged as a promising alternative, yet its underlying mechanisms remain poorly understood. In this paper, we study how MTP facilitates reasoning, with a focus on planning. Empirically, we show that MTP consistently outperforms NTP on both synthetic…
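To make the contrast concrete, here is a minimal sketch of an MTP training loss, assuming a common formulation in which k output heads each predict the token i+1 steps ahead; the paper's actual architecture and objective are not specified in this excerpt, and `TinyMTPModel`, `mtp_loss`, and all hyperparameters below are illustrative placeholders. NTP falls out as the special case of a single head.

```python
# Illustrative sketch only: a generic multi-token-prediction objective,
# NOT the architecture or loss from arXiv:2604.11912. A GRU stands in
# for the transformer backbone to keep the example self-contained.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMTPModel(nn.Module):
    """Toy backbone with num_future output heads (hypothetical example)."""

    def __init__(self, vocab_size: int, d_model: int = 64, num_future: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)
        # Head i is trained to predict the token (i + 1) steps ahead.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(num_future)]
        )

    def forward(self, tokens: torch.Tensor) -> list:
        h, _ = self.backbone(self.embed(tokens))  # (B, T, d_model)
        return [head(h) for head in self.heads]   # one logits tensor per head


def mtp_loss(logits_per_head, tokens):
    """Average cross-entropy over heads; head i targets tokens shifted by i+1."""
    losses = []
    for i, logits in enumerate(logits_per_head):
        shift = i + 1
        pred = logits[:, :-shift]   # drop positions with no valid target
        target = tokens[:, shift:]  # the token `shift` steps ahead
        losses.append(
            F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
        )
    return torch.stack(losses).mean()


if __name__ == "__main__":
    vocab, batch, seq_len = 100, 8, 32
    model = TinyMTPModel(vocab)
    tokens = torch.randint(0, vocab, (batch, seq_len))
    loss = mtp_loss(model(tokens), tokens)  # num_future=1 recovers plain NTP
    loss.backward()
    print(f"MTP loss: {loss.item():.3f}")
```

Under this formulation, the extra heads force the hidden state at each position to encode information about several upcoming tokens at once, which is one intuition for why MTP could help with planning-style tasks; whether the paper adopts this exact mechanism is not determinable from the abstract alone.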

Published 15 Apr 2026