From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

📰 ArXiv cs.AI

arXiv:2604.14142v1 Announce Type: cross

Abstract: While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution $P(y|x)$, its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution $P(y)$ in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static c
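The relationship between the two objectives the abstract contrasts can be made explicit with a standard identity (this is a textbook decomposition, not notation taken from the paper): the marginal is the conditional averaged over the prompt distribution,

$$
P(y) = \sum_{x} P(x)\, P(y \mid x) = \mathbb{E}_{x \sim P(x)}\big[P(y \mid x)\big].
$$

Under this reading, an objective on $P(y)$ is not anchored to any single prompt $x$, which is consistent with the abstract's claim that optimizing in the Pre-train Space preserves broader exploration capacity than prompt-conditioned RLVR.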

Published 16 Apr 2026