Multiplayer Nash Preference Optimization

📰 ArXiv cs.AI

arXiv:2509.23102v3 | Announce type: replace

Abstract: Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models with human preferences. However, reward-based methods grounded in the Bradley-Terry assumption struggle to capture the nontransitivity and heterogeneity of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While […]
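The Bradley-Terry limitation mentioned in the abstract can be made concrete with a small sketch (the reward values below are illustrative, not from the paper): any scalar reward model forces pairwise preferences to be transitive, so it cannot represent cyclic preferences of the kind NLHF is designed to handle.

```python
import math

def bt_prob(r_i: float, r_j: float) -> float:
    # Bradley-Terry: P(i preferred over j) = sigmoid(r_i - r_j)
    return 1.0 / (1.0 + math.exp(-(r_i - r_j)))

# Hypothetical scalar rewards for three candidate responses.
rewards = {"a": 1.2, "b": 0.5, "c": -0.3}

# Under Bradley-Terry, preferences derived from scalar rewards are
# always transitive: a > b and b > c together imply a > c.
p_ab = bt_prob(rewards["a"], rewards["b"])
p_bc = bt_prob(rewards["b"], rewards["c"])
p_ac = bt_prob(rewards["a"], rewards["c"])
assert p_ab > 0.5 and p_bc > 0.5 and p_ac > 0.5

# A nontransitive ("rock-paper-scissors") preference table:
# a beats b, b beats c, yet c beats a, each 70% of the time.
cyclic = {("a", "b"): 0.7, ("b", "c"): 0.7, ("c", "a"): 0.7}

# No reward assignment can fit this table: P(a>b) > 0.5 and
# P(b>c) > 0.5 require r_a > r_b > r_c, which forces P(a>c) > 0.5,
# contradicting the observed P(c>a) = 0.7 (i.e. P(a>c) = 0.3).
```

This is the failure mode that motivates moving from a scalar reward model to a game-theoretic formulation over a general preference function.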

Published 8 Apr 2026