MC-CPO: Mastery-Conditioned Constrained Policy Optimization

arXiv cs.AI

arXiv:2604.04251v1

Abstract: Engagement-optimized adaptive tutoring systems may prioritize short-term behavioral signals over sustained learning outcomes, creating structural incentives for reward hacking in reinforcement learning policies. We formalize this challenge as a constrained Markov decision process (CMDP) with mastery-conditioned feasibility, in which pedagogical safety constraints dynamically restrict admissible actions according to learner mastery and prerequisite […]
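The mastery-conditioned feasibility described in the abstract can be pictured as an action mask over a prerequisite graph: an action (teaching a topic) is admissible only once the learner's mastery of its prerequisites is high enough. The sketch below is illustrative only; the function name, data layout, and threshold are assumptions, not the paper's actual formulation.

```python
# Illustrative sketch (not the paper's method): in a mastery-conditioned
# CMDP, the admissible-action set at each step contains only topics whose
# prerequisites the learner has mastered above a (hypothetical) threshold.

def admissible_actions(mastery, prereqs, threshold=0.8):
    """Return the set of topics the policy may select given current mastery.

    mastery:   dict topic -> estimated mastery in [0, 1] (assumed estimate)
    prereqs:   dict topic -> list of prerequisite topics
    threshold: hypothetical mastery level each prerequisite must reach
    """
    return {
        topic
        for topic in prereqs
        if all(mastery.get(p, 0.0) >= threshold for p in prereqs[topic])
    }


# Example: "algebra" stays inadmissible until "ratios" is mastered,
# so the policy cannot chase engagement by skipping ahead.
prereqs = {"fractions": [], "ratios": ["fractions"], "algebra": ["ratios"]}
mastery = {"fractions": 0.9, "ratios": 0.5}
print(admissible_actions(mastery, prereqs))
```

Under this reading, the constraint set shrinks or grows as mastery estimates change, which is what makes the feasibility "mastery-conditioned" rather than fixed in advance.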

Published 7 Apr 2026