Can LLMs Learn to Reason Robustly under Noisy Supervision?
📰 ArXiv cs.AI
arXiv:2604.03993v1 Announce Type: cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains reasoning models, but it relies on abundant perfect labels, and its vulnerability to the unavoidable noisy labels that arise from expert scarcity remains critically underexplored. In this work, we take the first step toward a systematic analysis of noisy-label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label's in…
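To make the abstract's setup concrete, here is a minimal sketch of how an RLVR-style binary verifiable reward interacts with label noise. Everything here is an illustrative assumption, not the paper's method: `verifiable_reward` is a simple exact-match checker, and `noisy_label` is a hypothetical noise model where the reference label is corrupted with some probability (standing in for scarce expert supervision).

```python
import random


def verifiable_reward(answer: str, label: str) -> float:
    """RLVR-style binary reward: 1.0 iff the answer matches the reference label."""
    return 1.0 if answer.strip() == label.strip() else 0.0


def noisy_label(true_label: str, wrong_labels: list[str],
                noise_rate: float, rng: random.Random) -> str:
    """Hypothetical noise model: with probability `noise_rate`, replace the
    true label with an incorrect one, simulating imperfect expert labels."""
    if rng.random() < noise_rate:
        return rng.choice(wrong_labels)
    return true_label


rng = random.Random(0)
# Under label noise, even a correct answer can be scored zero,
# which is the failure mode the abstract highlights.
label = noisy_label("42", ["41", "43"], noise_rate=0.3, rng=rng)
reward = verifiable_reward("42", label)
```

The point of the sketch is that the reward signal inherits whatever noise the labels carry: a fraction of correct rollouts receive zero reward, biasing the policy gradient.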