Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward
📰 ArXiv cs.AI
arXiv:2604.09748v1 Announce Type: cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is an emerging paradigm that significantly boosts a Large Language Model's (LLM's) reasoning abilities on complex logical tasks, such as mathematics and programming. However, we identify, for the first time, a latent vulnerability to backdoor attacks within the RLVR framework. This attack can implant a backdoor without modifying the reward verifier by injecting a small amount of poisoning data
DeepCamp AI