Visually-Guided Policy Optimization for Multimodal Reasoning

📰 ArXiv cs.AI

arXiv:2604.09349v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherently text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation on visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap…
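The "temporal visual forgetting" the abstract describes can be quantified by tracking how much attention mass each reasoning step places on the image tokens in the context. The sketch below is a hypothetical illustration (the attention rows and token layout are invented, not from the paper): the fraction of attention on visual tokens decays as decoding proceeds.

```python
import numpy as np

def visual_attention_mass(attn, visual_idx):
    """Fraction of one decoding step's attention that lands on visual tokens.

    attn: 1-D attention distribution over the context (sums to 1).
    visual_idx: positions of the visual tokens in that context.
    """
    attn = np.asarray(attn, dtype=float)
    return float(attn[visual_idx].sum())

# Hypothetical per-step attention rows: positions 0-3 are image tokens,
# positions 4-7 are text tokens.
visual_idx = [0, 1, 2, 3]
steps = [
    [0.15, 0.15, 0.10, 0.10, 0.15, 0.15, 0.10, 0.10],  # early step: 50% visual
    [0.08, 0.07, 0.08, 0.07, 0.20, 0.20, 0.15, 0.15],  # mid step:   30% visual
    [0.03, 0.03, 0.02, 0.02, 0.25, 0.25, 0.20, 0.20],  # late step:  10% visual
]
masses = [visual_attention_mass(a, visual_idx) for a in steps]
print([round(m, 2) for m in masses])  # → [0.5, 0.3, 0.1]
```

A decaying curve like this is the kind of signal a visually-guided RLVR objective could reward against, keeping attention anchored to the image throughout the reasoning trace.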

Published 13 Apr 2026