Targeted Exploration via Unified Entropy Control for Reinforcement Learning

📰 arXiv cs.AI

arXiv:2604.14646v1 (announce type: new)

Abstract: Recent advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs). However, the widely used Group Relative Policy Optimization (GRPO) consistently suffers from entropy collapse, causing the policy to converge prematurely and lose diversity. Existing exploration methods introduce additional bias or variance during exploration, making it difficult to maintain opti

Published 17 Apr 2026
