VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought

📰 ArXiv cs.AI

arXiv:2604.21396v1 Announce Type: cross

Abstract: The advancement of Large Vision-Language Models (LVLMs) requires precise region-level reasoning that faithfully grounds the model's logic in actual visual evidence. However, existing datasets are limited in scalability by extensive manual annotation, and they lack explicit alignment between multi-step reasoning and the corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we pr…

Published 25 Apr 2026