PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

📰 ArXiv cs.AI

arXiv:2604.08991v1 Announce Type: cross

Abstract: Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position

Published 13 Apr 2026