Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization

📰 arXiv cs.AI

arXiv:2604.10721v1 · Announce Type: cross

Abstract: Natural-language Guided Cross-view Geo-localization (NGCG) aims to retrieve geo-tagged satellite imagery using textual descriptions of ground scenes. While recent NGCG methods commonly rely on CLIP-style dual-encoder architectures, they often suffer from weak cross-modal generalization and require complex architectural designs. In contrast, Multimodal Large Language Models (MLLMs) offer powerful semantic reasoning capabilities but are not directl

Published 14 Apr 2026