Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
📰 ArXiv cs.AI
VIGA is a framework that enables vision-language models to reconstruct images into editable programs through interleaved multimodal reasoning
Action Steps
- Define the problem of vision-as-inverse-graphics in vision-language models
- Introduce the VIGA framework, which combines symbolic logic and visual perception
- Implement interleaved multimodal reasoning to enable cross-verification between logic and perception
- Evaluate the performance of VIGA in reconstructing images into editable programs
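The cross-verification loop described above follows the classic analysis-by-synthesis pattern behind inverse graphics: propose a symbolic program, render it, compare the rendering to the target image, and keep only edits that improve the match. A minimal toy sketch of that loop (all names, the rectangle-based "program", and the pixel loss are illustrative assumptions, not the paper's actual implementation):

```python
import random

def render(program, size=8):
    """Rasterize a toy 'graphics program' (a list of filled rectangles
    (x, y, w, h)) onto a binary grid. Stands in for a real renderer."""
    canvas = [[0] * size for _ in range(size)]
    for (x, y, w, h) in program:
        for i in range(y, min(y + h, size)):
            for j in range(x, min(x + w, size)):
                canvas[i][j] = 1
    return canvas

def pixel_loss(a, b):
    """Count mismatched cells: the 'visual perception' side of the check."""
    return sum(ai != bi for ra, rb in zip(a, b) for ai, bi in zip(ra, rb))

def refine(target, program, steps=200, seed=0):
    """Analysis-by-synthesis loop: propose a small symbolic edit, render
    it, and accept it only if the rendering moves closer to the target."""
    rng = random.Random(seed)
    best = pixel_loss(render(program), target)
    for _ in range(steps):
        idx = rng.randrange(len(program))          # pick a shape to edit
        cand = list(program[idx])
        field = rng.randrange(4)                   # pick x, y, w, or h
        cand[field] = max(0, cand[field] + rng.choice([-1, 1]))
        trial = program[:idx] + [tuple(cand)] + program[idx + 1:]
        loss = pixel_loss(render(trial), target)
        if loss < best:                            # cross-verification step
            program, best = trial, loss
    return program, best

# Target image produced by a known program; start from a wrong guess.
target = render([(1, 1, 3, 3), (5, 4, 2, 3)])
guess = [(0, 0, 3, 3), (4, 4, 2, 2)]
prog, loss = refine(target, guess)
```

Because edits are accepted only when the rendered loss decreases, the returned program is never worse than the initial guess; a real system would replace the random proposals with model-driven symbolic reasoning.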
Who Needs to Know This
AI researchers and engineers working on vision-language models can benefit from VIGA, because it addresses the challenge of fine-grained spatial grounding in one-shot settings
Key Insight
💡 Interleaved multimodal reasoning can improve fine-grained spatial grounding in vision-language models
Share This
🔍 VIGA: A new framework for vision-language models to reconstruct images into editable programs via interleaved multimodal reasoning
DeepCamp AI