AMIGO: Agentic Multi-Image Grounding Oracle Benchmark

📰 ArXiv cs.AI

AMIGO is a benchmark for evaluating agentic vision-language models on long-horizon tasks that span multiple images.

Published 31 Mar 2026
Action Steps
  1. Design a benchmark with a long-horizon task that requires the model to identify a target image from a gallery of visually similar images
  2. Implement a sequence of attribute-focused questions to recover the target image
  3. Evaluate the model's performance using metrics such as accuracy and efficiency
  4. Compare the results against other models and analyze each model's strengths and weaknesses
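The steps above can be sketched as a simple identification loop. This is a minimal illustration, not the paper's actual protocol: the gallery items, attribute names, and oracle function below are all hypothetical stand-ins (the real benchmark operates on images, with the model generating its own questions), but the shape of the task — narrowing a gallery of similar candidates with attribute-focused questions, then scoring accuracy and efficiency — is the same.

```python
# Hypothetical sketch of an AMIGO-style identification loop.
# Gallery items are attribute dicts standing in for visually similar images.
GALLERY = [
    {"id": 0, "color": "red",  "shape": "round",  "size": "small"},
    {"id": 1, "color": "red",  "shape": "square", "size": "small"},
    {"id": 2, "color": "blue", "shape": "round",  "size": "large"},
    {"id": 3, "color": "red",  "shape": "round",  "size": "large"},
]

def answer(target, attribute):
    """Oracle: reveals one attribute of the hidden target image."""
    return target[attribute]

def identify(gallery, target, questions):
    """Ask attribute-focused questions until one candidate remains.

    Returns (predicted_id, steps_used).
    """
    candidates = list(gallery)
    steps = 0
    for attr in questions:
        if len(candidates) == 1:
            break  # target uniquely identified; stop asking
        value = answer(target, attr)
        candidates = [c for c in candidates if c[attr] == value]
        steps += 1
    return candidates[0]["id"], steps

target = GALLERY[3]
pred, steps = identify(GALLERY, target, ["color", "shape", "size"])
accuracy = int(pred == target["id"])  # did we recover the target?
efficiency = steps                    # fewer questions = more efficient
```

A stronger agent would choose each question to maximally split the remaining candidates, which is exactly the long-horizon planning behavior the benchmark is meant to probe.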
Who Needs to Know This

AI researchers and engineers working on vision-language models can use AMIGO to evaluate their models on complex multi-image tasks, and product managers can use it to assess the capabilities of AI models in real-world applications.

Key Insight

💡 AMIGO provides a challenging testbed for evaluating the capabilities of agentic vision-language models in real-world applications.
