Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

📰 ArXiv cs.AI

arXiv:2603.29252v1 Announce Type: cross

Abstract: Long video understanding is a key challenge that hinders the advancement of *Multimodal Large Language Models* (MLLMs). In this paper, we study this problem from the perspective of a visual memory mechanism and propose a novel, training-free approach termed *Flexible Memory* (**FlexMem**). In principle, FlexMem aims to mimic the human behavior of video watching, i.e., continually watching video content and recalling the most…
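To make the watch-then-recall idea concrete, here is a minimal, hypothetical sketch of a visual memory buffer: it is *not* the authors' FlexMem implementation (the abstract gives no algorithmic detail), only an illustration of the general pattern of storing per-frame features while "watching" and recalling the most relevant entries for a query. The class name, capacity policy, and cosine-similarity recall are all assumptions for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class VisualMemory:
    """Toy memory buffer: store frame features, recall the most relevant."""

    def __init__(self, capacity=8):
        self.capacity = capacity  # assumed fixed frame budget
        self.entries = []         # list of (frame_id, feature) pairs

    def watch(self, frame_id, feature):
        # Append a frame; evict the oldest entry when over budget.
        self.entries.append((frame_id, feature))
        if len(self.entries) > self.capacity:
            self.entries.pop(0)

    def recall(self, query, k=2):
        # Return ids of the k stored frames most similar to the query.
        ranked = sorted(self.entries,
                        key=lambda e: cosine(e[1], query),
                        reverse=True)
        return [fid for fid, _ in ranked[:k]]

mem = VisualMemory(capacity=4)
for i, feat in enumerate([[1, 0], [0, 1], [1, 1], [0.9, 0.1]]):
    mem.watch(i, feat)
print(mem.recall([1, 0], k=2))  # → [0, 3]
```

A real MLLM memory would store high-dimensional visual-encoder features and recall conditioned on the language query, but the store/evict/recall loop follows the same shape.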

Published 1 Apr 2026