HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

📰 ArXiv cs.AI

arXiv:2601.14724v3 Announce Type: replace-cross Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvements in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture…
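The abstract names the core idea, a KV cache treated as hierarchical memory so that per-frame memory stays bounded as the stream grows, but does not detail the mechanism. The sketch below is a hypothetical illustration of that general pattern, not the HERMES algorithm: a two-tier cache that keeps recent key/value entries in a fixed-size "hot" window and mean-pools evicted entries into coarse "cold" summary blocks. All names (`TwoTierKVCache`, `hot_size`, `block_size`) are invented for illustration.

```python
from collections import deque

class TwoTierKVCache:
    """Illustrative two-tier KV cache for streaming inputs.

    Recent (key, value) entries stay in a fixed-size "hot" window;
    entries that fall out of the window are mean-pooled into coarse
    "cold" summary blocks, so total memory stays bounded as the
    stream grows. (Hypothetical sketch; not the HERMES method.)
    """

    def __init__(self, hot_size=4, block_size=2):
        self.hot = deque()    # most recent (key, value) pairs
        self.cold = []        # pooled summaries of evicted pairs
        self.hot_size = hot_size
        self.block_size = block_size
        self._staging = []    # evicted pairs awaiting pooling

    def append(self, key, value):
        self.hot.append((key, value))
        while len(self.hot) > self.hot_size:
            self._staging.append(self.hot.popleft())
            if len(self._staging) == self.block_size:
                ks, vs = zip(*self._staging)
                # Mean-pool a full block of evicted entries into one summary.
                self.cold.append((sum(ks) / len(ks), sum(vs) / len(vs)))
                self._staging.clear()

    def memory_entries(self):
        # Total retained entries: hot window + cold summaries + staging.
        return len(self.hot) + len(self.cold) + len(self._staging)

# Simulate a stream of 10 frames with scalar stand-ins for keys/values.
cache = TwoTierKVCache(hot_size=4, block_size=2)
for t in range(10):
    cache.append(float(t), float(t))
```

With a hot window of 4 and blocks of 2, the 10-step stream retains only 7 entries (4 hot pairs plus 3 pooled summaries) instead of 10, and the bound holds no matter how long the stream runs, which is the memory-vs-fidelity trade-off a hierarchical KV cache targets.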

Published 17 Apr 2026