InstrAct: Towards Action-Centric Understanding in Instructional Videos

📰 ArXiv cs.AI

arXiv:2604.08762v1 Announce Type: cross

Abstract: Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, both of which remain challenging for current Video Foundation Models (VFMs). This difficulty stems from noisy web supervision and a pervasive "static bias", where models rely on objects rather than motion cues. To address this, we propose InstrAction, a pretraining framework for learning action-centric representations of instructional videos. We firs

Published 13 Apr 2026