InstrAct: Towards Action-Centric Understanding in Instructional Videos
📰 ArXiv cs.AI
arXiv:2604.08762v1 Announce Type: cross Abstract: Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, both of which remain challenging for current Video Foundation Models (VFMs). The difficulty stems from noisy web supervision and a pervasive "static bias", in which models rely on objects rather than motion cues. To address these issues, we propose InstrAction, a pretraining framework for learning action-centric representations of instructional videos. We firs