Yulu Gan - FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos
00:00 Why Motion Matters
01:46 VLMs Fail at Movement
03:24 Benchmarks Reveal Gaps
04:55 What Data Is Missing
06:08 Naive Auto Captioning Fails
08:09 Detector Plus LLM Pipeline
10:40 Pipeline Steps and Tracking
12:56 QA Generation and Categories
13:38 Fine Tuning and Results
15:50 Examples and Demos
17:39 Limits and Future Directions
18:32 Implications and Tradeoffs
21:13 Audience Q and A
32:28 Chat Questions Answered
37:08 What I'm Working On Now
45:27 Wrap Up and Goodbye
Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.
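The step from tracked objects to question-answer pairs can be illustrated with a minimal sketch. This is not the FoundationMotion implementation: the `Track` structure, the function names, the 5-pixel threshold, and the prompt wording are all hypothetical, and the detector and tracker are assumed to have already produced per-frame bounding boxes. It only shows how box trajectories might be turned into a text prompt for an LLM and into a simple rule-based motion question.

```python
from dataclasses import dataclass


@dataclass
class Track:
    """Hypothetical tracker output: a label plus per-frame boxes (x1, y1, x2, y2) in pixels."""
    label: str
    boxes: list[tuple[float, float, float, float]]


def centers(track: Track) -> list[tuple[float, float]]:
    """Reduce each bounding box to its center point, giving a 2D trajectory."""
    return [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in track.boxes]


def dominant_direction(track: Track, min_shift: float = 5.0) -> str:
    """Classify net displacement between first and last frame (assumed threshold)."""
    pts = centers(track)
    dx = pts[-1][0] - pts[0][0]
    dy = pts[-1][1] - pts[0][1]
    if max(abs(dx), abs(dy)) < min_shift:
        return "stationary"
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    # Image coordinates: y grows downward.
    return "down" if dy > 0 else "up"


def make_qa(track: Track) -> dict:
    """Build one multiple-choice question whose answer comes from the trajectory."""
    return {
        "question": f"In which direction does the {track.label} mainly move?",
        "options": ["left", "right", "up", "down", "stationary"],
        "answer": dominant_direction(track),
    }


def trajectory_prompt(tracks: list[Track]) -> str:
    """Serialize trajectories as text that could accompany video frames in an LLM request."""
    lines = ["Object trajectories (box centers per sampled frame):"]
    for t in tracks:
        pts = ", ".join(f"({x:.0f}, {y:.0f})" for x, y in centers(t))
        lines.append(f"- {t.label}: {pts}")
    lines.append("Describe each object's motion and write question-answer pairs about it.")
    return "\n".join(lines)


if __name__ == "__main__":
    car = Track("car", [(0, 0, 10, 10), (20, 0, 30, 10), (40, 2, 50, 12)])
    print(make_qa(car))
    print(trajectory_prompt([car]))
```

Grounding the answer in the tracked coordinates, rather than asking a model to caption the raw video, is the design idea the abstract points to; the rule-based question here stands in for the richer captions and question types an LLM would generate.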
Yulu Gan is a second-year CS PhD at MIT, studying AI and Science. Advised by Tomaso Po