Yulu Gan - FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos

Cohere · Advanced · Computer Vision · 2d ago


Chapters

00:00 Why Motion Matters
01:46 VLMs Fail at Movement
03:24 Benchmarks Reveal Gaps
04:55 What Data Is Missing
06:08 Naive Auto Captioning Fails
08:09 Detector Plus LLM Pipeline
10:40 Pipeline Steps and Tracking
12:56 QA Generation and Categories
13:38 Fine Tuning and Results
15:50 Examples and Demos
17:39 Limits and Future Directions
18:32 Implications and Tradeoffs
21:13 Audience Q and A
32:28 Chat Questions Answered
37:08 What I'm Working On Now
45:27 Wrap Up and Goodbye

Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability.

To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning.

Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.

Yulu Gan is a second-year CS PhD at MIT, studying AI and Science. Advised by Tomaso Po
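The abstract describes a pipeline that tracks objects, extracts their trajectories, and passes them to an LLM to produce captions and question-answer pairs. The sketch below illustrates only the middle step, under stated assumptions: it is not the authors' implementation. The `Detection` record, the function names, the `min_shift` threshold, and the prompt wording are all hypothetical, and the detector, tracker, and LLM call are left out entirely.

```python
from dataclasses import dataclass
from math import atan2, degrees, hypot


@dataclass
class Detection:
    """One tracked box in one frame (hypothetical output of a detector plus tracker)."""
    frame: int      # frame index in the video
    track_id: int   # identity assigned by the tracker
    label: str      # object class, e.g. "car"
    box: tuple      # (x_min, y_min, x_max, y_max) in pixels


def build_trajectories(detections):
    """Group detections by track and return box centers ordered by frame."""
    tracks = {}
    for det in sorted(detections, key=lambda d: d.frame):
        x0, y0, x1, y1 = det.box
        center = ((x0 + x1) / 2, (y0 + y1) / 2)
        tracks.setdefault((det.track_id, det.label), []).append((det.frame, center))
    return tracks


def describe_motion(points, min_shift=5.0):
    """Summarize a trajectory as a coarse direction plus net pixel displacement."""
    (_, (xa, ya)), (_, (xb, yb)) = points[0], points[-1]
    dx, dy = xb - xa, yb - ya
    dist = hypot(dx, dy)
    if dist < min_shift:  # assumed threshold for "did not really move"
        return "stationary", dist
    # Image y grows downward, so negate dy to get a conventional angle.
    angle = degrees(atan2(-dy, dx)) % 360
    names = ["right", "up", "left", "down"]
    return "moving " + names[int(((angle + 45) % 360) // 90)], dist


def build_prompt(tracks):
    """Turn trajectory summaries into text that could be sent to an LLM."""
    lines = ["Object trajectories (pixel coordinates):"]
    for (track_id, label), points in sorted(tracks.items()):
        motion, dist = describe_motion(points)
        first, last = points[0][0], points[-1][0]
        lines.append(f"- {label} #{track_id}: {motion}, {dist:.0f} px over frames {first}-{last}")
    lines.append("Write a motion caption and three question-answer pairs about these movements.")
    return "\n".join(lines)


# Toy input: a car sliding right and a ball rising between frames 0 and 5.
detections = [
    Detection(0, 1, "car", (0, 0, 10, 10)),
    Detection(5, 1, "car", (100, 0, 110, 10)),
    Detection(0, 2, "ball", (50, 50, 60, 60)),
    Detection(5, 2, "ball", (50, 10, 60, 20)),
]
tracks = build_trajectories(detections)
prompt = build_prompt(tracks)
print(prompt)
```

Summarizing each track by its net displacement keeps the prompt short, but it discards curved or back-and-forth motion. A fuller version would likely pass sampled waypoints or per-segment directions alongside the video frames.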
