The Future of Vision in ML | Merve Noyan | HF Podcast #1

Hugging Face · Beginner · 👁️ Computer Vision · 1mo ago

In this episode, we sit down with Merve to talk about where vision AI is heading: from early computer vision systems to modern multimodal models, world models, robotics, and open source AI. We discuss LLaVA, IDEFICS, Vision Transformers, CNNs, JEPA, V-JEPA, Genie 3, OpenClaw, IMCP, PaliGemma, ColPali, ColQwen, and why Hugging Face has become such a central part of the open ecosystem.

## Connect with Merve Noyan, the open-sourceress 👇

- X (twitter): https://x.com/mervenoyann
- LinkedIn: https://www.linkedin.com/in/merve-noyan-28b1a113a/
- Personal Site: https://merveenoyan.github.io/me/
- GitHub: https://github.com/merveenoyan

## Chapters

00:00 Intro: vision, Hugging Face, and the future of AI
00:31 Why vision feels different now
03:58 LLaVA, IDEFICS, and multimodal training
08:56 CNNs, ViTs, and older vision architectures
15:46 How vision models could reach everyday users
16:50 World models, JEPA, V-JEPA, Genie 3, and robotics
25:44 OpenClaw, IMCP, and agent safety
28:01 Small vision models, fine-tuning, and getting started
34:39 Why Hugging Face matters in open source AI
42:49 PaliGemma, ColPali, ColQwen, and vision retrieval
47:26 Before Hugging Face: how models were shared
49:48 Mentors, culture, and closing thoughts

If you enjoyed the episode, subscribe for more conversations about open models, multimodal systems, and the future of AI.

More on: Modern CV Models

YOLOE: Real-time Zero-shot Object Detection | Visual Prompting | Live Coding & Q&A (Mar 14th)

RF-DETR: How to Train SOTA for Object Detection on a Custom Dataset | Step-by-step guide

Build a Deep Facial Recognition App // Part 8 - Kivy Computer Vision App with OpenCV and Tensorflow

Nicholas Renotte

Deep Learning with PyTorch : Image Segmentation

Mesh Optimization Using FlexiCubes with NVIDIA Kaolin Library v0.15.0

NVIDIA Developer

Code Panoptic Image Segmentation w/ Vision Transformer & Mask2Former - A PyTorch tutorial

Related AI Lessons

Traffic Light Recognition (TLR) Architecture: 2D Bounding Box Detection

Learn to build a Traffic Light Recognition model using a Fully Convolutional Network and anchor-free approach

Medium · Machine Learning

2D Gaussian Splatting: when removing a dimension makes 3D better

Learn how 2D Gaussian Splatting improves 3D rendering by addressing surface failures

Mastering Digital Logic Counters with C++ OOP: A Hands-On Guide

Learn to implement digital logic counters using C++ and object-oriented programming (OOP) to track events and understand fundamental electronics and computing concepts

Dev.to · Abdullah Fiaz

How Computational Thinking Helped Me Structure My Deliverables

Learn how computational thinking helped structure deliveries in programming

Medium · Programming
