Foundations

Computer Vision

Object detection, segmentation, YOLO, CLIP, and vision-language models

2,484 lessons

Skills in this topic

3 skills

Classify images with a pre-trained CNN

Modern CV Models

Run YOLO for real-time object detection

Build a Stable Diffusion inference pipeline

Videos 1,149 Reads 1,335

Masked depth modeling with sensor-validity masking: reports best RMSE on 7 of 8 masked/sparse depth benchmarks, plus a controlled encoder-init study [R]

Reddit r/MachineLearning 👁️ Computer Vision ⚡ AI Lesson 3d ago

Detect Objects Using Qwen: From Prompt to Bounding Box

Medium · LLM 👁️ Computer Vision ⚡ AI Lesson 3d ago

Learn to use Qwen VLM with a single line of Python

How AI Companies Use Image Annotation to Train Models

Medium · AI 👁️ Computer Vision ⚡ AI Lesson 3d ago

Artificial intelligence has become part of our daily lives in ways most people may not even realize. With applications in face…

Reddit r/deeplearning 👁️ Computer Vision ⚡ AI Lesson 3d ago

🔬 New paper: IMGNet — face verification through relational patterns, not absolute values.

Inspired by a linguistic observation: "matur suwun" (Javanese) and "hatur nuhun" (Sundanese) — two phrases from Indonesia that mean the same thing despite compl

WebGPU is not WebGL 3.0

Medium · JavaScript 👁️ Computer Vision ⚡ AI Lesson 3d ago

A practical introduction to WebGPU, WGSL, render pipelines, compute shaders, and the future of high-performance graphics on the web.

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

SiamixFormer: a fully-transformer Siamese network with temporal Fusion for accurate building detection and change detection in bi-temporal remote sensing images

arXiv:2208.00657v2 Announce Type: cross Abstract: Building detection and change detection using remote sensing images can help urban and rescue planning. Moreov

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Double-Helix Active Geometry: LiDAR-Anchored Multi-View Depth with Selective Abstention

arXiv:2607.02561v1 Announce Type: cross Abstract: Consumer depth sensors such as the LiDAR scanner on recent iPhones provide metric range, but their useful rang

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Additive Causal Construction for Transferable and Reconfigurable Cross-System Learning in Multi-Source Image Fusion

arXiv:2607.02572v1 Announce Type: cross Abstract: In multi-source image fusion scenarios, heterogeneous inputs are typically driven by distinct generative mecha

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Fusion: A Framework for Unified Sequential Token AdaptatIon in VisiOn TraNsformers

arXiv:2607.02612v1 Announce Type: cross Abstract: Vision Transformers achieve strong image classification accuracy but process all image regions with nearly the

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Post-Generation Curation of Synthetic Images via Homogeneous-Heterogeneous Splitting

arXiv:2607.02637v1 Announce Type: cross Abstract: Recent generative models can produce high-quality synthetic images, offering scalable training data f

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Diagnosing Aerial-View Object Detectors with Foundational Image Generative Models

arXiv:2607.02718v1 Announce Type: cross Abstract: Recent advances in large-scale image generative models enable photorealistic scene synthesis with controllable

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

R3D: Quantitative 3D Spatial Reasoning for Egocentric Wearables

arXiv:2607.02921v1 Announce Type: cross Abstract: Quantitative 3D spatial reasoning from egocentric RGB-D video is a critical capability for next-generation wea

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Pooling-Based Context Modeling for Convolution-Free Deep Image Prior

arXiv:2607.02952v1 Announce Type: cross Abstract: Convolutional Neural Networks (CNNs) achieve strong denoising performance by exploiting spatial context from n

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Attention-Guided Efficientnet Architecture For Precise Criminal Identification in Surveillance Images

arXiv:2607.03073v1 Announce Type: cross Abstract: Criminal identification from surveillance imagery has become a critical research area in intelligent forensic

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

A Multi-Task Deep Learning Framework for Real-Time Intelligent Video Surveillance with Temporal Event Validation

arXiv:2607.03131v1 Announce Type: cross Abstract: Modern video surveillance systems generate far more video streams than human operators can effectively monitor

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Semantic Segmentation-Driven Image-Level Diagnosis of Liver Cancers in Hematoxylin and Eosin Histopathology Images

arXiv:2607.03253v1 Announce Type: cross Abstract: As hematoxylin & eosin (H&E) staining constitutes the primary entry point in routine diagnostic workflows, com

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Latent Clarity: Bridging World-Model Kinematics to Semantic Manifolds for Video Anomaly Anticipation

arXiv:2607.03558v1 Announce Type: cross Abstract: Continuous video anomaly detection is dominated by reactive Multiple Instance Learning (MIL) that collapses sp

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

PLGSA-Transformer: Periocular Landmark-Guided Attention with Occlusion-Adaptive Cosine Thresholding for Cross-Modal Masked and Unmasked Face Recognition

arXiv:2607.03581v1 Announce Type: cross Abstract: The widespread adoption of facial masks, accelerated by COVID-19 and mandated in security-sensitive settings,

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

ClinOCR-Bench: A Comprehensive Clinical Scanned Document Dataset for Optical Character Recognition Model Evaluation

arXiv:2607.03650v1 Announce Type: cross Abstract: Extracting textual information from scanned medical documents, such as external laboratory reports and manuall

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Rethinking Depth Pruning for Vision Transformers: A Heterogeneity-Aware Perspective

arXiv:2607.03784v1 Announce Type: cross Abstract: While prior studies have successfully compressed vision Transformers (ViTs) through various pruning techniques

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

CineMobile: On-Device Image-to-Video Diffusion for Cinematic Camera Motion Generation

arXiv:2607.03803v1 Announce Type: cross Abstract: The growing demand for image-to-video creation on mobile devices has increasingly focused on cinematic motion

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

CGGS: Consistency-Augmented Geometric Gaussian Splatting for Ego-centric 3D Scene Generation

arXiv:2607.03819v1 Announce Type: cross Abstract: Challenges remain in ego-centric 3D scene generation due to limited view overlap and the dominant influence of

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

GeoSelect: Spatial-Program Execution for Training-Free Referring Remote Sensing Image Segmentation

arXiv:2607.03869v1 Announce Type: cross Abstract: Referring remote sensing image segmentation isolates the object named by a natural-language expression in an a

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

SOV-CAD: Stepwise Orthographic Views Guided CAD Modeling Sequence Reconstruction

arXiv:2607.04119v1 Announce Type: cross Abstract: Reconstructing Computer-Aided Design (CAD) modeling sequences from images is crucial for preserving design int

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

HCSU: A Dataset and Benchmark for Fine-Grained Historical Calligraphy Style Understanding

arXiv:2607.04147v1 Announce Type: cross Abstract: Automated fine-grained perception of calligraphy styles--a task vital to cultural heritage preservation--remai

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

CRISP: A Spatiotemporal Camera-Radar Backbone for Driving via Forecasting-Based World-Model Pretraining

arXiv:2607.04541v1 Announce Type: cross Abstract: Camera-radar (CR) fusion is a practical sensing configuration for autonomous driving, but existing models are

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

LCPNet: Latent Consistent Proximal Unfolding Network for Infrared Small Target Detection

arXiv:2607.04603v1 Announce Type: cross Abstract: Infrared small target detection (IRSTD) aims to identify long distance small targets from complex infrared bac

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Do All Visual Tokens Matter Equally? Object-Evidence Preserving Token Merging for Vision-Language Retrieval

arXiv:2607.04605v1 Announce Type: cross Abstract: Multi-vector vision-language retrieval preserves fine-grained visual evidence through maximum-similarity late

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Targeted Structure Completion for Sparse-View 3D Reconstruction in Autonomous Driving

arXiv:2607.04661v1 Announce Type: cross Abstract: Reconstructing 3D scene structures from sparse, low-overlap observations remains a fundamental challenge in au

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

EventCoT: Event-centric Video Chain-of-thought for Reasoning Temporal Localization

arXiv:2607.04872v1 Announce Type: cross Abstract: Reasoning temporal localization (RTL) requires a model to generate an answer that itself contains the time int

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

MemPose: Category-level Object Pose Estimation with Memory

arXiv:2607.04930v1 Announce Type: cross Abstract: In the pursuit of robust and generalizable category-level object pose estimation, most existing methods adopt

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

From Fixed to Free Cameras: Calibration-Free View-Robust Vision-Language-Action Model

arXiv:2607.05396v1 Announce Type: cross Abstract: Real-world robot deployment rarely maintains the training-stage camera setup, where cameras often experience r

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Domain Knowledge-Informed Self-Supervised Representations for Workout Form Assessment

arXiv:2202.14019v3 Announce Type: replace-cross Abstract: Maintaining proper form while exercising is important for preventing injuries and maximizing muscle ma

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Learning to Visually Connect Actions and their Effects

arXiv:2401.10805v4 Announce Type: replace-cross Abstract: We introduce the novel concept of visually Connecting Actions and Their Effects (CATE) in video unders

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Quick ViTs: Speeding up Vision Transformers through Equivariance

arXiv:2505.15441v5 Announce Type: replace-cross Abstract: Natural images exhibit strong geometric regularities: local structures, such as edges, corners, and te

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

FuseMamba-VD: Dual Branch VideoMamba with Gated Class Token Fusion for Violence Detection

arXiv:2506.03162v3 Announce Type: replace-cross Abstract: The rapid proliferation of surveillance cameras has increased the demand for automated violence detect

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens

arXiv:2508.04928v5 Announce Type: replace-cross Abstract: We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

SilvaScenes: Tree Detection and Species Classification from Under-Canopy Images in Natural Forests

arXiv:2510.09458v2 Announce Type: replace-cross Abstract: Interest in forestry automation is growing alongside rapid advances in deep learning. In particular, t

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Tracing 3D Anatomy in 2D Strokes: A Multi-Stage Projection Driven Approach to Cervical Spine Fracture Identification

arXiv:2601.15235v4 Announce Type: replace-cross Abstract: Cervical spine fractures require rapid and accurate diagnosis, yet automatic CT interpretation remains

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

arXiv:2601.22054v2 Announce Type: replace-cross Abstract: Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric

ArXiv cs.AI 👁️ Computer Vision 📄 Paper ⚡ AI Lesson 3d ago

Human-like Object Grouping in Self-supervised Vision Transformers

arXiv:2603.13994v2 Announce Type: replace-cross Abstract: Vision foundation models trained with self-supervised objectives achieve strong performance across div

OpenCV 5.0 Is a Big Deal — With One Big Asterisk

Medium · LLM 👁️ Computer Vision ⚡ AI Lesson 3d ago

OpenCV 5.0, released on June 4, 2026, is a major update to the computer vision library. It replaces the legacy DNN engine with a modern…

How to hide strings in C++ binaries with consteval

Dev.to · LulLaS 👁️ Computer Vision ⚡ AI Lesson 3d ago

How to hide strings in C++ binaries with consteval — and why it beats xorstr The...

Designing a Clean Computer Vision UI with OpenCV

Medium · Programming 👁️ Computer Vision ⚡ AI Lesson 3d ago

Default overlays look like debug output

Instead of Feeding an Image to the Model, What If You Rotated the Model Toward the Image?

Medium · LLM 👁️ Computer Vision ⚡ AI Lesson 3d ago

There’s a CVPR 2026 paper out of the Institute of Automation, Chinese Academy of Sciences with a genuinely odd premise. To appreciate why…

Reddit r/deeplearning 👁️ Computer Vision ⚡ AI Lesson 3d ago

Searching for a model which detects spread page of a book

I'm new to this sub, so sorry if this question is not appropriate for here. I'm currently developing a document scanner app that specializes in cropping the pag

Medium · Machine Learning 👁️ Computer Vision ⚡ AI Lesson 3d ago

Computer Vision: More Than Just Teaching Machines to See

When I first started learning about Artificial Intelligence, I thought it was all about robots, chatbots, and complex programming. During…