Qwen 3.5 | Building a Visual AI Agent to Control Your Computer
Skills: Agent Foundations
In this video, Matvei Popov, Machine Learning Engineer at Roboflow, explores the capabilities of the Qwen 3.5 vision language model (VLM). Unlike previous architectures that separated language and vision encoders, Qwen 3.5 natively combines vision, language, and coding capabilities within a single model, unlocking advanced agentic behaviors like tool calling and direct computer use.
Matvei starts by showing how to run Qwen 3.5 locally using Roboflow Inference and how to set it up within Roboflow Workflows for basic image description tasks. He demonstrates how to use the 0.8B and 2B models, adjust parameters like system prompts, and configure token generation limits.
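The local setup described above can be sketched in a few lines of Python. This is a minimal sketch, not the exact configuration from the video: the workspace name, workflow ID, and response shape are illustrative assumptions, while `InferenceHTTPClient` and `run_workflow` come from the real `inference_sdk` package that talks to a locally running Roboflow Inference server.

```python
# Sketch: asking a locally served Qwen model for a basic image description
# via Roboflow Workflows. Placeholders are marked below; adapt them to your
# own workspace and workflow.

def extract_description(result: dict) -> str:
    """Pull the text answer out of a (hypothetical) workflow response."""
    outputs = result.get("outputs") or [{}]
    return outputs[0].get("description", "").strip()

def describe_image(image_path: str, api_key: str) -> str:
    # Imported lazily so the parsing helper above works without the package.
    from inference_sdk import InferenceHTTPClient

    client = InferenceHTTPClient(
        api_url="http://localhost:9001",  # default port of the local server
        api_key=api_key,
    )
    result = client.run_workflow(
        workspace_name="your-workspace",       # placeholder
        workflow_id="qwen-image-description",  # placeholder
        images={"image": image_path},
    )
    return extract_description({"outputs": result})
```

System prompt and max-token settings live on the workflow's VLM block, so the calling code stays the same as those parameters change.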
The true power of Qwen 3.5 is revealed in the second half of the video, when Matvei builds a visual AI agent capable of controlling his computer. Using a custom Python script and Roboflow Inference, Matvei tasks Qwen 3.5 with navigating the Roboflow UI to kick off a new model training job. The model analyzes screenshots of the UI, outputs normalized screen coordinates for specific buttons, and executes the clicks autonomously, showing how this technology could automate complex UI manipulation or guide physical systems.
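The loop above can be sketched as: capture a screenshot, ask the model where to click, map its normalized [0, 1] coordinates to pixels, and click. In this sketch `ask_qwen_for_click` is a hypothetical stand-in for the Qwen 3.5 call made through Roboflow Inference; `pyautogui` is a real desktop-automation library, and only the coordinate handling is concrete.

```python
# Sketch of the demo's "computer use" loop. The VLM call is stubbed out;
# the screenshot capture, coordinate conversion, and click are concrete.

def normalized_to_pixels(nx: float, ny: float, width: int, height: int) -> tuple[int, int]:
    """Clamp normalized model coordinates to [0, 1] and map them to pixels."""
    nx = min(max(nx, 0.0), 1.0)
    ny = min(max(ny, 0.0), 1.0)
    return min(round(nx * width), width - 1), min(round(ny * height), height - 1)

def ask_qwen_for_click(screenshot, instruction: str) -> tuple[float, float]:
    """Hypothetical: send the screenshot and instruction to Qwen 3.5 via
    Roboflow Inference and parse a normalized (x, y) target from its reply."""
    raise NotImplementedError

def click_target(instruction: str) -> None:
    import pyautogui  # lazy import: loading it requires a display

    width, height = pyautogui.size()
    screenshot = pyautogui.screenshot()
    nx, ny = ask_qwen_for_click(screenshot, instruction)
    x, y = normalized_to_pixels(nx, ny, width, height)
    pyautogui.click(x, y)  # the agent acts on the model's decision
```

Clamping before scaling keeps a slightly out-of-range model prediction from clicking off-screen, which matters when the agent runs unattended.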
= Additional Resources =
Roboflow Inference: https://inference.roboflow.com/
Roboflow Workflows: https://roboflow.com/workflows
= Chapters =
00:00 Introduction to Qwen 3.5: Why Native Vision-Language Models Matter
02:03 Setting Up Qwen 3.5 in Roboflow Workflows
03:51 Running Qwen 3.5 Locally with Roboflow Inference
05:47 Building a Visual Agent for "Computer Use" Automation
07:29 Live Demo: Qwen 3.5 Clicks Buttons and Starts a Training Job
08:38 Exploring Tool Calls and Future Use Cases
DeepCamp AI