Qwen 3.5 | Building a Visual AI Agent to Control Your Computer

Roboflow · Beginner · 🤖 AI Agents & Automation · 6h ago
In this video, Matvei Popov, Machine Learning Engineer at Roboflow, explores the capabilities of the Qwen 3.5 vision-language model (VLM). Unlike previous architectures that separated language and vision encoders, Qwen 3.5 natively combines vision, language, and coding capabilities within a single model, unlocking advanced agentic capabilities such as tool calling and direct computer use.

Matvei starts by showing how to run Qwen 3.5 locally using Roboflow Inference and how to set it up within Roboflow Workflows for basic image description tasks. He demonstrates how to use the 0.8B and 2B models, adjust parameters like system prompts, and configure token generation limits.

The true power of Qwen 3.5 is revealed in the second half of the video, when Matvei builds a visual AI agent capable of controlling his computer. Using a custom Python script and Roboflow Inference, Matvei tasks Qwen 3.5 with navigating the Roboflow UI to kick off a new model training job. The model analyzes screenshots of the UI, outputs normalized screen coordinates for specific buttons, and executes the clicks autonomously, proving how this technology can be used to automate complex UI manipulation or guide physical systems.

= Additional Resources =
Roboflow Inference: https://inference.roboflow.com/
Roboflow Workflows: https://roboflow.com/workflows

= Chapters =
00:00 Introduction to Qwen 3.5: Why Native Vision-Language Models Matter
02:03 Setting Up Qwen 3.5 in Roboflow Workflows
03:51 Running Qwen 3.5 Locally with Roboflow Inference
05:47 Building a Visual Agent for "Computer Use" Automation
07:29 Live Demo: Qwen 3.5 Clicks Buttons and Starts a Training Job
08:38 Exploring Tool Calls and Future Use Cases
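To make the workflow setup concrete, here is a minimal sketch of querying a Qwen 3.5 block through Roboflow Workflows against a locally running Inference server. The workspace name, workflow ID, and image path are placeholders, and the exact model identifiers for the 0.8B and 2B variants should be taken from Roboflow's model registry; this illustrates the client call pattern rather than the exact code from the video.

```python
# A minimal sketch, assuming a local Roboflow Inference server started with
# `inference server start` (serving on port 9001 by default) and a workflow
# built in the Roboflow UI that contains a Qwen 3.5 block.
# "your-workspace" and "qwen-image-description" are placeholder names.
from inference_sdk import InferenceHTTPClient

client = InferenceHTTPClient(
    api_url="http://localhost:9001",  # local Inference server
    api_key="YOUR_ROBOFLOW_API_KEY",
)

# Run the workflow on one image and print whatever the Qwen block returned
# (e.g. an image description, given a suitable system prompt in the workflow).
result = client.run_workflow(
    workspace_name="your-workspace",
    workflow_id="qwen-image-description",
    images={"image": "example.jpg"},  # local path or URL
)
print(result)
```

System prompts and token generation limits like those adjusted in the video are configured on the Qwen block inside the workflow editor, so the client call itself stays the same as parameters change.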
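The agent loop itself can be outlined in a few lines. Below is an illustrative sketch of the screenshot-to-click cycle described in the demo: the ask_qwen() helper, the prompt wording, the button label, and the JSON response shape are assumptions standing in for whatever the actual script does; only the pyautogui calls and the normalized-to-pixel conversion follow directly from the description.

```python
# An illustrative sketch of the screenshot -> coordinates -> click loop.
# ask_qwen() is a hypothetical helper (e.g. wrapping the workflow call above);
# the prompt and response shape are assumptions, not the script from the video.
import pyautogui

def ask_qwen(image_path: str, instruction: str) -> dict:
    """Hypothetical: send the screenshot and instruction to Qwen 3.5 and
    return coordinates normalized to [0, 1], e.g. {"x": 0.42, "y": 0.17}."""
    raise NotImplementedError

screen_w, screen_h = pyautogui.size()

# One step of the agent: capture the screen, ask where the target button is,
# scale the normalized coordinates to pixels, and click.
pyautogui.screenshot("screen.png")
target = ask_qwen(
    "screen.png",
    'Return the normalized x,y of the "Start Training" button as JSON.',
)
pyautogui.click(int(target["x"] * screen_w), int(target["y"] * screen_h))
```

Normalized coordinates make the model's output resolution-independent, which is why the script rescales them against the actual screen size before clicking.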

Related AI Lessons

Actually, vibe coding didn't kill testing — agentic engineering did
Learn how agentic engineering is changing the landscape of testing and development, and why it's more impactful than vibe coding
Dev.to · Muggle AI
Gemini 3.1 Flash Lite vs DeepSeek V4 Flash: Budget API Showdown for High-Volume Agent Loops (2026)
Compare Gemini 3.1 Flash Lite and DeepSeek V4 Flash for budget-friendly API options in high-volume agent loops, considering tradeoffs between pricing and reliability
Dev.to AI
WebMCP Reality Check: Where the Spec Actually Stands
Learn the current state of WebMCP and its limitations, and why major agents aren't using it yet
Dev.to AI
The 2026 Enterprise AI Mandate: From Generative Potential to Agentic Execution
Enterprises must shift from AI experimentation to Agentic Execution, leveraging AI as a proactive coworker in operational workflows
Dev.to AI
