Build Visual AI Agents

DeepLearningAI · Intermediate · 🧠 Large Language Models · 2mo ago

Skills: Multimodal LLMs 80% · AI Design Tools 70%

Key Takeaways

Builds visual AI agents for image and video generation using Google Cloud AI

Full Transcript

Introducing AI Agents for Image and Video Generation, built in partnership with Google and taught by Katie Nguyen and Wafae Bakkali. You might have worked with AI agents that produce text output, but human communication includes images, video, and audio. If you are writing a research report, building a product demo, or creating an explainer video, you're probably dealing with visual media. I use Google's Nano Banana image generation and Veo video generation a lot and have been impressed with how quickly these capabilities are improving. Incorporating these into agentic workflows will let you do even more than the consumer versions of these products. In this course, you'll build AI agents that generate images and videos, evaluate the output automatically, and iterate until it meets your quality standards. It turns out that a key to building a media agent is the evaluation, or the evals. The core challenge is that when a model generates an image or video from a prompt, there's no single correct answer to compare against, so the quality depends on the context, the brand requirements, and your use case.

That's right, Andrew. You will learn three evaluation techniques that work together. First, you will use fast image-text models to calculate the similarity between the image outputs and your text prompt. Then you will use an LLM-based judge to evaluate against custom criteria like brand consistency or visual quality. This gives you actionable feedback you can feed into the next generation. Finally, you will use structured rubrics to break down complex prompts into verifiable questions. For example: Is the subject in the frame? Does the camera motion match what you asked for? Is the style consistent across scenes? If you're ready to build agents that go beyond text, this course will get you there. I hope you enjoy the course.
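The generate, evaluate, and iterate cycle described in the transcript can be sketched roughly as follows. This is a minimal illustration, not the course's implementation: `refine_until_pass`, `Evaluation`, `fake_generate`, and `fake_evaluate` are hypothetical names, the threshold and round limit are arbitrary, and the stand-in functions replace what would be real calls to an image or video model and an evaluator.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Evaluation:
    score: float    # 0.0 to 1.0, higher is better
    feedback: str   # actionable note to fold into the next prompt


def refine_until_pass(
    prompt: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], Evaluation],
    threshold: float = 0.8,   # assumed quality bar
    max_rounds: int = 4,      # assumed retry budget
) -> Tuple[str, List[Evaluation]]:
    """Generate, evaluate, and retry with feedback until the score passes."""
    history: List[Evaluation] = []
    current_prompt = prompt
    output = ""
    for _ in range(max_rounds):
        output = generate(current_prompt)
        # Always evaluate against the original request, not the revised prompt.
        result = evaluate(prompt, output)
        history.append(result)
        if result.score >= threshold:
            break
        # Feed the evaluator's feedback into the next generation request.
        current_prompt = f"{prompt}\nRevise to address: {result.feedback}"
    return output, history


# Hypothetical stand-ins for a media model and an evaluator.
def fake_generate(prompt: str) -> str:
    return "mockup+logo" if "Revise" in prompt else "mockup"


def fake_evaluate(prompt: str, output: str) -> Evaluation:
    if "logo" in output:
        return Evaluation(score=1.0, feedback="")
    return Evaluation(score=0.4, feedback="add the brand logo")


final_output, rounds = refine_until_pass(
    "Landing page mockup", fake_generate, fake_evaluate
)
```

Keeping `generate` and `evaluate` as plain callables means any of the three evaluation techniques, or a combination of them, could be swapped in without changing the loop.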

Original Description

Learn more: https://bit.ly/43ctPTW

Join our new short course, AI Agents for Image and Video Generation, built in partnership with Google and taught by Katie Nguyen, Developer Relations Engineer at Google Cloud AI, and Wafae Bakkali, Staff Generative AI Specialist at Google.

Most agents you've worked with probably produce text. But whether you're building a product demo, a website asset, or an explainer video, you're working with visual media. With models like Google's Nano Banana for images and Veo for video, generating a single output from a prompt is straightforward. The harder problem is producing high-quality results consistently at scale, and the bottleneck there is evaluation: there is no single correct answer to compare against, so quality depends on context and use case.

In this course, you'll learn three complementary evaluation techniques, then combine them with image and video generation to build autonomous media agents. You'll build an image agent that turns brand guidelines into UI mockups, and a video agent that plans multi-scene explainers, animates reference frames with synchronized audio, and checks consistency across scenes. In the final lesson, you'll use Gemini CLI to build a generative media agent in natural language, packaging what you've learned into reusable agent skills.

In detail, you'll:

- Get a clear mental model of the generative media landscape and the architectures behind image, video, and audio generation.
- Engineer prompts for high-quality images and video, using techniques like LLM-enhanced prompting, reference images, and starting frames.
- Build evaluation pipelines that combine SigLIP image-text similarity scores, LLM-based judges, and structured rubrics to assess output at scale.
- Build an image agent that turns brand guidelines into UI mockups, generating, evaluating, and iterating until designs pass your bar.
- Build a video agent that plans multi-scene explainers, generates and animates reference frames with audio, and
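One way the three evaluation signals mentioned above might be combined into a single score is sketched below. Everything here is an assumption for illustration: the embedding vectors are hand-written stand-ins for what an image-text model such as SigLIP would produce, the judge score is a placeholder for an LLM-based judge's rating, and the function names and weights are hypothetical rather than taken from the course.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rubric_score(answers: Dict[str, bool]) -> float:
    """Fraction of verifiable rubric questions answered 'yes'."""
    if not answers:
        return 0.0
    return sum(answers.values()) / len(answers)


def combined_score(
    similarity: float,
    judge: float,
    rubric: float,
    weights: Tuple[float, float, float] = (0.3, 0.4, 0.3),  # assumed weights
) -> float:
    """Weighted blend of the three signals, each treated as a 0..1 value."""
    w_sim, w_judge, w_rubric = weights
    # Cosine similarity can be negative; clamp so every signal lies in 0..1.
    sim01 = max(0.0, similarity)
    return w_sim * sim01 + w_judge * judge + w_rubric * rubric


# Stand-in embeddings; a real pipeline would get these from an image-text model.
image_embedding = [0.2, 0.9, 0.1]
text_embedding = [0.25, 0.85, 0.05]

# Rubric questions like those given as examples in the transcript.
answers = {
    "Is the subject in the frame?": True,
    "Does the camera motion match what you asked for?": True,
    "Is the style consistent across scenes?": False,
}

overall = combined_score(
    similarity=cosine_similarity(image_embedding, text_embedding),
    judge=0.7,  # placeholder for an LLM-based judge's normalized rating
    rubric=rubric_score(answers),
)
```

A weighted sum is only one design choice; a pipeline could instead require each signal to clear its own threshold, which avoids a strong similarity score masking a failed rubric item.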
