Multimodal LLMs — DeepCamp Skills

After this skill you can…

Use GPT-4V / Claude Vision for image understanding
Build document OCR pipelines
Chain audio → text → action workflows

Prerequisites

LLM Foundations

Watch (10 videos)

Multimodal Requirements Development

Daniel Finkenstadt · advanced hands-on

→ Use GPT-4 for multimodal interactions
→ Derive technical requirements from oral problem statements

Gemini 3: Code a visualization of nuclear fusion

Google DeepMind · intermediate hands-on

→ Generate multimodal content
→ Code a complex visual simulation

AI Generated Video Game is NOT SCI-FI Anymore!!!

1littlecoder · advanced hands-on

→ Generate interactive 3D worlds with AI
→ Create procedural content for games

Google Veo 3 Tutorial: How to create AI Videos in Flow, Gemini or Google Vids?

AI Tool Journey · beginner hands-on

→ Generate AI videos with Google Veo 3
→ Use Veo 3 in Gemini, Flow, and Google Vids

Ollama Multimodal: EASILY setup Llava locally & Integrate API

Mervin Praison · intermediate hands-on

→ Set up Ollama multimodal with LLaVA
→ Integrate the multimodal AI API

JETSON AI LAB | One-Shot Multimodal RAG on Jetson Orin

NVIDIA Developer · beginner hands-on

→ Perform one-shot classification/recognition with multimodal RAG
→ Tag images in a vector DB at runtime

RIP KLING AI! FREE NSFW 120s IMAGE TO VIDEO KING on 6 GB VRAM!

Aitrepreneur · beginner hands-on

→ Generate videos from images using FramePack
→ Use the WebUI for video creation

REVOLUTIONARY FREE AI Model Inside Stable Diffusion! EDIT ANY IMAGE USING TEXT!

Aitrepreneur · beginner hands-on

→ Edit images using text prompts with InstructPix2Pix
→ Install AI models in Stable Diffusion

Nano Banana Tutorial: How I Made Money Just Uploading AI Images!

Darrel Wilson · beginner hands-on

→ Create AI-generated images with Nano Banana
→ Monetize AI content on online platforms

LlamaIndex Workshop: Multimodal + Advanced RAG Workhop with Gemini

LlamaIndex · intermediate hands-on

→ Build a multimodal RAG pipeline with LlamaIndex and Gemini
→ Extract structured outputs from images using LLMs

Read (10 articles)

Local Multimodal LLM on iOS with `llama.cpp` (Swift + ObjC++)

Dev.to · Timothy Fosteman · 2026-05-11

Multimodal Gemma 4 Visual Regression & Patch Agent

Dev.to · Dickson Kanyingi · 2026-05-23

APEX-1 Can Now See: Building a Vision-Language AI Architecture From Scratch

Medium · Deep Learning · 2026-05-30

The Papers That Taught Machines to Comprehend

Medium · LLM · 2026-04-16

MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games

ArXiv cs.AI · 2026-04-15

GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology

ArXiv cs.AI · 2026-04-20

Multimodal evaluators: MLLM-as-a-judge for image-to-text tasks in Strands Evals

AWS Machine Learning · 2026-05-20

Power video semantic search with Amazon Nova Multimodal Embeddings

AWS Machine Learning · 2026-04-17

Multimodal AI Explained: Text, Image, Audio and Video in One Tool

Dev.to AI · 2026-04-20

Beyond the Prompt: How to Actually Work With Gemini Omni

Medium · AI · 2026-05-22