torch.compile and Diffusers: A Hands-On Guide to Peak Performance - Sayak Paul, Hugging Face
This session shows how to use torch.compile with the Diffusers library to speed up diffusion models like Flux.1-Dev.
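The pattern throughout is the same: load a pipeline, compile its transformer, and reuse it. A minimal sketch, assuming a CUDA GPU and a recent diffusers release with FluxPipeline (the prompt and step count are placeholders):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Compile only the Diffusion Transformer: it dominates inference time,
# so compiling it alone captures most of the available speedup.
pipe.transformer = torch.compile(
    pipe.transformer, fullgraph=True, mode="max-autotune"
)

# The first call pays the compile cost; subsequent calls run the
# optimized graph.
image = pipe("a photo of a cat", num_inference_steps=28).images[0]
```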
You'll learn practical techniques for both model authors and users. For authors, we cover how to make models compiler-friendly and verify it by compiling with fullgraph=True, which errors out on graph breaks instead of silently falling back. For users, we explain regional compilation (which cuts cold-start compile time by roughly 7x while keeping the same runtime gains) and how to avoid recompilations with dynamic=True.
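Regional compilation can be sketched by hand: instead of compiling the whole transformer, compile each repeated block, so the compiler does the expensive work once and reuses it across the stack. Continuing the sketch above (the attribute names match FluxTransformer2DModel; recent diffusers releases also expose a convenience helper for this pattern):

```python
# `pipe` is the FluxPipeline loaded in the previous sketch.
import torch

for blocks in (pipe.transformer.transformer_blocks,
               pipe.transformer.single_transformer_blocks):
    for i, block in enumerate(blocks):
        # dynamic=True relaxes shape guards so changing the resolution
        # or batch size does not trigger a recompile.
        blocks[i] = torch.compile(block, fullgraph=True, dynamic=True)
```

Because every block shares the same architecture, compiling one block and stamping it out across the stack is where the large cold-start savings come from.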
We also cover real-world scenarios: running on memory-constrained GPUs using CPU offloading and quantization, and swapping LoRA adapters without triggering recompilation.
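A sketch of the memory-constrained path, combining 4-bit NF4 weight quantization (via diffusers' bitsandbytes integration, which requires the bitsandbytes package) with model-level CPU offloading; exact savings depend on the GPU and resolution:

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

# Quantize the transformer weights to 4-bit NF4.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

# Move each component to the GPU only while it runs; the rest wait in
# CPU RAM. Note: do not call pipe.to("cuda") when offloading.
pipe.enable_model_cpu_offload()
```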
Key takeaways:
- Compiling just the Diffusion Transformer (DiT) delivers ~1.5x speedup on H100
- Regional compilation reduces cold-start compile time from 67s to 9.6s
- NF4 quantization cuts memory from 33GB to 15GB
- Combining quantization + offloading drops memory to 12.2GB
- LoRA hot-swap lets you switch adapters without recompiling (see the sketch after this list)
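A sketch of the hot-swap recipe, assuming recent diffusers and peft versions; the repo ids are placeholders, and the two key points are enabling hot-swapping before compiling and reusing the first adapter's default name:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Pre-allocate LoRA slots; target_rank must be >= the largest rank
# among the adapters you plan to swap in.
pipe.enable_lora_hotswap(target_rank=64)

# Load the first adapter ("user/lora-one" is a placeholder), then compile.
pipe.load_lora_weights("user/lora-one")
pipe.transformer = torch.compile(pipe.transformer, fullgraph=True)

# Swap the second adapter into the same slot ("default_0" is the name
# diffusers gives the first adapter). Weights change in place, so the
# compiled graph stays valid and nothing recompiles.
pipe.load_lora_weights("user/lora-two", hotswap=True, adapter_name="default_0")
```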
Whether you're building diffusion models or using them, this guide helps you get the best performance with minimal effort.