How 120B+ Parameter Models Run on One GPU (The MoE Secret)

Shane | LLM Implementation · Advanced · 🧠 Large Language Models · 8mo ago
How is it possible for a 120 billion parameter AI model to run on a single consumer GPU? This isn't magic: it's the result of a series of breakthroughs in AI architecture. In this video, we uncover how massive Large Language Models (LLMs) can run efficiently, focusing on the Mixture of Experts (MoE) architecture. We take you on a journey from the limitations of older Recurrent Neural Networks (RNNs) to the parallel power of the Transformer, show how the industry moved beyond inefficient brute-force scaling by adopting the Chinchilla Scaling Law for compute-optimal training (a back-of-the-envelope sketch follows below), and finally explain how MoE lets a model carry immense knowledge without incurring massive per-token compute costs (a minimal routing sketch also follows below). This is the key to today's powerful, knowledgeable, and cost-effective AI.

🕒 Timestamps / Chapters:
00:00 - The 120 Billion Parameter Paradox
00:30 - Problem 1: The RNN Memory Wall & Sequential Processing
01:10 - Breakthrough 1: The Transformer & Attention Mechanism
02:09 - The Brute-Force Scaling Trap & Diminishing Returns
02:48 - Breakthrough 2: The Chinchilla Scaling Law for Optimal Training
03:44 - Breakthrough 3: Mixture of Experts (MoE) - The Ultimate Efficiency Hack
04:00 - Dense Models vs. Sparse MoE Models Explained
04:55 - DEMO: A Tour of Modern MoE Models
05:00 - OpenAI: gpt-oss-120b
05:18 - Qwen: Qwen3-Coder-480B
05:36 - Zhipu: GLM-4.5-Air-Base
06:03 - Summary: How Architecture Became the Ultimate Strategy

🤖 Models Featured in this Video:
OpenAI gpt-oss-120b: 117 Billion Total Parameters | 5.1 Billion Active
Qwen Qwen3-Coder-480B: 480 Billion Total Parameters | 35 Billion Active
Zhipu GLM-4.5-Air-Base: 106 Billion Total Parameters | 12 Billion Active

📚 Resources & Further Learning:
"Attention Is All You Need" Paper: https://arxiv.org/abs/1706.03762
DeepMind's Chinchilla Paper: https://arxiv.org/abs/2203.15556
Explore MoE Models on Hugging Face: https://huggingface.co/blog/moe

This presentation is inspired by the core concepts in
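As a rough illustration of the Chinchilla idea mentioned above, here is a back-of-the-envelope calculation. It uses the commonly cited heuristics of roughly 20 training tokens per parameter and training compute of about 6 × parameters × tokens FLOPs; this is a sketch of the rule of thumb, not the paper's exact fitted scaling laws.

```python
# Back-of-the-envelope Chinchilla arithmetic (sketch of the common rule of
# thumb, not the paper's exact fitted laws): ~20 tokens per parameter, and
# training compute of roughly 6 * N * D FLOPs.

def chinchilla_optimal_tokens(params: float) -> float:
    """Approximate compute-optimal number of training tokens."""
    return 20 * params

def training_flops(params: float, tokens: float) -> float:
    """Rough training compute estimate: C ~= 6 * N * D."""
    return 6 * params * tokens

params = 70e9                               # e.g. a dense 70B-parameter model
tokens = chinchilla_optimal_tokens(params)  # ~1.4e12 tokens (about 1.4 trillion)
print(f"tokens: {tokens:.2e}, training FLOPs: {training_flops(params, tokens):.2e}")
```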
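And here is a minimal sketch of the sparse top-k routing idea behind MoE: every expert lives in memory (the "total" parameters), but each token is only sent through a couple of them (the "active" parameters). The hidden size, expert count, and router below are toy values for illustration, not the configuration of any model featured in the video.

```python
# Minimal sketch of sparse top-k MoE routing (illustrative only; not any
# specific model's implementation).
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 64        # toy hidden dimension
NUM_EXPERTS = 8    # experts held in memory (the "total" parameters)
TOP_K = 2          # experts actually run per token (the "active" parameters)

# Each expert is a small feed-forward block; all of them must fit in memory,
# but only TOP_K of them do any work for a given token.
expert_weights = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.02
                  for _ in range(NUM_EXPERTS)]
router_weights = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route one token to its top-k experts and mix their outputs."""
    logits = token @ router_weights              # score every expert
    top = np.argsort(logits)[-TOP_K:]            # keep only the best TOP_K
    gates = softmax(logits[top])                 # renormalise their scores
    out = np.zeros_like(token)
    for gate, idx in zip(gates, top):
        out += gate * (token @ expert_weights[idx])  # only these experts run
    return out

token = rng.standard_normal(HIDDEN)
print(moe_layer(token).shape)  # (64,): same output shape, ~TOP_K/NUM_EXPERTS of the FLOPs
```

Because only TOP_K of the NUM_EXPERTS experts run per token, per-token compute scales with the active parameters while memory still has to hold all of them, which is why a 117B-total / 5.1B-active model is so much cheaper to run than a dense 117B model.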


Chapters (12)

0:00 The 120 Billion Parameter Paradox
0:30 Problem 1: The RNN Memory Wall & Sequential Processing
1:10 Breakthrough 1: The Transformer & Attention Mechanism
2:09 The Brute-Force Scaling Trap & Diminishing Returns
2:48 Breakthrough 2: The Chinchilla Scaling Law for Optimal Training
3:44 Breakthrough 3: Mixture of Experts (MoE) - The Ultimate Efficiency Hack
4:00 Dense Models vs. Sparse MoE Models Explained
4:55 DEMO: A Tour of Modern MoE Models
5:00 OpenAI: gpt-oss-120b
5:18 Qwen: Qwen3-Coder-480B
5:36 Zhipu: GLM-4.5-Air-Base
6:03 Summary: How Architecture Became the Ultimate Strategy