Deploying Mixtral on GKE with just 2 x 24 GB L4 GPUs
Lingo, an open-source ML proxy and autoscaler for Kubernetes: https://github.com/substratusai/lingo
Blog post with copy-pasteable instructions: https://www.substratus.ai/blog/deploying-mixtral-gptq-on-gke-l4-gpus
Learn how to deploy Mixtral on GKE using just 2 x 24 GB L4 GPUs. We do this with GPTQ quantization, which loads Mixtral in 4-bit mode.
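As a back-of-the-envelope check of why 4-bit quantization makes this fit (the ~47B total parameter count is an approximation of Mixtral 8x7B's size):

```shell
# Rough VRAM estimate for Mixtral under GPTQ 4-bit quantization:
# ~47B total parameters x 0.5 bytes/weight (4 bits) ≈ 23.5 GB of weights,
# which fits across 2 x 24 GB L4s with headroom left for the KV cache.
awk 'BEGIN { params = 47e9; bytes_per_weight = 0.5;
             printf "~%.1f GB of weights\n", params * bytes_per_weight / 1e9 }'
```

At 16-bit precision the same calculation gives ~94 GB, which is why the unquantized model cannot fit on two L4s.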
0:00 - Introduction
0:12 - Calculating GPU memory required for Mixtral with GPTQ
1:40 - High-level overview of the steps to deploy Mixtral on GKE
2:20 - Create GKE cluster with L4 GPU nodepool
3:35 - Download the Mixtral model weights to PVC using K8s job
5:45 - Deploy Mixtral using the Helm vLLM chart
9:19 - Validate Mixtral is up and running by sending a prompt
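The chapters above condense to roughly the following commands. This is a hedged sketch, not the authoritative version: the cluster and pool names, region, Helm repo URL, chart name, and chart values are placeholder assumptions — the linked blog post has the exact copy-pasteable instructions.

```shell
# 1. Create a GKE cluster and an L4 GPU node pool.
#    g2-standard-24 machines come with 2 x NVIDIA L4 (24 GB each).
gcloud container clusters create mixtral-demo --region us-central1 --num-nodes 1
gcloud container node-pools create l4-pool \
  --cluster mixtral-demo --region us-central1 \
  --machine-type g2-standard-24 \
  --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
  --num-nodes 1

# 2. Deploy the GPTQ-quantized Mixtral with the vLLM Helm chart.
#    Repo URL, chart name, and value keys below are assumptions —
#    check the blog post for the real ones.
helm repo add substratusai https://substratusai.github.io/helm
helm install mixtral substratusai/vllm \
  --set model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
  --set quantization=gptq

# 3. Validate: vLLM serves an OpenAI-compatible API, so send a prompt
#    (service name "mixtral" is a placeholder).
kubectl port-forward svc/mixtral 8080:80 &
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
       "prompt": "Who was the first person on the moon?",
       "max_tokens": 50}'
```

Between steps 1 and 2 the video also downloads the model weights to a PersistentVolumeClaim with a Kubernetes Job, so the pods don't re-download ~24 GB of weights on every restart.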