Deploying Mixtral on GKE with just 2 x 24 GB L4 GPUs
Lingo, an open-source ML proxy and autoscaler for Kubernetes: https://github.com/substratusai/lingo
Blog post with copy-pasteable instructions: https://www.substratus.ai/blog/deploying-mixtral-gptq-on-gke-l4-gpus
Learn how to deploy Mixtral on GKE using just 2 x 24 GB L4 GPUs. We do this with GPTQ quantization, which loads Mixtral in 4-bit mode.
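As a back-of-the-envelope check of why 4-bit quantization makes this fit (the ~47B total parameter count is an approximation of Mixtral 8x7B's size):

```shell
# Rough VRAM estimate for Mixtral under GPTQ 4-bit quantization:
# ~47B total parameters x 0.5 bytes/weight (4 bits) ≈ 23.5 GB of weights,
# which fits across 2 x 24 GB L4s with headroom left for the KV cache.
awk 'BEGIN { params = 47e9; bytes_per_weight = 0.5;
             printf "~%.1f GB of weights\n", params * bytes_per_weight / 1e9 }'
```

At 16-bit precision the same calculation gives ~94 GB, which is why the unquantized model cannot fit on two L4s.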
0:00 - Introduction
0:12 - Calculating GPU memory required for Mixtral with GPTQ
1:40 - High-level overview of the steps to deploy Mixtral on GKE
2:20 - Create GKE cluster with L4 GPU nodepool
3:35 - Download the Mixtral model weights to PVC using K8s job
5:45 - Deploy Mixtral using the Helm vLLM chart
9:19 - Validate Mixtral is up and running by sending a prompt
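The chapters above condense to roughly the following commands. This is a hedged sketch, not the authoritative version: the cluster and pool names, region, Helm repo URL, chart name, and chart values are placeholder assumptions — the linked blog post has the exact copy-pasteable instructions.

```shell
# 1. Create a GKE cluster and an L4 GPU node pool.
#    g2-standard-24 machines come with 2 x NVIDIA L4 (24 GB each).
gcloud container clusters create mixtral-demo --region us-central1 --num-nodes 1
gcloud container node-pools create l4-pool \
  --cluster mixtral-demo --region us-central1 \
  --machine-type g2-standard-24 \
  --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
  --num-nodes 1

# 2. Deploy the GPTQ-quantized Mixtral with the vLLM Helm chart.
#    Repo URL, chart name, and value keys below are assumptions —
#    check the blog post for the real ones.
helm repo add substratusai https://substratusai.github.io/helm
helm install mixtral substratusai/vllm \
  --set model=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
  --set quantization=gptq

# 3. Validate: vLLM serves an OpenAI-compatible API, so send a prompt
#    (service name "mixtral" is a placeholder).
kubectl port-forward svc/mixtral 8080:80 &
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
       "prompt": "Who was the first person on the moon?",
       "max_tokens": 50}'
```

Between steps 1 and 2 the video also downloads the model weights to a PersistentVolumeClaim with a Kubernetes Job, so the pods don't re-download ~24 GB of weights on every restart.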