Boost LLM performance: New SGLang course is live 🚀

DeepLearningAI · Advanced · 🧠 Large Language Models · 3w ago
Learn more: https://bit.ly/4du2u69

Introducing Efficient Inference with SGLang: Text and Image Generation, built in partnership with LMSys and RadixArk and taught by Richard Chen, a Member of Technical Staff at RadixArk.

Running LLMs in production is expensive. Much of that cost comes from redundant computation: every new request forces the model to reprocess the same system prompt and shared context from scratch. SGLang is an open-source inference framework that eliminates that waste by caching computation that's already been done and reusing it across future requests.

In this course, you'll build a clear mental model of how inference works (from input tokens to generated output) and learn why the memory bottleneck exists. From there, you'll implement the KV cache from scratch to store and reuse intermediate attention values within a single request. Then you'll go further with RadixAttention, SGLang's approach to sharing KV cache across requests by identifying common prefixes with a radix tree. Finally, you'll apply these same optimization principles to image generation using diffusion models.

In detail, you'll:

- Build a mental model of LLM inference: how a model processes input tokens, generates output token by token, and where the computational cost accumulates.
- Implement the attention mechanism from scratch and build a KV cache to store and reuse intermediate key-value tensors, cutting redundant computation within a single request.
- Extend caching across requests using SGLang's RadixAttention, which uses a radix tree to identify shared prefixes across users and skip repeated processing.
- Apply SGLang's caching strategies to diffusion models for faster image generation, and explore multi-GPU parallelism for further acceleration.
- Survey where the inference field is heading, including emerging techniques and how the optimization principles from this course apply to future developments.

By the end, you'll have hands-on experience with the caching strategies that make efficient LLM inference possible.
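To make the KV-cache idea concrete, here is a minimal single-head sketch in NumPy (not the course's actual code): during decoding, each new token's key and value are appended to a cache, so a step only computes one query against the stored tensors instead of re-running attention over the whole sequence. All names (`KVCache`, `attend_one_step`) are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Stores key/value tensors for already-processed tokens so each
    decode step only computes projections for the newest token."""
    def __init__(self, d_model):
        self.keys = np.empty((0, d_model))
        self.values = np.empty((0, d_model))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def attend_one_step(q, k_new, v_new, cache):
    # Cache the newest token's K/V, then attend over everything cached.
    cache.append(k_new, v_new)
    scores = q @ cache.keys.T / np.sqrt(q.shape[-1])  # shape (1, t)
    weights = softmax(scores)
    return weights @ cache.values                     # shape (1, d_model)
```

Without the cache, step t would recompute keys and values for all t tokens; with it, each step does O(t) attention but only O(1) new K/V work, which is where the memory-versus-compute trade-off in the course comes from.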
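The cross-request sharing in RadixAttention rests on prefix matching. As a rough illustration (a toy per-token trie, not SGLang's compressed radix tree or its real KV-block management), the index below records which token prefixes already have cached KV, so a new request only needs to recompute its unmatched tail:

```python
class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> child node
        self.kv_slot = None  # stand-in for a handle to cached KV data

class RadixCache:
    """Toy prefix index: match_prefix reports how many leading tokens of
    a new request already have cached KV from earlier requests."""
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        # Record that KV for every prefix of `tokens` is now cached.
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
            node.kv_slot = object()  # placeholder for a real KV block

    def match_prefix(self, tokens):
        # Walk down the trie as far as the request's tokens match.
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched
```

For example, after caching a request whose prompt tokenizes to `[1, 2, 3, 4]`, a new request `[1, 2, 3, 9]` matches a prefix of length 3, so only the last token's KV must be computed; this is how a shared system prompt gets processed once and reused across users.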

Related AI Lessons

PagedAttention: vLLM's Solution to GPU Memory Waste
Learn how PagedAttention solves GPU memory waste for large language models (LLMs) and improve your LLM serving efficiency
Medium · ChatGPT

From 30 to 60 Tokens/Second: How I Got vLLM Running on 2x RTX 3090
Learn how to install and run vLLM on 2x RTX 3090 to achieve 60 tokens/second, a significant performance boost for LLM applications
Medium · LLM

Running an Offline LLM in React Native (2026): Building Privacy-First AI That Works Without the…
Learn to build a privacy-first offline LLM in React Native, enabling AI functionality without internet connectivity
Medium · LLM

Google Chrome is Now Automatically Downloading 4GB AI Models to User Computers: What You Need to…
Google Chrome now downloads 4GB AI models to user computers, understand the implications and how it affects your device
Medium · LLM
Up next: 5 Levels of AI Agents - From Simple LLM Calls to Multi-Agent Systems · Dave Ebbelaar (LLM Eng)