Efficiently Serving LLMs

Free to audit on Coursera

Coursera · Intermediate · 🧠 Large Language Models · 2mo ago
Join our new short course, Efficiently Serving Large Language Models, and learn from Travis Addair, CTO at Predibase, how to serve LLM applications from the ground up. Whether you’re ready to launch your own application or just getting started building it, the topics in this course will deepen your foundational knowledge of how LLMs work and help you understand the performance trade-offs you must consider when building LLM applications that serve large numbers of users. You’ll walk through the most important optimizations that allow LLM vendors to serve models efficiently to many customers, including strategies for working with multiple fine-tuned models at once.

In this course, you will:

1. Learn how auto-regressive large language models generate text one token at a time.
2. Implement the foundational elements of a modern LLM inference stack in code, including KV caching, continuous batching, and model quantization, and benchmark their impact on inference throughput and latency.
3. Explore the details of how LoRA adapters work, and learn how batching techniques allow different LoRA adapters to be served to multiple customers simultaneously.
4. Get hands-on with LoRAX, Predibase’s inference server framework, to see these optimization techniques implemented in a real-world LLM inference server.

Knowing how LLM servers operate under the hood will help you understand the options you have for increasing the performance and efficiency of your LLM-powered applications.
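To make the KV caching idea from the course outline concrete, here is a minimal sketch in plain Python. It is not taken from the course material or from LoRAX: the single attention head, the toy hidden size `D`, the random weights, and the names `KVCache` and `step_no_cache` are all assumptions chosen for illustration. It shows that caching each token's key and value gives the same attention output as recomputing them for the whole prefix at every decoding step.

```python
import math
import random

D = 4  # toy hidden size (assumption for illustration)
rng = random.Random(0)


def matvec(W, x):
    """Multiply a D x D matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(q))]


def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]


# Hypothetical projection weights for a single attention head.
Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()


def step_no_cache(prefix):
    """Without a cache: re-project every token in the prefix at each step."""
    keys = [matvec(Wk, x) for x in prefix]
    values = [matvec(Wv, x) for x in prefix]
    return attend(matvec(Wq, prefix[-1]), keys, values)


class KVCache:
    """With a cache: project only the newest token and reuse earlier keys/values."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        self.keys.append(matvec(Wk, x))
        self.values.append(matvec(Wv, x))
        return attend(matvec(Wq, x), self.keys, self.values)


# Six random vectors stand in for token embeddings.
tokens = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(6)]

cache = KVCache()
cached_outputs = [cache.step(x) for x in tokens]
uncached_outputs = [step_no_cache(tokens[: t + 1]) for t in range(len(tokens))]

# Largest element-wise difference between the two approaches.
max_diff = max(
    abs(a - b)
    for co, uo in zip(cached_outputs, uncached_outputs)
    for a, b in zip(co, uo)
)
print(f"max difference between cached and uncached outputs: {max_diff:.2e}")
```

In this sketch the uncached path performs one key and one value projection per prefix token at every step, so its projection work grows quadratically with sequence length, while the cached path performs one of each per step at the cost of keeping all earlier keys and values in memory.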