I explain Fully Sharded Data Parallel (FSDP) and pipeline parallelism in 3D with the Vision Pro
Build intuition about how scaling massive LLMs works. I cover two techniques for making LLM training very fast, Fully Sharded Data Parallel (FSDP) and pipeline parallelism, visualized in 3D with the Vision Pro. I'm excited to see how AR can help teach complex ideas.
It has been a long-time dream of mine to show, conceptually, how I visualize these systems.
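The video explains these ideas visually; for readers who prefer code, here is a minimal single-process sketch of FSDP's core storage trick. The `shard` and `all_gather` helpers are illustrative stand-ins, not the real PyTorch FSDP API: each "GPU" permanently holds only its slice of a layer's weights, and the full layer is reassembled just in time for compute, then freed.

```python
# Toy, single-process sketch of the FSDP idea (not the PyTorch API):
# each of N "GPUs" permanently stores only 1/N of a layer's weights, and
# the full layer is reassembled (all-gathered) just in time for compute.

def shard(flat_weights, world_size):
    """Split a flat weight list into equal per-rank shards."""
    n = len(flat_weights) // world_size
    return [flat_weights[i * n:(i + 1) * n] for i in range(world_size)]

def all_gather(shards):
    """Reconstruct the full weight list from every rank's shard."""
    return [w for s in shards for w in s]

world_size = 2                  # two GPUs, as in the video's setup
layer = [1.0, 2.0, 3.0, 4.0]    # a tiny "layer" of 4 weights

shards = shard(layer, world_size)
# Steady-state memory per rank is 1/world_size of the layer...
assert all(len(s) == len(layer) // world_size for s in shards)
# ...and a temporary all-gather restores the exact full layer for compute.
assert all_gather(shards) == layer
```

This is why FSDP's memory upper bound per GPU is roughly one full layer (the largest unit being gathered) plus the rank's permanent 1/N shard of everything else.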
Chapters:
00:00 Introduction
01:02 Two machines each with 2 GPUs
01:37 Transformer models blocks
02:02 Forward pass
02:10 Backward pass
02:43 Fully Sharded Data Parallel introduction
02:51 Layer sharding
03:30 Weight concat
05:25 Memory upper bound
05:58 Why more GPUs speed up training
07:23 Shard across nodes (machines)
09:20 Sharding a block across nodes
10:14 Another way of seeing sharding
11:30 Understand interconnect bottleneck
12:00 Hybrid sharding
15:00 Pipeline parallelism
16:04 Forward pass in pipeline parallelism
16:10 Intuition around pipeline parallelism
16:50 Future directions on pipeline parallelism
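The pipeline-parallelism chapters above build intuition about overlapping stages; a tiny arithmetic sketch makes the same point. Assuming a GPipe-style forward schedule where every micro-batch takes one "tick" per stage (the function name and cost model are made up for illustration), splitting a batch into micro-batches amortizes the pipeline's fill/drain bubble:

```python
# Toy GPipe-style forward schedule: S stages, M micro-batches, each step
# takes 1 tick. Stage s starts micro-batch m at tick s + m, so the last
# micro-batch leaves the last stage after S + M - 1 ticks.

def pipeline_ticks(num_stages, num_microbatches):
    """Ticks to push all micro-batches through the forward pass."""
    return num_stages + num_microbatches - 1

# With 1 micro-batch the pipeline is fully serial (all bubble):
assert pipeline_ticks(4, 1) == 4
# With 8 micro-batches the 4 stages overlap, so 11 ticks of work replace
# the 4 * 8 = 32 ticks a fully serial schedule would need per stage-step.
assert pipeline_ticks(4, 8) == 11
```

The bubble fraction (S - 1) / (S + M - 1) shrinks as M grows, which is the intuition behind using many small micro-batches in pipeline parallelism.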