I explain Fully Sharded Data Parallel (FSDP) and pipeline parallelism in 3D with the Vision Pro
Build intuition about how scaling massive LLMs works. I cover two techniques for making LLM training very fast, Fully Sharded Data Parallel (FSDP) and pipeline parallelism, visualized in 3D with the Vision Pro. I'm excited to see how AR can help teach complex ideas.
It has been a long-time dream of mine to show, conceptually, how I visualize these systems.
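The video explains these ideas visually; for readers who prefer code, here is a minimal single-process sketch of FSDP's core storage trick. The `shard` and `all_gather` helpers are illustrative stand-ins, not the real PyTorch FSDP API: each "GPU" permanently holds only its slice of a layer's weights, and the full layer is reassembled just in time for compute, then freed.

```python
# Toy, single-process sketch of the FSDP idea (not the PyTorch API):
# each of N "GPUs" permanently stores only 1/N of a layer's weights, and
# the full layer is reassembled (all-gathered) just in time for compute.

def shard(flat_weights, world_size):
    """Split a flat weight list into equal per-rank shards."""
    n = len(flat_weights) // world_size
    return [flat_weights[i * n:(i + 1) * n] for i in range(world_size)]

def all_gather(shards):
    """Reconstruct the full weight list from every rank's shard."""
    return [w for s in shards for w in s]

world_size = 2                  # two GPUs, as in the video's setup
layer = [1.0, 2.0, 3.0, 4.0]    # a tiny "layer" of 4 weights

shards = shard(layer, world_size)
# Steady-state memory per rank is 1/world_size of the layer...
assert all(len(s) == len(layer) // world_size for s in shards)
# ...and a temporary all-gather restores the exact full layer for compute.
assert all_gather(shards) == layer
```

This is why FSDP's memory upper bound per GPU is roughly one full layer (the largest unit being gathered) plus the rank's permanent 1/N shard of everything else.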
Chapters:
00:00 Introduction
01:02 Two machines each with 2 GPUs
01:37 Transformer models blocks
02:02 Forward pass
02:10 Backward pass
02:43 Fully Sharded Data Parallel introduction
02:51 Layer sharding
03:30 Weight concat
05:25 Memory upper bound
05:58 Why more GPUs speed up training
07:23 Shard across nodes (machines)
09:20 Sharding a block across nodes
10:14 Another way of seeing sharding
11:30 Understand interconnect bottleneck
12:00 Hybrid sharding
15:00 Pipeline parallelism
16:04 Forward pass in pipeline parallelism
16:10 Intuition around pipeline parallelism
16:50 Future directions on pipeline parallelism
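The pipeline-parallelism chapters above build intuition about overlapping stages; a tiny arithmetic sketch makes the same point. Assuming a GPipe-style forward schedule where every micro-batch takes one "tick" per stage (the function name and cost model are made up for illustration), splitting a batch into micro-batches amortizes the pipeline's fill/drain bubble:

```python
# Toy GPipe-style forward schedule: S stages, M micro-batches, each step
# takes 1 tick. Stage s starts micro-batch m at tick s + m, so the last
# micro-batch leaves the last stage after S + M - 1 ticks.

def pipeline_ticks(num_stages, num_microbatches):
    """Ticks to push all micro-batches through the forward pass."""
    return num_stages + num_microbatches - 1

# With 1 micro-batch the pipeline is fully serial (all bubble):
assert pipeline_ticks(4, 1) == 4
# With 8 micro-batches the 4 stages overlap, so 11 ticks of work replace
# the 4 * 8 = 32 ticks a fully serial schedule would need per stage-step.
assert pipeline_ticks(4, 8) == 11
```

The bubble fraction (S - 1) / (S + M - 1) shrinks as M grows, which is the intuition behind using many small micro-batches in pipeline parallelism.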