Latest PyTorch's Secret Power to Handle Sequences of 10K or 100K Length
In this video, we'll be exploring a very powerful feature in PyTorch 1.13+ that you may not be harnessing the full power of: Flash Attention. I'll show you how memory-efficient the new PyTorch implementation is, and how it helps us fit 10K or even 100K long sequences on a modest GPU.
Flash attention repo: https://github.com/HazyResearch/flash-attention
Github code: https://github.com/thushv89/tutorials_deeplearninghero/blob/master/llms/flash_attention_torch.ipynb
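As background (this snippet is not from the video's notebook), PyTorch exposes Flash Attention through `torch.nn.functional.scaled_dot_product_attention`. A minimal sketch, assuming PyTorch 2.0+ where the API is stable, with toy tensor sizes chosen only for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical toy sizes for illustration; the video works with far
# longer sequences (10K-100K tokens).
batch, heads, seq_len, head_dim = 2, 4, 1024, 64

# Attention inputs in (batch, heads, seq_len, head_dim) layout.
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# PyTorch picks the best available backend automatically: Flash Attention
# on supported GPUs, otherwise a memory-efficient or math fallback.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 1024, 64])
```

On recent PyTorch versions you can restrict the backend choice (e.g. with the `torch.nn.attention.sdpa_kernel` context manager) to verify that Flash Attention is actually being used on your GPU.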
00:00 - Introduction
00:28 - Scaled dot product attention in PyTorch
01:51 - Google colab environment
02:21 - PyTorch version for Flash Attention
02:51 - Input data
03:01 - Hyperparameters and the architecture
04:17 - A few important arguments to the model
04:55 - Utility functions
05:39 - PyTorch without Flash Attention
07:06 - PyTorch with Flash Attention
07:49 - Limitations of Flash Attention
08:59 - Analysing the results
10:47 - Conclusion