Lightning Talk: Bringing Google’s Colossus to PyTorch: Rapid Storage via fsspec to Keep GPUs Busy - Ankita Luthra & Trinadh Kotturu, Google
As PyTorch models scale to billions of parameters, the bottleneck has quietly shifted from compute to storage. Modern GPU clusters often sit idle, "starving" for data while waiting on legacy REST-based protocols. This talk introduces Rapid Storage: a fundamental architectural shift that brings Google’s stateful Colossus protocol (which powers many of Google’s products) to PyTorch via fsspec, a common Pythonic file interface used by many frameworks within the PyTorch ecosystem.
By bypassing REST APIs entirely in favor of persistent gRPC streams to the storage layer, we eliminate protocol overhead. We also dive into how Rapid achieves less than 1 ms random read/write latency, 20x faster data access, and a massive 6 TB/s of aggregate throughput. Crucially, it delivers up to 10x lower tail latency for random I/O, preventing the stragglers that often stall distributed training jobs.
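To see why tail latency matters more as a job grows, here is a minimal back-of-the-envelope sketch. It assumes a simplified two-point latency model (each read is either "typical" or "tail") and a synchronous training step that waits for its slowest worker. The function names and every number below are hypothetical illustrations, not measurements from the talk.

```python
# Sketch: how rare slow reads turn into frequent slow steps in synchronous training.
# Assumption (not from the talk): each worker's read independently lands in the
# latency tail with probability `tail_prob`, and a step waits for the slowest worker.

def prob_step_hits_tail(num_workers: int, tail_prob: float) -> float:
    """Probability that at least one of `num_workers` reads is a tail read."""
    return 1.0 - (1.0 - tail_prob) ** num_workers


def expected_step_latency_ms(
    num_workers: int, tail_prob: float, typical_ms: float, tail_ms: float
) -> float:
    """Expected time a step waits on I/O under the two-point model."""
    p_slow = prob_step_hits_tail(num_workers, tail_prob)
    return (1.0 - p_slow) * typical_ms + p_slow * tail_ms


# Hypothetical numbers: 256 workers, 1% of reads are slow.
p = prob_step_hits_tail(num_workers=256, tail_prob=0.01)

# Same workload with the tail read 10x cheaper (illustrative values only).
slow_tail = expected_step_latency_ms(256, 0.01, typical_ms=1.0, tail_ms=100.0)
fast_tail = expected_step_latency_ms(256, 0.01, typical_ms=1.0, tail_ms=10.0)

print(f"P(step stalls on a tail read) = {p:.3f}")
print(f"expected I/O wait per step: {slow_tail:.1f} ms vs {fast_tail:.1f} ms")
```

Under these assumed numbers, a read that is slow only 1% of the time delays roughly 92% of steps once 256 workers must all finish, which is why shrinking the tail helps far more than improving the median.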
Beyond raw speed, we will deconstruct the integration with gcsfs and the broader fsspec ecosystem. This ensures that high-performance I/O is available across the entire data stack, including Dask, Ray, HF Datasets, and vLLM. Join us to learn how to stop wasting GPU cycles and achieve linear scaling in the cloud.
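The reason an fsspec-level integration reaches so many frameworks is that callers only depend on a URL and a file-like object. The sketch below shows that pattern with a random-offset read. It uses fsspec's built-in `memory://` backend so it runs without cloud credentials; the URL and helper names are hypothetical, and nothing here shows how Rapid Storage itself is enabled, which the abstract does not specify.

```python
import fsspec

# Hypothetical demo URL. "memory://" is fsspec's in-memory backend, used here
# only so the sketch is runnable anywhere.
DEMO_URL = "memory://rapid-demo/shard-00000.bin"


def write_bytes(url: str, data: bytes) -> None:
    """Write `data` to any fsspec-supported URL."""
    with fsspec.open(url, "wb") as f:
        f.write(data)


def read_range(url: str, offset: int, length: int) -> bytes:
    """Random read: fetch `length` bytes starting at `offset`."""
    with fsspec.open(url, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Create a small fake shard, then read a 4-byte "record" from the middle of it.
write_bytes(DEMO_URL, bytes(range(256)))
record = read_range(DEMO_URL, offset=16, length=4)
print(record)
```

With gcsfs installed, the same `read_range` helper would accept a `gs://` URL unchanged, since fsspec selects the backend from the URL scheme. That is the property that lets a storage-layer change surface in libraries built on fsspec without changes to their reading code.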