Lightning Talk: Bringing Google’s Colossus to PyTorch: Rapid Stor... Ankita Luthra & Trinadh Kotturu

PyTorch · Beginner ·🔧 Backend Engineering ·2w ago
Lightning Talk: Bringing Google’s Colossus to PyTorch: Rapid Storage via fsspec to Keep GPUs Busy - Ankita Luthra & Trinadh Kotturu, Google

As PyTorch models scale to billions of parameters, the bottleneck has quietly shifted from compute to storage. Modern GPU clusters often sit idle, "starving" for data while waiting on legacy REST-based protocols. This talk introduces Rapid Storage: a fundamental architectural shift that brings Google’s stateful Colossus protocol (which powers many Google products) to PyTorch via fsspec, a common Pythonic file interface used by many frameworks within the PyTorch ecosystem. By bypassing REST APIs entirely and using persistent gRPC streams to the storage layer, we eliminate protocol overhead.

In this talk, we also dive into how Rapid achieves less than 1 ms random read/write latency, 20x faster data access, and 6 TB/s of aggregate throughput. Crucially, it delivers up to 10x lower tail latency for random I/O, preventing the stragglers that often stall distributed training jobs.

Beyond raw speed, we will deconstruct the integration with gcsfs and the broader fsspec ecosystem. This integration makes high-performance I/O available across the entire data stack, including Dask, Ray, HF Datasets, and vLLM. Join us to learn how to stop wasting GPU cycles and achieve linear scaling in the cloud.
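To make the fsspec angle concrete, here is a minimal sketch of the random-read access pattern the abstract refers to, written against the generic fsspec filesystem interface. It does not show Rapid Storage itself: the abstract does not say how Rapid is enabled in gcsfs, so the example uses fsspec's built-in in-memory filesystem as a stand-in. The bucket-like path, the fixed record size, and the `read_record` helper are all hypothetical names chosen for illustration.

```python
import fsspec


def read_record(fs, path, offset, length):
    """Read `length` bytes starting at `offset` (a random read)."""
    with fs.open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Assumption: an in-memory filesystem stands in for a real bucket so the
# sketch runs without credentials or network access.
fs = fsspec.filesystem("memory")

# Hypothetical shard made of fixed-size 8-byte records.
path = "/demo-bucket/shard-000.bin"
RECORD_SIZE = 8
records = [f"rec{i:05d}".encode() for i in range(4)]
with fs.open(path, "wb") as f:
    f.write(b"".join(records))

# Fetch only the third record, as a data loader sampling at random would.
third = read_record(fs, path, 2 * RECORD_SIZE, RECORD_SIZE)
print(third)
```

Because `read_record` only relies on the common `open`, `seek`, and `read` calls, the same function should work unchanged with a filesystem obtained from `fsspec.filesystem("gcs")` when gcsfs is installed and credentials are configured. Whether such reads go over the Rapid path depends on the gcsfs integration described in the talk, which this sketch does not cover.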
