Lightning Talk: Bringing Google’s Colossus to PyTorch: Rapid Stor... Ankita Luthra & Trinadh Kotturu

Name: Lightning Talk: Bringing Google’s Colossus to PyTorch: Rapid Stor... Ankita Luthra & Trinadh Kotturu
Uploaded: 2026-04-20T20:22:24Z
Channel: PyTorch
Description: Lightning Talk: Bringing Google’s Colossus to PyTorch: Rapid Storage via fsspec to Keep GPUs Busy - Ankita Luthra & Trinadh Kotturu, Google As PyTorch m...

PyTorch · Beginner ·🔧 Backend Engineering ·2w ago

Skills: Backend Performance90%API Design70%

Lightning Talk: Bringing Google’s Colossus to PyTorch: Rapid Storage via fsspec to Keep GPUs Busy - Ankita Luthra & Trinadh Kotturu, Google As PyTorch models scale to billions of parameters, the bottleneck has quietly shifted from compute to storage. Modern GPU clusters often sit idle, "starving" for data while waiting on legacy REST-based protocols. This talk introduces Rapid Storage: a fundamental architectural shift bringing Google’s Colossus stateful protocol (that powers many Google’s products) to PyTorch via fsspec , a common Pythonic file interface used by many frameworks within PyTorch ecosystem. By bypassing REST APIs entirely via persistent gRPC streams to the storage layer, we eliminate protocol overhead. In this talk, we also dive into how Rapid achieves less than 1ms random read/write latency, 20x faster data access, and a massive 6 TB/s of aggregate throughput. Crucially, it delivers up to 10x lower tail latency for random I/O, preventing the stragglers that often stall distributed training jobs. Beyond raw speed, we will deconstruct the integration with gcsfs and the broader fsspec ecosystem. This ensures that high-performance I/O is available across the entire data stack including Dask, Ray, HF Datasets and vLLM etc. Join us to learn how to stop wasting GPU cycles and achieve linear scaling in the cloud.

Watch on YouTube ↗ (saves to browser)