Fixing GPU Starvation in Large-Scale Distributed Training
Kashish Mittal is a Staff Software Engineer at Uber, working on large-scale distributed systems and core backend infrastructure.
Fixing GPU Starvation in Large-Scale Distributed Training // MLOps Podcast #367 with Kashish Mittal, Staff Software Engineer at Uber
Join the Community: https://go.mlops.community/YTJoinIn
Get the newsletter: https://go.mlops.community/YTNewsletter
MLOps GPU Guide: https://go.mlops.community/gpuguide
// Abstract
Kashish zooms out to discuss a universal industry pattern: how infrastructure—specifically data loading—is almost always the hidden constraint for ML sca…