Bringing PyTorch Monarch to AMD GPUs: Single-Controller Distributed Tra... Liz Li & Zachary Streeter

PyTorch · Advanced · 🏗️ Systems Design & Architecture · 2w ago
Bringing PyTorch Monarch to AMD GPUs: Single-Controller Distributed Training on ROCm - Liz Li & Zachary Streeter, AMD

PyTorch Monarch introduces a new distributed programming paradigm that enables developers to orchestrate entire GPU clusters from a single Python program. With its actor-based runtime, process mesh abstraction, and asynchronous execution model, Monarch simplifies large-scale distributed training and enables complex workflows that combine training, evaluation, and reinforcement learning within one unified script.

In this talk, we present our work enabling PyTorch Monarch on AMD Instinct GPUs with ROCm, expanding the single-controller model beyond CUDA environments and bringing this emerging runtime to a broader hardware ecosystem. We describe the engineering effort required to port Monarch’s GPU runtime and distributed communication stack to ROCm, including HIPification of CUDA-specific components, adaptation of memory management and synchronization semantics, and integration with high-performance GPU-to-GPU communication on multi-node clusters through RDMA.

We will share lessons learned from running Monarch workloads on MI300-class clusters, including performance considerations, debugging workflows, and developer experience improvements. Our results demonstrate that Monarch’s architecture can be successfully extended to heterogeneous hardware environments while preserving scalability and ease of use. This work advances hardware diversity in distributed PyTorch and highlights how portable runtimes can simplify large-scale training while enabling scalable, cluster-wide experimentation across accelerator platforms.
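The single-controller idea in the abstract (one Python program driving a mesh of actors through asynchronous calls) can be illustrated with a toy sketch. This is not Monarch's API and does not touch GPUs or ROCm: it uses only the standard library's `asyncio`, and every name in it (`TrainerActor`, `ActorMesh`, `call`, `run_controller`) is hypothetical and chosen for illustration only.

```python
import asyncio


class TrainerActor:
    """Hypothetical worker; stands in for a process that would own one GPU."""

    def __init__(self, rank: int):
        self.rank = rank
        self.steps = 0

    async def train_step(self, lr: float) -> dict:
        # Yield to the event loop, as a real remote call would not block the caller.
        await asyncio.sleep(0)
        self.steps += 1
        return {"rank": self.rank, "steps": self.steps, "lr": lr}


class ActorMesh:
    """Hypothetical mesh: a shaped group of actors addressed as one unit."""

    def __init__(self, actor_cls, hosts: int, gpus_per_host: int):
        self.shape = {"hosts": hosts, "gpus": gpus_per_host}
        self.actors = [actor_cls(rank) for rank in range(hosts * gpus_per_host)]

    def call(self, method: str, *args):
        # Broadcast the call to every actor and return a future right away.
        # Must be invoked from inside a running event loop.
        return asyncio.gather(*(getattr(a, method)(*args) for a in self.actors))


async def controller() -> list:
    # The whole "cluster" is driven from this one coroutine.
    mesh = ActorMesh(TrainerActor, hosts=2, gpus_per_host=4)
    pending = mesh.call("train_step", 1e-3)  # issued, not yet awaited
    # The controller could schedule evaluation or other work here.
    return await pending  # results come back in actor (rank) order


def run_controller() -> list:
    return asyncio.run(controller())


if __name__ == "__main__":
    for result in run_controller():
        print(result)
```

The point of the sketch is the control flow only: the controller issues a call against the whole mesh, keeps running, and collects results later. In a real runtime the actors would be separate processes on separate hosts rather than objects in one event loop.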
