Tracing a 13x PyTorch Slowdown to a Hidden NumPy Synchronization
📰 Dev.to · Ingero Team
TL;DR: A .cpu().numpy() call buried inside a forward pass was forcing a full CPU-GPU synchronization...
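The article's own code is not shown in this excerpt, so the following is only a minimal sketch of the kind of pattern the TL;DR describes. The module names `SyncingBlock` and `OnDeviceBlock`, the layer size, and the normalization step are all hypothetical. The first module pulls an intermediate tensor to the host with `.cpu().numpy()` inside `forward()`, which makes the CPU wait for the queued GPU work before it can continue. The second computes the same value with tensor ops, so nothing has to leave the device.

```python
import torch
import torch.nn as nn


class SyncingBlock(nn.Module):
    """Hypothetical layer with a host round-trip hidden in forward()."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.linear(x)
        # .cpu() copies h to host memory; on a CUDA tensor the CPU blocks
        # here until the GPU work that produces h has finished.
        scale = float(abs(h.detach().cpu().numpy()).max())
        return h / scale


class OnDeviceBlock(nn.Module):
    """Same computation, with the reduction kept on the tensor's device."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.linear(x)
        # The max stays a 0-dim tensor on the same device as h,
        # so no device-to-host copy is needed.
        scale = h.detach().abs().max()
        return h / scale
```

Both modules return the same values for the same weights and input. On a CPU-only machine the two behave alike; the stall only appears when the tensors live on a GPU.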