Prune-Quantize-Distill: An Ordered Pipeline for Efficient Neural Network Compression

📰 ArXiv cs.AI

arXiv:2604.04988v1 Announce Type: cross

Abstract: Modern deployment often requires trading accuracy for efficiency under tight CPU and memory constraints, yet common compression proxies such as parameter count or FLOPs do not reliably predict wall-clock inference time. In particular, unstructured sparsity can reduce model storage while failing to accelerate (and sometimes slightly slowing down) standard CPU execution due to irregular memory access and sparse kernel overhead. Motivated by this gap […]
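The gap the abstract points to can be illustrated in a few lines of NumPy. Below is a minimal sketch, assuming simple magnitude pruning as the source of unstructured sparsity (the paper's actual pipeline is not shown here): most weights are zeroed, yet the dense array occupies the same memory and a standard dense matmul still performs the same multiply-adds, so neither storage layout nor wall-clock time improves without a sparse format and kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)

# Unstructured magnitude pruning (illustrative baseline only):
# zero out the ~90% of weights with the smallest absolute value.
threshold = np.quantile(np.abs(W), 0.90)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0).astype(np.float32)

sparsity = 1.0 - np.count_nonzero(W_pruned) / W_pruned.size
print(f"sparsity: {sparsity:.2f}")

# The dense array is unchanged in size, and a dense matmul does not
# skip the zeros: standard BLAS kernels touch every entry regardless.
print(W.nbytes == W_pruned.nbytes)

x = rng.standard_normal(256).astype(np.float32)
y = W_pruned @ x  # same dense kernel, same FLOPs as W @ x
```

Realizing a speedup from this sparsity would require converting `W_pruned` to a compressed format (e.g. CSR) and a sparse kernel, whose overhead at moderate sparsity is exactly the failure mode the abstract describes.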

Published 8 Apr 2026