POP: Prefill-Only Pruning for Efficient Large Model Inference
📰 ArXiv cs.AI
arXiv:2602.03295v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities; however, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from substantial accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles of the prefill and decoding stages.