Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
📰 ArXiv cs.AI
arXiv:2604.22782v1 Announce Type: cross

Abstract: Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is significant and heavily impacts serving costs. This work proposes to lessen these memory requirements. While recent work has largely addressed KV cache reduction via compression and eviction along the temporal axis, we argue that the *depth* dimension
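To make the motivation concrete, here is a minimal sketch of the KV caching the abstract refers to: during autoregressive decoding, each step's keys and values are computed once and appended to a cache, so earlier tokens' projections are reused rather than recomputed. This is an illustrative single-head toy in NumPy, not the paper's method; all names (`Wk`, `Wv`, `Wq`, `attention`) are invented for the example.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention: one query vector over all cached keys/values.
    scores = q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 4
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # toy projection matrices

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

outputs = []
for step in range(5):                       # autoregressive decode loop
    x = rng.normal(size=d)                  # stand-in for the current token embedding
    # Compute this step's key/value once and append; later steps reuse
    # the cached rows instead of re-projecting every past token.
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    outputs.append(attention(x @ Wq, K_cache, V_cache))

print(K_cache.shape)  # cache grows linearly with sequence length: (5, 4)
```

The cache trades memory for compute: its size scales with sequence length times model depth times hidden width, which is the footprint the abstract targets.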