QCFuse: Query-Centric Cache Fusion for Efficient RAG Inference
📰 ArXiv cs.AI
arXiv:2604.08585v1 Announce Type: cross Abstract: Cache fusion accelerates the generation process of RAG-equipped LLMs through KV caching and selective token recomputation, thereby reducing computational cost and improving efficiency. However, existing methods rely primarily on local perspectives for token selection and lack global awareness derived from the user query. Exploiting this global awareness is challenging due to the high cost of obtaining context-aware query representations and the strict p
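The abstract's core mechanism, reusing precomputed per-chunk KV caches and recomputing only a selected subset of token positions, can be illustrated with a toy sketch. This is a generic illustration of the cache-fusion idea, not the paper's method: `toy_kv` stands in for a transformer layer's key/value projection, and the selection of which positions to recompute (here passed in as `recompute_idx`) is exactly the step the paper's query-centric selection would drive.

```python
import numpy as np

def toy_kv(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for a transformer layer's KV projection.

    A fixed random projection keeps the sketch deterministic; a real
    system would run the model's key/value heads here.
    """
    rng = np.random.default_rng(0)           # fixed seed -> same projection every call
    proj = rng.standard_normal((tokens.shape[1], 4))
    return tokens @ proj                     # (n_tokens, 4)

def fuse_caches(chunk_caches, recompute_idx, all_tokens):
    """Fuse precomputed per-chunk KV caches into one sequence cache,
    then recompute KV only at the selected positions (e.g. tokens whose
    cached values are stale because they lacked cross-chunk context)."""
    fused = np.concatenate(chunk_caches, axis=0)
    fused[recompute_idx] = toy_kv(all_tokens)[recompute_idx]
    return fused

# Usage: two retrieved chunks cached independently, then fused with a
# small recomputation budget at the chunk boundary.
tokens = np.arange(16 * 8, dtype=float).reshape(16, 8)
caches = [toy_kv(tokens[:8]), toy_kv(tokens[8:])]
fused = fuse_caches(caches, np.array([7, 8]), tokens)
print(fused.shape)  # (16, 4): full-sequence cache built with only 2 recomputed positions
```

The efficiency argument is that `fuse_caches` touches only `len(recompute_idx)` positions with the (expensive) projection over the full sequence, while the rest of the cache is reused as-is; the paper's contribution concerns choosing `recompute_idx` with global, query-aware information rather than purely local heuristics.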