Diagonal-Tiled Mixed-Precision Attention for Efficient Low-Bit MXFP Inference
📰 ArXiv cs.AI
arXiv:2604.03950v1 Announce Type: cross
Abstract: Transformer-based large language models (LLMs) have demonstrated remarkable performance across a wide range of real-world tasks, but their inference cost remains prohibitively high due to the quadratic complexity of attention and the memory bandwidth limitations of high-precision operations. In this work, we present a low-bit mixed-precision attention kernel using the microscaling floating-point (MXFP) data format, utilizing the computing capabil…
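The abstract is truncated, so the paper's exact MXFP configuration is not spelled out here. As a rough illustration of the microscaling idea such a kernel builds on, the sketch below round-trips a vector through MXFP4-style block quantization: blocks of 32 values share a single power-of-two scale, and each element is rounded to the MXFP4 (E2M1) value grid. The block size, the E2M1 grid, and the scale rule follow the common OCP Microscaling convention; the function name is hypothetical, and this is NumPy pseudocode for the numerics, not the paper's fused attention kernel.

```python
import numpy as np

# Representable magnitudes of an MXFP4 (E2M1) element per the OCP Microscaling
# formats; this helper is a hypothetical illustration, not the paper's kernel.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_dequantize_mxfp4(x, block_size=32):
    """Round-trip a 1-D float array through MXFP4 block quantization:
    each block of `block_size` values shares one power-of-two scale,
    and every element is rounded to the nearest E2M1 magnitude."""
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    # Shared per-block scale: a power of two chosen so the block's largest
    # magnitude lands near the top of the E2M1 range (max magnitude 6 = 1.5*2^2).
    amax = np.max(np.abs(blocks), axis=1, keepdims=True)
    amax = np.where(amax == 0, 1.0, amax)            # avoid log2(0)
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2.0)

    # Snap each scaled magnitude to the nearest grid point (this also clamps
    # anything above 6), then restore the sign and scale back up.
    scaled = blocks / scale
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    deq = np.sign(scaled) * E2M1_GRID[idx] * scale
    return deq.reshape(-1)[: len(x)]

# Tiny usage example: quantization error stays on the order of the block scale.
x = np.random.randn(128).astype(np.float32)
x_hat = quantize_dequantize_mxfp4(x)
print("max abs error:", float(np.max(np.abs(x - x_hat))))
```

Power-of-two block scales are the key property here: applying or removing the scale is a cheap exponent adjustment rather than a full multiply, which is what makes low-bit MXFP operands attractive for bandwidth-bound attention.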