WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
📰 ArXiv cs.AI
arXiv:2604.08558v1 Announce Type: cross
Abstract: Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In this paper, we propose WAND, Windowed Attention and Knowledge Distillation, a framework that adapts pretrained AR-TTS models to operate with constant computational and memory complexity. WAND separates the attention mechanism into two: persistent …
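The abstract contrasts the quadratic cost of full self-attention with windowed attention that keeps per-step compute and memory constant. As a rough illustration only (the paper's exact formulation is not given in the truncated abstract), the sketch below shows a causal sliding-window attention mask in PyTorch; the function names and the `window` parameter are assumptions, not WAND's API, and the dense score matrix is materialized purely for clarity.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where a query may attend: causal (no future keys) and
    only the most recent `window` positions, including itself."""
    idx = torch.arange(seq_len)
    rel = idx[:, None] - idx[None, :]      # query index minus key index
    return (rel >= 0) & (rel < window)     # past-only, within the window

def windowed_attention(q, k, v, window: int):
    """Scaled dot-product attention restricted to a causal sliding window.
    q, k, v: (batch, heads, seq_len, head_dim). Illustrative only: a
    constant-memory decoder would use a rolling KV cache instead of
    building the full (seq_len x seq_len) score matrix."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    mask = sliding_window_causal_mask(q.size(-2), window).to(scores.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Because each query attends to at most `window` keys, per-token attention cost no longer grows with the total sequence length, which is the constant-complexity property the abstract claims; the knowledge-distillation component used to adapt the pretrained full-attention model is not shown here.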