WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

📰 ArXiv cs.AI

arXiv:2604.08558v1 · Announce Type: cross

Abstract: Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In this paper, we propose WAND (Windowed Attention and Knowledge Distillation), a framework that adapts pretrained AR-TTS models to operate with constant computational and memory complexity. WAND separates the attention mechanism into two: persistent […]
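The quadratic-versus-constant cost contrast in the abstract comes down to restricting each query to a fixed-size causal window. The sketch below is a generic NumPy illustration of sliding-window causal attention, not the paper's actual WAND mechanism; the function name and window size are hypothetical:

```python
import numpy as np

def windowed_causal_attention(q, k, v, window=4):
    """Causal self-attention where each query attends only to the last
    `window` positions (itself included), so the attended context per
    step is O(window) rather than O(sequence length)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Mask out future positions and positions older than the window.
    idx = np.arange(T)
    mask = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over the surviving positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `window=1` each position attends only to itself, so the output equals `v`; growing `window` trades compute for more context, while full attention corresponds to `window=T`.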

Published 13 Apr 2026