SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration

📰 arXiv cs.AI

arXiv:2604.12247v1 Announce Type: cross

Abstract: Speculative decoding has emerged as a promising approach to accelerating autoregressive inference in large language models (LLMs). Self-draft methods, which leverage the base LLM itself for speculation, avoid the overhead of auxiliary draft models but face two limitations: shallow layers often produce overconfident yet incorrect token predictions, and difficult tokens in a draft sequence force redundant computation through deeper layers…
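The excerpt cuts off before the method details, but the mechanism it gestures at (shallow layers draft tokens, calibrated confidence decides when a token is too hard to speculate on, and drafting is bounded so difficult tokens go straight to the full model) corresponds to a generic self-speculative decoding loop. Below is a minimal sketch under that reading, assuming a greedy accept/reject rule; the stub model, `CALIB_TEMP`, `THRESHOLD`, and every constant are hypothetical placeholders, not values from the paper.

```python
# Toy sketch of confidence-gated, bounded self-speculation.
# NOT SpecBound itself: the stub "model", per-depth calibration
# temperatures, threshold, and all constants are invented placeholders.

import math
import random

NUM_LAYERS = 8     # assumed full model depth
DRAFT_DEPTH = 2    # assumed shallow depth used for self-drafting
MAX_DRAFT = 5      # assumed hard cap on speculative run length
VOCAB_SIZE = 50
THRESHOLD = 0.3    # assumed calibrated-confidence gate for drafting

# Assumed per-depth calibration temperatures: shallow layers are
# overconfident, so their logits are flattened before thresholding.
CALIB_TEMP = {DRAFT_DEPTH: 1.5, NUM_LAYERS: 1.0}


def _logits(ctx, depth):
    """Stub for 'run the first `depth` layers and read logits through a
    shared head': a context-keyed base distribution plus depth-scaled
    noise, so shallow depths deviate more from the full model."""
    base = random.Random(hash(tuple(ctx)))
    shared = [base.gauss(0.0, 3.0) for _ in range(VOCAB_SIZE)]
    jitter = random.Random(hash((tuple(ctx), depth)))
    scale = 2.0 / depth
    return [s + jitter.gauss(0.0, scale) for s in shared]


def _softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def _argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


def draft(prefix):
    """Greedy shallow-layer drafting with an adaptive bound: stop at
    MAX_DRAFT tokens, or earlier when calibrated confidence drops below
    THRESHOLD (a 'difficult' token is left to the full model)."""
    ctx, tokens = list(prefix), []
    while len(tokens) < MAX_DRAFT:
        t = CALIB_TEMP[DRAFT_DEPTH]
        probs = _softmax([x / t for x in _logits(ctx, DRAFT_DEPTH)])
        tok = _argmax(probs)
        if probs[tok] < THRESHOLD:
            break
        tokens.append(tok)
        ctx.append(tok)
    return tokens


def verify(prefix, draft_tokens):
    """Full-depth verification (greedy variant): keep the longest
    agreeing prefix, replace the first mismatch, and append one bonus
    token when every draft token is accepted."""
    ctx, accepted = list(prefix), []
    for tok in draft_tokens:
        full = _argmax(_logits(ctx, NUM_LAYERS))
        if full != tok:
            accepted.append(full)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(_argmax(_logits(ctx, NUM_LAYERS)))
    return accepted


def generate(prompt, n_tokens):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(verify(out, draft(out)))
    return out[len(prompt):]


if __name__ == "__main__":
    print(generate([1, 2, 3], 12))
```

The per-depth temperature is about the simplest possible form of layer-wise calibration, and the fixed gate plus hard cap stands in for the "adaptive bound"; the paper's actual calibration and bounding schemes are not described in this excerpt.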

Published 15 Apr 2026