Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory
📰 ArXiv cs.AI
arXiv:2604.04037v1 Announce Type: cross Abstract: Knowledge distillation compresses large teachers into smaller students, but performance saturates at a loss floor that persists across training methods and objectives. We argue this floor is geometric: neural networks represent far more features than dimensions through superposition, and a student of width $d_S$ can encode at most $d_S \cdot g(\alpha)$ features, where $g(\alpha) = \frac{1}{(1-\alpha)\ln\frac{1}{1-\alpha}}$ is a sparsity-dependent capacity …
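To make the capacity bound concrete, here is a minimal Python sketch of the factor $g(\alpha)$ and the resulting feature budget $d_S \cdot g(\alpha)$ stated in the abstract; the function and variable names (`g`, `max_features`, `d_s`) are illustrative, not from the paper.

```python
import math

def g(alpha: float) -> float:
    """Sparsity-dependent capacity factor from the abstract:
    g(alpha) = 1 / ((1 - alpha) * ln(1 / (1 - alpha))), for alpha in (0, 1)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return 1.0 / ((1.0 - alpha) * math.log(1.0 / (1.0 - alpha)))

def max_features(d_s: int, alpha: float) -> float:
    """Upper bound d_S * g(alpha) on the number of features
    a student of width d_S can encode in superposition."""
    return d_s * g(alpha)

# Illustrative arithmetic (our numbers, not the paper's): at sparsity
# alpha = 0.99, g(0.99) = 1 / (0.01 * ln 100) ~ 21.7, so a width-512
# student could encode at most roughly 11,000 features.
print(max_features(512, 0.99))
```

Note that $g(\alpha)$ grows as $\alpha \to 1$: the sparser the features, the more of them a fixed-width student can pack in, which is why the bound depends on sparsity and not on width alone.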