Dynamic sparsity in tree-structured feed-forward layers at scale
📰 ArXiv cs.AI
arXiv:2604.08565v1 Announce Type: cross Abstract: At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer's compute budget, motivating sparse alternatives to the dense MLP. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree…
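The abstract truncates here, but the mechanism it names is concrete enough to sketch. Below is a minimal, hypothetical PyTorch illustration of a tree-structured feed-forward layer with hard hierarchical routing and no separate router network: each internal tree node doubles as a routing hyperplane, and each leaf holds a small MLP, so a token only touches one root-to-leaf path plus one leaf's parameters. All names here (`TreeFFN`, `d_leaf`, `node_w`, ...) are illustrative assumptions, not taken from the paper, and sign-based hard routing is just one plausible instantiation; training such hard decisions in practice typically requires a soft or straight-through relaxation, which this sketch omits.

```python
import torch
import torch.nn as nn


class TreeFFN(nn.Module):
    """Hypothetical sketch of a tree-structured feed-forward layer.

    A complete binary tree of depth `depth` in heap order: each internal
    node holds a hyperplane (w, b) whose sign routes a token hard-left or
    hard-right; each leaf holds a small two-layer MLP. Only one leaf's
    parameters are used per token (conditional computation), and routing
    reuses the tree's own node weights rather than a separate router.
    """

    def __init__(self, d_model: int, d_leaf: int, depth: int):
        super().__init__()
        self.depth = depth
        n_internal = 2 ** depth - 1
        n_leaves = 2 ** depth
        # One routing hyperplane per internal node.
        self.node_w = nn.Parameter(torch.randn(n_internal, d_model) / d_model ** 0.5)
        self.node_b = nn.Parameter(torch.zeros(n_internal))
        # One small two-layer MLP per leaf.
        self.w1 = nn.Parameter(torch.randn(n_leaves, d_model, d_leaf) / d_model ** 0.5)
        self.w2 = nn.Parameter(torch.randn(n_leaves, d_leaf, d_model) / d_leaf ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model); flatten any token dimensions before calling.
        idx = torch.zeros(x.shape[0], dtype=torch.long, device=x.device)
        for _ in range(self.depth):
            # Hard routing: go left (child 2i+1) or right (child 2i+2)
            # by the sign of w.x + b at the current node. Non-differentiable
            # as written; real training would need a relaxation (assumption).
            logits = (x * self.node_w[idx]).sum(-1) + self.node_b[idx]
            idx = 2 * idx + 1 + (logits > 0).long()
        leaf = idx - (2 ** self.depth - 1)  # heap index -> leaf index
        # Apply only each token's selected leaf MLP.
        h = torch.relu(torch.einsum("bd,bdk->bk", x, self.w1[leaf]))
        return torch.einsum("bk,bkd->bd", h, self.w2[leaf])
```

Under these assumptions, per-token compute is `depth` dot products for routing plus one leaf MLP, so cost grows with tree depth (logarithmic in the leaf count) rather than with the layer's total parameter count, which is the sense in which the tree is a sparse drop-in for a dense MLP block.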