How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

📰 ArXiv cs.AI

arXiv:2604.04385v1 Announce Type: cross

Abstract: We identify a recurring sparse routing mechanism in alignment-trained language models: a gate attention head reads the detected-content signal and triggers downstream amplifier heads that boost it toward refusal. Using political censorship and safety refusal as natural experiments, we trace this mechanism across 9 models from 6 labs, each validated on a corpus of 120 prompt pairs. The gate head passes necessity and sufficiency interchange tests (p < 0

Published 7 Apr 2026