Steered LLM Activations are Non-Surjective
📰 ArXiv cs.AI
arXiv:2604.09839v1 Announce Type: new Abstract: Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt. In this work, we
DeepCamp AI