Steered LLM Activations are Non-Surjective

📰 ArXiv cs.AI

arXiv:2604.09839v1 Announce Type: new Abstract: Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt. In this work, we

Published 14 Apr 2026

Read full paper → ← Back to Reads