The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams
📰 ArXiv cs.AI
LatentBiopsy detects harmful prompts by measuring angular deviation in an LLM's residual stream, with no training required.
Action Steps
- Compute the leading principal component of activations for 200 safe normative prompts at a target layer
- Characterise new prompts by their radial deviation angle from the reference direction
- Calculate the anomaly score as the negative log-likelihood of the deviation angle under the distribution of angles observed for the safe prompts
- Use the anomaly score to detect harmful prompts
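
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation. It assumes the residual-stream activations have already been extracted into an array of shape (n_prompts, d_model), that the principal component is taken after mean-centring, that the angle is measured on the raw activations, and that the likelihood comes from a Gaussian fitted to the safe prompts' angles. The function names `fit_reference`, `deviation_angle`, `gaussian_nll`, and `anomaly_score` are hypothetical.

```python
import numpy as np


def deviation_angle(acts, direction):
    """Angle in radians, in [0, pi], between each activation row and `direction`."""
    cos = acts @ direction / (np.linalg.norm(acts, axis=1) * np.linalg.norm(direction))
    return np.arccos(np.clip(cos, -1.0, 1.0))


def gaussian_nll(theta, mu, sigma):
    """Negative log-likelihood of theta under a Gaussian N(mu, sigma^2)."""
    return 0.5 * ((theta - mu) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi)


def fit_reference(safe_acts):
    """Fit the reference direction and angle statistics from safe prompts.

    safe_acts: (n_prompts, d_model) activations at the target layer.
    Returns (direction, mean angle, std of angle).
    """
    # Leading principal component = first right singular vector of the
    # mean-centred activations (centring is an assumption of this sketch).
    centred = safe_acts - safe_acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    direction = vt[0]
    # Angle statistics of the safe prompts define the "normal" range.
    theta = deviation_angle(safe_acts, direction)
    return direction, theta.mean(), theta.std() + 1e-8


def anomaly_score(acts, direction, mu, sigma):
    """Higher score = angle is less typical of the safe prompts."""
    return gaussian_nll(deviation_angle(acts, direction), mu, sigma)
```

A caller would run `fit_reference` once on the safe prompts' activations and then call `anomaly_score` on each new prompt. Because the sign of a principal component is arbitrary, it helps that a Gaussian score is symmetric: flipping the direction maps every angle θ to π − θ and the fitted mean along with it, so the score is unchanged. How to choose a decision threshold on the score is left open here.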
Who Needs to Know This
AI researchers and engineers working on LLMs can use this method to detect harmful prompts. Product managers can use it to improve the safety of AI-powered products.
Key Insight
💡 Angular deviation in residual streams can be used to detect harmful prompts in LLMs without training
DeepCamp AI