Why Do Large Language Models Generate Harmful Content?
📰 ArXiv cs.AI
arXiv:2604.11663v1 Announce Type: new Abstract: Large Language Models (LLMs) have been shown to generate harmful content. However, the underlying causes of such behavior remain underexplored. We propose a causal mediation analysis-based approach to identify the causal factors responsible for harmful generation. Our method performs a multi-granular analysis across model layers, modules (MLP and attention blocks), and individual neurons. Extensive experiments on state-of-the-art LLMs indicate tha…
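The core idea of neuron-level causal mediation analysis can be illustrated with a minimal toy sketch. This is not the paper's implementation: it uses a made-up two-layer NumPy model with a scalar "harm" score, and estimates each hidden neuron's indirect effect by patching its activation on a harmful input back to its value from a benign input (an activation-patching-style intervention).

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # toy stand-in for an MLP block
W2 = rng.normal(size=(3,))     # readout head producing a scalar "harm" score

def forward(x, patch=None):
    """Run the toy model; optionally fix one hidden neuron's activation."""
    h = np.tanh(x @ W1)        # hidden activations: the candidate mediators
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value         # intervention: overwrite neuron idx
    return float(h @ W2)

# Hypothetical feature vectors for a benign and a harmful prompt.
x_clean = np.array([0.1, -0.2, 0.3, 0.0])
x_harm  = np.array([1.0,  0.8, -0.5, 0.4])

h_clean = np.tanh(x_clean @ W1)

# Total effect: change in harm score when swapping the input.
total_effect = forward(x_harm) - forward(x_clean)

# Indirect effect of neuron i: run the harmful input, but patch neuron i
# back to its clean-run activation; the score change is attributed to i.
indirect = [forward(x_harm) - forward(x_harm, patch=(i, h_clean[i]))
            for i in range(3)]
```

Because the readout here is linear in the hidden activations, the per-neuron indirect effects sum exactly to the total effect; in a real LLM the decomposition is only approximate, and the same patching logic would be applied per layer and per module via forward hooks.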