Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies
📰 ArXiv cs.AI
arXiv:2604.09189v1 Announce Type: cross Abstract: LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model's self-stated safety rules via structured prompts, (2) formalizes them as ty
DeepCamp AI