SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

📰 ArXiv cs.AI

arXiv:2604.01473v2 Abstract: Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, and so either introduce substantial latency or suffer from randomness in text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection …
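The abstract is cut off before the method's details, but the title suggests the core idea: score a query from token-level logits rather than from sampled text, which removes generation randomness. A minimal sketch of that general idea, assuming a (hypothetical) grading prompt whose next-token logits for labels like `"safe"` and `"unsafe"` are available; the names and two-token setup are illustrative assumptions, not the paper's actual formulation:

```python
import math

def self_grade(label_logits: dict[str, float],
               unsafe_token: str = "unsafe",
               safe_token: str = "safe") -> float:
    """Turn two next-token logits into a deterministic jailbreak score.

    `label_logits` maps grading tokens to the model's logits at the
    position where it would answer a self-grading prompt. Restricting
    the softmax to the two labels yields a score in [0, 1] with no
    sampling involved, so the result is stable across runs.
    """
    lu = label_logits[unsafe_token]
    ls = label_logits[safe_token]
    m = max(lu, ls)                      # subtract max for numerical stability
    eu = math.exp(lu - m)
    es = math.exp(ls - m)
    return eu / (eu + es)                # P(unsafe | {safe, unsafe})

# Example: logits favoring "unsafe" produce a high score,
# and equal logits give exactly 0.5.
score = self_grade({"unsafe": 2.0, "safe": 0.0})
```

Because the score is a pure function of logits, a fixed threshold (e.g. flag when the score exceeds 0.5) gives the same verdict every time, unlike judging sampled text.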

Published 16 Apr 2026