Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models
📰 ArXiv cs.AI
arXiv:2604.10681v1 Announce Type: cross Abstract: Large Language Models (LLMs), despite their impressive capabilities across domains, have been shown to be vulnerable to backdoor attacks. Prior backdoor strategies predominantly operate at the token level, where an injected trigger causes the model to generate a specific target word, choice, or class (depending on the task). Recent advances, however, exploit the long-form reasoning tendencies of modern LLMs to conduct reasoning-level backdoors: o
DeepCamp AI