Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring

📰 ArXiv cs.AI

arXiv:2602.04930v2 Announce Type: replace-cross

Abstract: Future AI deployments will likely be monitored for malicious behaviour. The ability of these AIs to subvert monitors by adversarially selecting against them (attack selection) is particularly concerning. To study this, we let a red team create attack policies that attempt to insert attacks into code without being caught by a monitor in the concentrated BigCodeBench backdooring setting. We decompose attack selection into two problems: ma

Published 17 Apr 2026