Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring
📰 ArXiv cs.AI
arXiv:2602.04930v2. Abstract: Future AI deployments will likely be monitored for malicious behaviour. The ability of these AIs to subvert monitors by adversarially selecting against them (attack selection) is particularly concerning. To study this, we let a red team create attack policies that attempt to insert attacks into code without being caught by a monitor in the concentrated BigCodeBench backdooring setting. We decompose attack selection into two problems: ma