TraceGuard: Structured Multi-Dimensional Monitoring as a Collusion-Resistant Control Protocol
📰 ArXiv cs.AI
arXiv:2604.03968v1 Announce Type: cross Abstract: AI control protocols use monitors to detect attacks by untrusted AI agents, but standard single-score monitors face two limitations: they miss subtle attacks where outputs look clean but reasoning is off, and they collapse to near-zero safety when the monitor is the same model as the agent (collusion). We present TraceGuard, a structured multi-dimensional monitoring protocol that evaluates agent actions across five dimensions -- goal alignment, c
DeepCamp AI