Automated Attention Pattern Discovery at Scale in Large Language Models
arXiv cs.AI
arXiv:2604.03764v1 (Announce Type: cross)

Abstract: Large language models have succeeded by scaling up their capabilities to work in general settings. Unfortunately, the same cannot be said of interpretability methods. The current trend in mechanistic interpretability is to provide precise explanations of specific behaviors in controlled settings. These explanations often do not generalize, or they are too resource-intensive for larger studies. In this work we propose to study repeated behaviors in large language models…