| Title | Authors | Venue | Cited by | Year |
|---|---|---|---|---|
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small | K Wang, A Variengien, A Conmy, B Shlegeris, J Steinhardt | ICLR 2023 | 238 | 2022 |
| Towards Automated Circuit Discovery for Mechanistic Interpretability | A Conmy, AN Mavor-Parker, A Lynch, S Heimersheim, A Garriga-Alonso | NeurIPS 2023 (Spotlight) | 116 | 2023 |
| Stealing Part of a Production Language Model | N Carlini, D Paleka, KD Dvijotham, T Steinke, J Hayase, AF Cooper, ... | ICML 2024 (Oral) | 19 | 2024 |
| Attribution Patching Outperforms Automated Circuit Discovery | A Syed, C Rager, A Conmy | NeurIPS 2023 Workshop on Attributing Model Behavior at Scale | 16 | 2023 |
| Copy Suppression: Comprehensively Understanding an Attention Head | C McDougall, A Conmy, C Rushing, T McGrath, N Nanda | NeurIPS 2023 Workshop on Attributing Model Behavior at Scale | 16 | 2023 |
| Successor Heads: Recurring, Interpretable Attention Heads In The Wild | R Gould, E Ong, G Ogden, A Conmy | ICLR 2024 | 11 | 2023 |
| Interpreting Attention Layer Outputs with Sparse Autoencoders | C Kissane, R Krzyzanowski, JI Bloom, A Conmy, N Nanda | ICML 2024 Mechanistic Interpretability Workshop (Spotlight) | 9* | 2024 |
| Improving Dictionary Learning with Gated Sparse Autoencoders | S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ... | ICML 2024 Mechanistic Interpretability Workshop | 8* | 2024 |
| StyleGAN-induced Data-Driven Regularization for Inverse Problems | A Conmy, S Mukherjee, CB Schönlieb | IEEE ICASSP 2022 | 5 | 2022 |
| Activation Steering with SAEs | A Conmy, N Nanda | www.alignmentforum.org/posts/C5KAZQib3bzzpeyrg | | 2024 |