Arthur Conmy
Google DeepMind
Verified email at google.com
Title
Cited by
Year
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
K Wang, A Variengien, A Conmy, B Shlegeris, J Steinhardt
ICLR 2023, 2022
Cited by: 238; Year: 2022
Towards Automated Circuit Discovery for Mechanistic Interpretability
A Conmy, AN Mavor-Parker, A Lynch, S Heimersheim, A Garriga-Alonso
NeurIPS 2023 Spotlight, 2023
Cited by: 116; Year: 2023
Stealing Part of a Production Language Model
N Carlini, D Paleka, KD Dvijotham, T Steinke, J Hayase, AF Cooper, ...
ICML 2024 Oral, 2024
Cited by: 19; Year: 2024
Attribution Patching Outperforms Automated Circuit Discovery
A Syed, C Rager, A Conmy
NeurIPS 2023 Workshop (Attributing Model Behavior at Scale), 2023
Cited by: 16; Year: 2023
Copy Suppression: Comprehensively Understanding an Attention Head
C McDougall, A Conmy, C Rushing, T McGrath, N Nanda
NeurIPS 2023 Workshop (Attributing Model Behavior at Scale), 2023
Cited by: 16; Year: 2023
Successor Heads: Recurring, Interpretable Attention Heads In The Wild
R Gould, E Ong, G Ogden, A Conmy
ICLR 2024, 2023
Cited by: 11; Year: 2023
Interpreting Attention Layer Outputs with Sparse Autoencoders
C Kissane, R Krzyzanowski, JI Bloom, A Conmy, N Nanda
ICML 2024 Mechanistic Interpretability Workshop Spotlight, 2024
Cited by: 9*; Year: 2024
Improving Dictionary Learning with Gated Sparse Autoencoders
S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ...
ICML 2024 Mechanistic Interpretability Workshop, 2024
Cited by: 8*; Year: 2024
StyleGAN-induced Data-Driven Regularization for Inverse Problems
A Conmy, S Mukherjee, CB Schönlieb
IEEE ICASSP 2022, 2022
Cited by: 5; Year: 2022
Activation Steering with SAEs
A Conmy, N Nanda
www.alignmentforum.org/posts/C5KAZQib3bzzpeyrg, 2024
Year: 2024