Title | Authors | Venue | Cited by | Year |
Taking features out of superposition with sparse autoencoders | L Sharkey, D Braun, B Millidge | AI Alignment Forum 6, 12-13, 2022 | 19* | 2022 |
Interpreting neural networks through the polytope lens | S Black, L Sharkey, L Grinsztajn, E Winsor, D Braun, J Merizian, K Parker, ... | arXiv preprint arXiv:2211.12312, 2022 | 18 | 2022 |
A Causal Framework for AI Regulation and Auditing | L Sharkey, CN Ghuidhir, D Braun, J Scheurer, M Balesni, L Bushnaq, ... | Preprints, 2024 | 14* | 2024 |
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning | D Braun, J Taylor, N Goldowsky-Dill, L Sharkey | arXiv preprint arXiv:2405.12241, 2024 | 12 | 2024 |
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability | L Bushnaq, J Mendel, S Heimersheim, D Braun, N Goldowsky-Dill, ... | arXiv preprint arXiv:2405.10927, 2024 | 5 | 2024 |
Towards evaluations-based safety cases for AI scheming | M Balesni, M Hobbhahn, D Lindner, A Meinke, T Korbak, J Clymer, ... | arXiv preprint arXiv:2411.03336, 2024 | 2 | 2024 |
The local interaction basis: Identifying computationally-relevant and sparsely interacting features in neural networks | L Bushnaq, S Heimersheim, N Goldowsky-Dill, D Braun, J Mendel, ... | arXiv preprint arXiv:2405.10928, 2024 | 2 | 2024 |
Construction and Elicitation of a Black Box Model in the Game of Bridge | V Ventos, D Braun, C Deheeger, JP Desmoulins, JB Fantun, S Legras, ... | Advances in Knowledge Discovery and Management: Volume 10, 29-53, 2024 | | 2024 |