Progress measures for grokking via mechanistic interpretability N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt arXiv preprint arXiv:2301.05217, 2023 | 206 | 2023 |
Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla T Lieberum, M Rahtz, J Kramár, G Irving, R Shah, V Mikulik arXiv preprint arXiv:2307.09458, 2023 | 38 | 2023 |
AtP*: An efficient and scalable method for localizing LLM behaviour to components J Kramár, T Lieberum, R Shah, N Nanda arXiv preprint arXiv:2403.00745, 2024 | 8 | 2024 |
Retrospective on the 2021 MineRL BASALT competition on learning from human feedback R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ... NeurIPS 2021 Competitions and Demonstrations Track, 259-272, 2022 | 8 | 2022 |
Retrospective on the 2021 BASALT Competition on Learning from Human Feedback R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ... arXiv preprint arXiv:2204.07123, 2022 | 1 | 2022 |
Replication: Fairness without demographics through Adversarially Reweighted Learning E Jenner, T Lieberum, FP Nolte, N Rutsch | | |