Language models are few-shot learners T Brown, B Mann, N Ryder, M Subbiah, JD Kaplan, P Dhariwal, ... Advances in neural information processing systems 33, 1877-1901, 2020 | 34743* | 2020 |
Learning transferable visual models from natural language supervision A Radford, JW Kim, C Hallacy, A Ramesh, G Goh, S Agarwal, G Sastry, ... International conference on machine learning, 8748-8763, 2021 | 18467 | 2021 |
Training language models to follow instructions with human feedback L Ouyang, J Wu, X Jiang, D Almeida, C Wainwright, P Mishkin, C Zhang, ... Advances in neural information processing systems 35, 27730-27744, 2022 | 7518 | 2022 |
Training a helpful and harmless assistant with reinforcement learning from human feedback Y Bai, A Jones, K Ndousse, A Askell, A Chen, N DasSarma, D Drain, ... arXiv preprint arXiv:2204.05862, 2022 | 934 | 2022 |
Beyond the imitation game: Quantifying and extrapolating the capabilities of language models A Srivastava, A Rastogi, A Rao, AAM Shoeb, A Abid, A Fisch, AR Brown, ... arXiv preprint arXiv:2206.04615, 2022 | 859 | 2022 |
Constitutional ai: Harmlessness from ai feedback Y Bai, S Kadavath, S Kundu, A Askell, J Kernion, A Jones, A Chen, ... arXiv preprint arXiv:2212.08073, 2022 | 746 | 2022 |
Release strategies and the social impacts of language models I Solaiman, M Brundage, J Clark, A Askell, A Herbert-Voss, J Wu, ... arXiv preprint arXiv:1908.09203, 2019 | 446 | 2019 |
Toward trustworthy AI development: mechanisms for supporting verifiable claims M Brundage, S Avin, J Wang, H Belfield, G Krueger, G Hadfield, H Khlaaf, ... arXiv preprint arXiv:2004.07213, 2020 | 352 | 2020 |
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned D Ganguli, L Lovitt, J Kernion, A Askell, Y Bai, S Kadavath, B Mann, ... arXiv preprint arXiv:2209.07858, 2022 | 289 | 2022 |
A general language assistant as a laboratory for alignment A Askell, Y Bai, A Chen, D Drain, D Ganguli, T Henighan, A Jones, ... arXiv preprint arXiv:2112.00861, 2021 | 276 | 2021 |
In-context learning and induction heads C Olsson, N Elhage, N Nanda, N Joseph, N DasSarma, T Henighan, ... arXiv preprint arXiv:2209.11895, 2022 | 225 | 2022 |
Predictability and surprise in large generative models D Ganguli, D Hernandez, L Lovitt, A Askell, Y Bai, A Chen, T Conerly, ... Proceedings of the 2022 ACM Conference on Fairness, Accountability, and …, 2022 | 219 | 2022 |
Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022. Language models (mostly) know what they know S Kadavath, T Conerly, A Askell, T Henighan arXiv preprint arXiv:2207.05221, 2022 | 195 | 2022 |
A mathematical framework for transformer circuits N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, ... Transformer Circuits Thread 1 (1), 12, 2021 | 192 | 2021 |
Training language models to follow instructions with human feedback, 2022 L Ouyang, J Wu, X Jiang, D Almeida, CL Wainwright, P Mishkin, C Zhang, ... URL https://arxiv. org/abs/2203.02155 13, 1, 2022 | 180 | 2022 |
Language Models are Few-Shot Learners. 2020. doi: 10.48550 TB Brown, B Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, ... arxiv, 5-7, 2005 | 171 | 2005 |
Discovering language model behaviors with model-written evaluations E Perez, S Ringer, K Lukošiūtė, K Nguyen, E Chen, S Heiner, C Pettit, ... arXiv preprint arXiv:2212.09251, 2022 | 161 | 2022 |
Dawn Drain N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, ... Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson …, 2021 | 140 | 2021 |
Dawn Drain C Olsson, N Elhage, NJ Neel Nanda, N DasSarma, T Henighan, B Mann, ... Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy …, 2022 | 128 | 2022 |
The capacity for moral self-correction in large language models D Ganguli, A Askell, N Schiefer, TI Liao, K Lukošiūtė, A Chen, A Goldie, ... arXiv preprint arXiv:2302.07459, 2023 | 112 | 2023 |