The malicious use of artificial intelligence: Forecasting, prevention, and mitigation M Brundage, S Avin, J Clark, H Toner, P Eckersley, B Garfinkel, A Dafoe, ... arXiv preprint arXiv:1802.07228, 2018 | 1153* | 2018 |
When will AI exceed human performance? Evidence from AI experts K Grace, J Salvatier, A Dafoe, B Zhang, O Evans Journal of Artificial Intelligence Research 62, 729-754, 2018 | 1072* | 2018 |
TruthfulQA: Measuring how models mimic human falsehoods S Lin, J Hilton, O Evans arXiv preprint arXiv:2109.07958, 2021 | 907 | 2021 |
Beyond the imitation game: Quantifying and extrapolating the capabilities of language models A Srivastava, A Rastogi, A Rao, AAM Shoeb, A Abid, A Fisch, AR Brown, ... arXiv preprint arXiv:2206.04615, 2022 | 887 | 2022 |
Trial without error: Towards safe reinforcement learning via human intervention W Saunders, G Sastry, A Stuhlmüller, O Evans arXiv preprint arXiv:1707.05173, 2017 | 302 | 2017 |
Help or hinder: Bayesian models of social goal inference T Ullman, C Baker, O Macindoe, O Evans, N Goodman, J Tenenbaum Advances in neural information processing systems 22, 2009 | 222 | 2009 |
Teaching models to express their uncertainty in words S Lin, J Hilton, O Evans arXiv preprint arXiv:2205.14334, 2022 | 177 | 2022 |
Learning the Preferences of Ignorant, Inconsistent Agents O Evans, A Stuhlmüller, ND Goodman Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-2016), 2016 | 137 | 2016 |
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" L Berglund, M Tong, M Kaufmann, M Balesni, AC Stickland, T Korbak, ... arXiv preprint arXiv:2309.12288, 2023 | 120* | 2023 |
Truthful AI: Developing and governing AI that does not lie O Evans, O Cotton-Barratt, L Finnveden, A Bales, A Balwit, P Wills, ... arXiv preprint arXiv:2110.06674, 2021 | 84 | 2021 |
Agent-Agnostic Human-in-the-Loop Reinforcement Learning D Abel, J Salvatier, A Stuhlmüller, O Evans arXiv:1701.0407, 2017 | 79 | 2017 |
AI progress measurement P Eckersley, Y Nasser, Y Bayle, O Evans, G Gebhart, D Schwenk Electronic Frontier Foundation, 2017 | 51* | 2017 |
Constructing and adjusting estimates for household transmission of SARS-CoV-2 from prior studies, widespread-testing and contact-tracing data M Curmei, A Ilyas, O Evans, J Steinhardt International Journal of Epidemiology 50 (5), 1444-1457, 2021 | 40* | 2021 |
Active Reinforcement Learning: Observing Rewards at a Cost D Krueger, J Leike, O Evans, J Salvatier NIPS 2016 Workshop, 2016 | 36* | 2016 |
Learning the Preferences of Bounded Agents O Evans, A Stuhlmüller, ND Goodman Advances in Neural Information Processing Systems (Bounded Optimality Workshop), 2015 | 36 | 2015 |
Taken out of context: On measuring situational awareness in LLMs L Berglund, AC Stickland, M Balesni, M Kaufmann, M Tong, T Korbak, ... arXiv preprint arXiv:2309.00667, 2023 | 28* | 2023 |
Modeling Agents with Probabilistic Programs O Evans, A Stuhlmüller, J Salvatier, D Filan agentmodels.org, 2017 | 28* | 2017 |
How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions L Pacchiardi, AJ Chan, S Mindermann, I Moscovitz, AY Pan, Y Gal, ... arXiv preprint arXiv:2309.15840, 2023 | 23 | 2023 |
Learning structured preferences O Evans, L Bergen, JB Tenenbaum Proceedings of the 32nd annual conference of the cognitive science society, 2010 | 21* | 2010 |
Modelling the Health and Economic Impacts of Population-Wide Testing, Contact Tracing and Isolation (PTTI) Strategies for COVID-19 in the UK T Colbourn, W Waites, J Panovska-Griffiths, D Manheim, S Sturniolo, ... SSRN, 2020 | 17* | 2020 |