| Title | Authors | Venue | Cited by | Year |
|---|---|---|---|---|
| Open problems and fundamental limitations of reinforcement learning from human feedback | S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... | arXiv preprint arXiv:2307.15217, 2023 | 259 | 2023 |
| Pretraining language models with human preferences | T Korbak, K Shi, A Chen, RV Bhalerao, C Buckley, J Phang, SR Bowman, ... | International Conference on Machine Learning, 17506–17533, 2023 | 133 | 2023 |
| The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" | L Berglund, M Tong, M Kaufmann, M Balesni, AC Stickland, T Korbak, ... | arXiv preprint arXiv:2309.12288, 2023 | 126* | 2023 |
| Inverse scaling: When bigger isn't better | IR McKenzie, A Lyzhov, M Pieler, A Parrish, A Mueller, A Prabhu, ... | arXiv preprint arXiv:2306.09479, 2023 | 86* | 2023 |
| Towards understanding sycophancy in language models | M Sharma, M Tong, T Korbak, D Duvenaud, A Askell, SR Bowman, ... | arXiv preprint arXiv:2310.13548, 2023 | 72 | 2023 |
| Training language models with language feedback at scale | J Scheurer, JA Campos, T Korbak, JS Chan, A Chen, K Cho, E Perez | arXiv preprint arXiv:2303.16755, 2023 | 72 | 2023 |
| Improving code generation by training with natural language feedback | A Chen, J Scheurer, T Korbak, JA Campos, JS Chan, SR Bowman, K Cho, ... | arXiv preprint arXiv:2303.16749, 2023 | 42 | 2023 |
| Aligning language models with preferences through f-divergence minimization | D Go, T Korbak, G Kruszewski, J Rozen, N Ryu, M Dymetman | arXiv preprint arXiv:2302.08215, 2023 | 42 | 2023 |
| RL with KL penalties is better viewed as Bayesian inference | T Korbak, E Perez, CL Buckley | arXiv preprint arXiv:2205.11275, 2022 | 39* | 2022 |
| Foundational challenges in assuring alignment and safety of large language models | U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ... | arXiv preprint arXiv:2404.09932, 2024 | 34 | 2024 |
| On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting | T Korbak, H Elsahar, G Kruszewski, M Dymetman | Advances in Neural Information Processing Systems 35, 16203–16220, 2022 | 32 | 2022 |
| Computational enactivism under the free energy principle | T Korbak | Synthese 198 (3), 2743–2763, 2021 | 31 | 2021 |
| Taken out of context: On measuring situational awareness in LLMs | L Berglund, AC Stickland, M Balesni, M Kaufmann, M Tong, T Korbak, ... | arXiv preprint arXiv:2309.00667, 2023 | 30* | 2023 |
| Controlling conditional language models without catastrophic forgetting | T Korbak, H Elsahar, G Kruszewski, M Dymetman | International Conference on Machine Learning, 11499–11528, 2022 | 26 | 2022 |
| Many-shot jailbreaking | C Anil, E Durmus, M Sharma, J Benton, S Kundu, J Batson, N Rimsky, ... | Anthropic, April 2024 | 20* | 2024 |
| Interaction history as a source of compositionality in emergent communication | T Korbak, J Zubek, Ł Kuciński, P Miłoś, J Rączaszek-Leonardi | Interaction Studies 22 (2), 212–243, 2021 | 17* | 2021 |
| Catalytic role of noise and necessity of inductive biases in the emergence of compositional communication | Ł Kuciński, T Korbak, P Kołodziej, P Miłoś | Advances in Neural Information Processing Systems 34, 23075–23088, 2021 | 14 | 2021 |
| Scaffolded minds and the evolution of content in signaling pathways | T Korbak | Studies in Logic, Grammar and Rhetoric 41 (1), 89–103, 2015 | 10 | 2015 |
| Measuring non-trivial compositionality in emergent communication | T Korbak, J Zubek, J Rączaszek-Leonardi | arXiv preprint arXiv:2010.15058, 2020 | 9 | 2020 |
| Energy-based models for code generation under compilability constraints | T Korbak, H Elsahar, M Dymetman, G Kruszewski | arXiv preprint arXiv:2106.04985, 2021 | 8 | 2021 |