The trickle-down impact of reward (in-)consistency on RLHF

L Shen, W Tan, S Chen, Y Chen, J Zhang, H Xu… - arXiv preprint arXiv …, 2024 - arxiv.org

As the influence of large language models (LLMs) spans across global communities, their
safety challenges in multilingual settings become paramount for alignment research. This …

Cited by 23 Related articles All 2 versions

The alignment ceiling: Objective mismatch in reinforcement learning from human feedback

N Lambert, R Calandra - arXiv preprint arXiv:2311.00168, 2023 - arxiv.org

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique
to make large language models (LLMs) more capable in complex settings. RLHF proceeds …

Cited by 9 Related articles All 2 versions

Entangled preferences: The history and risks of reinforcement learning and human feedback

N Lambert, TK Gilbert, T Zick - arXiv preprint arXiv:2310.13595, 2023 - arxiv.org

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique
to make large language models (LLMs) easier to use and more effective. A core piece of the …

Cited by 7 Related articles

Transforming and Combining Rewards for Aligning Large Language Models

Z Wang, C Nagpal, J Berant, J Eisenstein… - arXiv preprint arXiv …, 2024 - arxiv.org

A common approach for aligning language models to human preferences is to first learn a
reward model from preference data, and then use this reward model to update the language …

Cited by 3 Related articles All 3 versions

It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF

T Lu, L Shen, X Yang, W Tan, B Chen… - arXiv preprint arXiv …, 2024 - arxiv.org

Reinforcement Learning from Human Feedback (RLHF) involves training policy models
(PMs) and reward models (RMs) to align language models with human preferences. Instead …

[PDF] Research Agenda for Sociotechnical Approaches to AI Safety

S Curtis, R Iyer, CD Kirk-Giannini, V Krakovna… - ai.objectives.institute

As the capabilities of AI systems continue to advance, it is increasingly important that we
guide the development of these powerful technologies, ensuring they are used for the …