The language barrier: Dissecting safety challenges of LLMs in multilingual contexts

L Shen, W Tan, S Chen, Y Chen, J Zhang, H Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
As the influence of large language models (LLMs) spans across global communities, their
safety challenges in multilingual settings become paramount for alignment research. This …

The alignment ceiling: Objective mismatch in reinforcement learning from human feedback

N Lambert, R Calandra - arXiv preprint arXiv:2311.00168, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique
to make large language models (LLMs) more capable in complex settings. RLHF proceeds …

Entangled preferences: The history and risks of reinforcement learning and human feedback

N Lambert, TK Gilbert, T Zick - arXiv preprint arXiv:2310.13595, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique
to make large language models (LLMs) easier to use and more effective. A core piece of the …

Transforming and Combining Rewards for Aligning Large Language Models

Z Wang, C Nagpal, J Berant, J Eisenstein… - arXiv preprint arXiv …, 2024 - arxiv.org
A common approach for aligning language models to human preferences is to first learn a
reward model from preference data, and then use this reward model to update the language …

It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF

T Lu, L Shen, X Yang, W Tan, B Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) involves training policy models
(PMs) and reward models (RMs) to align language models with human preferences. Instead …

Research Agenda for Sociotechnical Approaches to AI Safety

S Curtis, R Iyer, CD Kirk-Giannini, V Krakovna… - ai.objectives.institute
As the capabilities of AI systems continue to advance, it is increasingly important that we
guide the development of these powerful technologies, ensuring they are used for the …