SimPO: Simple preference optimization with a reference-free reward

Y Meng, M Xia, D Chen - arXiv preprint arXiv:2405.14734, 2024 - arxiv.org
Direct Preference Optimization (DPO) is a widely used offline preference optimization
algorithm that reparameterizes reward functions in reinforcement learning from human …
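
For orientation, a brief sketch of the objectives these two methods are commonly stated with (notation assumed here, not taken from the snippet above: \(\pi_\theta\) the policy being tuned, \(\pi_{\mathrm{ref}}\) a frozen reference model, \((x, y_w, y_l)\) a prompt with preferred and dispreferred responses, \(\beta\) a scaling hyperparameter, \(\gamma\) SimPO's target margin). DPO anchors its implicit reward to the reference model, whereas SimPO uses a length-normalized, reference-free reward:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\Big[\log\sigma\Big(\beta\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)\Big]

\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\Big[\log\sigma\Big(\tfrac{\beta}{|y_w|}\log\pi_\theta(y_w\mid x) - \tfrac{\beta}{|y_l|}\log\pi_\theta(y_l\mid x) - \gamma\Big)\Big]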

Self-exploring language models: Active preference elicitation for online alignment

S Zhang, D Yu, H Sharma, H Zhong, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference optimization, particularly through Reinforcement Learning from Human
Feedback (RLHF), has achieved significant success in aligning Large Language Models …

Alignment of diffusion models: Fundamentals, challenges, and future

B Liu, S Shao, B Li, L Bai, Z Xu, H Xiong, J Kwok… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion models have emerged as the leading paradigm in generative modeling, excelling
in various applications. Despite their success, these models often misalign with human …

Scaling laws for reward model overoptimization in direct alignment algorithms

R Rafailov, Y Chittepu, R Park, H Sikchi… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent
success of Large Language Models (LLMs); however, it is often a complex and brittle …

Preference tuning with human feedback on language, speech, and vision tasks: A survey

GI Winata, H Zhao, A Das, W Tang, DD Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference tuning is a crucial process for aligning deep generative models with human
preferences. This survey offers a thorough overview of recent advancements in preference …

Inverse-RLignment: Inverse Reinforcement Learning from Demonstrations for LLM Alignment

H Sun, M van der Schaar - arXiv preprint arXiv:2405.15624, 2024 - arxiv.org
Aligning Large Language Models (LLMs) is crucial for enhancing their safety and utility.
However, existing methods, primarily based on preference datasets, face challenges such …

The importance of online data: Understanding preference fine-tuning via coverage

Y Song, G Swamy, A Singh, D Bagnell… - The Thirty-eighth Annual …, 2024 - openreview.net
Learning from human preference data has emerged as the dominant paradigm for fine-
tuning large language models (LLMs). The two most common families of techniques--online …

Optimal Design for Reward Modeling in RLHF

A Scheid, E Boursier, A Durmus, MI Jordan… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) has become a popular approach to
align language models (LMs) with human preferences. This method involves collecting a …

MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization

Y Lyu, L Yan, Z Wang, D Yin, P Ren, M de Rijke… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) are rapidly advancing and achieving near-human
capabilities, aligning them with human values is becoming more urgent. In scenarios where …

Sample-Efficient Alignment for LLMs

Z Liu, C Chen, C Du, WS Lee, M Lin - arXiv preprint arXiv:2411.01493, 2024 - arxiv.org
We study methods for efficiently aligning large language models (LLMs) with human
preferences given budgeted online feedback. We first formulate the LLM alignment problem …