Anchored preference optimization and contrastive revisions: Addressing underspecification in alignment

K D'Oosterlinck, W Xu, C Develder… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are often aligned using contrastive alignment objectives
and preference pair datasets. The interaction between model, paired data, and objective …

Reinforcement Learning Enhanced LLMs: A Survey

S Wang, S Zhang, J Zhang, R Hu, X Li, T Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper surveys research in the rapidly growing field of enhancing large language
models (LLMs) with reinforcement learning (RL), a technique that enables LLMs to improve …

Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking

P Rashidinejad, Y Tian - arXiv preprint arXiv:2412.09544, 2024 - arxiv.org
Aligning AI systems with human preferences typically suffers from the infamous reward
hacking problem, where optimization of an imperfect reward model leads to undesired …

α-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

J Wu, X Wang, Z Yang, J Wu, J Gao, B Ding… - arXiv preprint arXiv …, 2024 - arxiv.org
Aligning large language models (LLMs) with human values and intentions is crucial for their
utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a …
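
For context, the adaptive reward margin named in the title modifies the standard DPO objective, where a response's implicit reward is the beta-scaled log-probability ratio between the policy and a frozen reference model. Below is a minimal PyTorch sketch of a DPO loss with a margin term; the generic `margin` argument is a placeholder assumption, not the paper's adaptive per-pair margin.

import torch.nn.functional as F

def dpo_margin_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    beta=0.1, margin=0.0):
    # Implicit rewards: beta-scaled log-prob ratios vs. the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # margin=0 recovers vanilla DPO; a positive margin demands a larger reward
    # gap between the preferred and dispreferred responses before the loss saturates.
    logits = chosen_rewards - rejected_rewards - margin
    return -F.logsigmoid(logits).mean()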

REAL: Response Embedding-based Alignment for LLMs

H Zhang, X Zhao, I Molybog, J Zhang - arXiv preprint arXiv:2409.17169, 2024 - arxiv.org
Aligning large language models (LLMs) to human preferences is a crucial step in building
helpful and safe AI tools, which usually involves training on supervised datasets. Popular …

Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization

Z Chen, F Liu, J Zhu, W Du, Y Qi - arXiv preprint arXiv:2411.05875, 2024 - arxiv.org
Direct Preference Optimization (DPO) and its variants have become the de facto standards
for aligning large language models (LLMs) with human preferences or specific goals …

SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks

F Christopoulou, R Cardenas, G Lampouras… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference Optimization (PO) has proven to be an effective step for aligning language models to
human-desired behaviors. Current variants, following the offline Direct Preference …
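
As a rough illustration of the token-masking idea named in the title (a sketch under assumptions, not the paper's learned-mask method): a per-token mask can reweight which response tokens contribute to the sequence-level log-probabilities fed into a DPO-style preference loss. The mask here is supplied externally for illustration.

import torch

def masked_sequence_logps(token_logps, mask):
    # token_logps: (batch, seq_len) per-token log-probabilities of a response.
    # mask: (batch, seq_len) weights in [0, 1]; 1 keeps a token, 0 drops it.
    # Only the masked tokens contribute to the sequence-level score that a
    # DPO-style objective compares between chosen and rejected responses.
    return (token_logps * mask).sum(dim=-1)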