Anchored preference optimization and contrastive revisions: Addressing underspecification in alignment
Large Language Models (LLMs) are often aligned using contrastive alignment objectives
and preference pair datasets. The interaction between model, paired data, and objective …
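For context, the contrastive objective this snippet refers to is, in most of the works listed here, some form of the DPO loss over chosen/rejected log-probability ratios. A minimal PyTorch-style sketch of that standard form follows (this is the generic objective, not the anchored variants the paper itself proposes; the log-probabilities are assumed to be pre-computed sums over response tokens):

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Implicit rewards: beta-scaled log-ratios of policy to reference,
        # one scalar per response in the batch, shape (batch,).
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Bradley-Terry style contrastive term over each preference pair.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()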
Reinforcement Learning Enhanced LLMs: A Survey
This paper surveys research in the rapidly growing field of enhancing large language
models (LLMs) with reinforcement learning (RL), a technique that enables LLMs to improve …
Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking
P Rashidinejad, Y Tian - arXiv preprint arXiv:2412.09544, 2024 - arxiv.org
Aligning AI systems with human preferences typically suffers from the infamous reward
hacking problem, where optimization of an imperfect reward model leads to undesired …
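The objective being over-optimized in the reward-hacking setting is usually the KL-regularized reward maximization used in RLHF. A minimal sketch of the per-sample scalar that policy-gradient methods push up (generic context only, not the paper's robust-reward construction; the sequence-level log-ratio stands in for the KL penalty):

    def kl_regularized_reward(reward, policy_logps, ref_logps, beta=0.05):
        # reward: reward-model score per response; policy_logps / ref_logps:
        # summed response log-probs under the current policy and the frozen
        # reference. When the reward model is imperfect, maximizing this
        # quantity too aggressively can exploit its errors (reward hacking)
        # unless the reward or labels are made more robust.
        return reward - beta * (policy_logps - ref_logps)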
α-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs
Aligning large language models (LLMs) with human values and intentions is crucial for their
utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a …
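The reward-margin idea in this line of work can be pictured as an offset inside the standard DPO sigmoid. A fixed-margin sketch is below; the paper's contribution is making this margin adaptive per pair, which is not reproduced here:

    import torch
    import torch.nn.functional as F

    def margin_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps,
                        beta=0.1, margin=0.5):
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # A positive margin demands that the chosen response beat the rejected
        # one by at least `margin` in implicit reward before the loss saturates.
        return -F.logsigmoid(chosen_rewards - rejected_rewards - margin).mean()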
REAL: Response Embedding-based Alignment for LLMs
Aligning large language models (LLMs) to human preferences is a crucial step in building
helpful and safe AI tools, which usually involves training on supervised datasets. Popular …
Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization
Direct Preference Optimization (DPO) and its variants have become the de facto standards
for aligning large language models (LLMs) with human preferences or specific goals …
SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks
Preference Optimization (PO) has proven an effective step for aligning language models to
human-desired behaviors. Current variants, following the offline Direct Preference …
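A generic way to picture token-level masking in a DPO-style objective: weight each token's log-probability by a mask in [0, 1] before summing, so only the selected tokens drive the preference signal. A minimal sketch under that assumption follows; the masks here are arbitrary inputs, and how SparsePO actually learns its masks is not described in the snippet:

    import torch
    import torch.nn.functional as F

    def masked_logps(token_logps, mask):
        # Sum per-token log-probs, weighted by a (batch, seq_len) mask in [0, 1].
        return (token_logps * mask).sum(dim=-1)

    def masked_dpo_loss(chosen_token_logps, rejected_token_logps,
                        ref_chosen_token_logps, ref_rejected_token_logps,
                        chosen_mask, rejected_mask, beta=0.1):
        # Only masked tokens contribute to the implicit reward ratios.
        chosen = beta * (masked_logps(chosen_token_logps, chosen_mask)
                         - masked_logps(ref_chosen_token_logps, chosen_mask))
        rejected = beta * (masked_logps(rejected_token_logps, rejected_mask)
                           - masked_logps(ref_rejected_token_logps, rejected_mask))
        return -F.logsigmoid(chosen - rejected).mean()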