DPO Meets PPO: Reinforced Token Optimization for RLHF

H Zhong, G Feng, W Xiong, L Zhao, D He… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
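
For context on the objective this entry contrasts with PPO's sparse, sentence-level reward, below is a minimal sketch of the standard DPO loss for one preference pair (not this paper's token-level method; the function and variable names are illustrative assumptions):

```python
# Minimal sketch of the standard DPO loss for a single preference pair.
# Not the paper's reinforced token optimization; names are illustrative.
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """logp_* are summed token log-probabilities of the full response under
    the policy; ref_logp_* are the same quantities under the frozen reference
    model. beta scales the implicit reward."""
    # Implicit reward of each response: beta * log-ratio to the reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the reward margin (Bradley-Terry preference model).
    margin = chosen_reward - rejected_reward
    return np.log1p(np.exp(-margin))

# Example: chosen response is more likely under the policy than the reference,
# the rejected one less likely, so the loss is small.
print(dpo_loss(-42.0, -55.0, -43.5, -54.0))
```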

The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of …

HR Kirk, A Whitefield, P Röttger, A Bean… - arXiv preprint arXiv …, 2024 - arxiv.org
Human feedback plays a central role in the alignment of Large Language Models (LLMs).
However, open questions remain about the methods (how), domains (where), people (who) …

Scaling laws for reward model overoptimization in direct alignment algorithms

R Rafailov, Y Chittepu, R Park, H Sikchi… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent
success of Large Language Models (LLMs); however, it is often a complex and brittle …

Learn your reference model for real good alignment

A Gorbatovski, B Shaposhnikov, A Malakhov… - arXiv preprint arXiv …, 2024 - arxiv.org
The complexity of the alignment problem stems from the fact that existing methods are
unstable. Researchers continuously invent various tricks to address this shortcoming. For …

Training Language Models to Self-Correct via Reinforcement Learning

A Kumar, V Zhuang, R Agarwal, Y Su… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-correction is a highly desirable capability of large language models (LLMs), yet it has
consistently been found to be largely ineffective in modern LLMs. Existing approaches for …

Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey

GI Winata, H Zhao, A Das, W Tang, DD Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference tuning is a crucial process for aligning deep generative models with human
preferences. This survey offers a thorough overview of recent advancements in preference …

Stepwise Alignment for Constrained Language Model Policy Optimization

A Wachi, TQ Tran, R Sato, T Tanabe… - arXiv preprint arXiv …, 2024 - arxiv.org
Safety and trustworthiness are indispensable requirements for applying AI systems based
on large language models (LLMs) in real-world applications. This paper formulates a human …

Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models

J Chen, JS Byun, M Elsner, A Perrault - arXiv preprint arXiv:2405.14632, 2024 - arxiv.org
Recent advancements in generative models have sparked significant interest within the
machine learning community. Particularly, diffusion models have demonstrated remarkable …

Sample complexity of variance-reduced policy gradient: weaker assumptions and lower bounds

G Paczolay, M Papini, AM Metelli, I Harmati… - Machine Learning, 2024 - Springer
Several variance-reduced versions of REINFORCE based on importance sampling achieve
an improved O(ϵ⁻³) sample complexity to find an ϵ-stationary point, under an unrealistic …
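
For reference, the estimator these variance-reduced, importance-sampling variants build on is the vanilla REINFORCE policy gradient; a minimal sketch with a constant baseline follows (the interface and shapes are illustrative assumptions, not taken from the paper):

```python
# Minimal sketch of the plain REINFORCE gradient estimate for one trajectory.
# Interface and shapes are illustrative assumptions.
import numpy as np

def reinforce_gradient(grad_logps, rewards, gamma=0.99, baseline=0.0):
    """grad_logps: (T, d) per-step gradients of log pi_theta(a_t | s_t)
    with respect to the d policy parameters; rewards: (T,) per-step rewards."""
    T = len(rewards)
    # Discounted return-to-go G_t for each step.
    returns = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        returns[t] = running
    # grad J(theta) ~ sum_t grad log pi(a_t|s_t) * (G_t - baseline);
    # subtracting a baseline reduces variance without biasing the estimate.
    return (grad_logps * (returns - baseline)[:, None]).sum(axis=0)

# Example with placeholder gradients for a 5-step trajectory.
rng = np.random.default_rng(0)
print(reinforce_gradient(rng.normal(size=(5, 3)), np.ones(5)))
```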

A Survey on Human Preference Learning for Large Language Models

R Jiang, K Chen, X Bai, Z He, J Li, M Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent surge of versatile large language models (LLMs) largely depends on aligning
increasingly capable foundation models with human intentions by preference learning …