DPO Meets PPO: Reinforced Token Optimization for RLHF
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of …
Human feedback plays a central role in the alignment of Large Language Models (LLMs).
However, open questions remain about the methods (how), domains (where), people (who) …
Scaling laws for reward model overoptimization in direct alignment algorithms
Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent
success of Large Language Models (LLMs), however, it is often a complex and brittle …
Learn your reference model for real good alignment
The complexity of the alignment problem stems from the fact that existing methods are
unstable. Researchers continuously invent various tricks to address this shortcoming. For …
Training Language Models to Self-Correct via Reinforcement Learning
Self-correction is a highly desirable capability of large language models (LLMs), yet it has
consistently been found to be largely ineffective in modern LLMs. Existing approaches for …
Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey
Preference tuning is a crucial process for aligning deep generative models with human
preferences. This survey offers a thorough overview of recent advancements in preference …
Stepwise Alignment for Constrained Language Model Policy Optimization
Safety and trustworthiness are indispensable requirements for applying AI systems based
on large language models (LLMs) in real-world applications. This paper formulates a human …
Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models
Recent advancements in generative models have sparked significant interest within the
machine learning community. Particularly, diffusion models have demonstrated remarkable …
Sample complexity of variance-reduced policy gradient: weaker assumptions and lower bounds
Several variance-reduced versions of REINFORCE based on importance sampling achieve
an improved O(ϵ⁻³) sample complexity to find an ϵ-stationary point, under an unrealistic …
A Survey on Human Preference Learning for Large Language Models
The recent surge of versatile large language models (LLMs) largely depends on aligning
increasingly capable foundation models with human intentions by preference learning …