Principled reinforcement learning with human feedback from pairwise or k-wise comparisons
We provide a theoretical framework for Reinforcement Learning with Human Feedback
(RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry …
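For context, the Bradley-Terry model named in this abstract links the latent reward r to the probability of observing a preference between two candidates (the k-wise case is typically handled by its Plackett-Luce generalization); in standard notation, added here for reference,

  \Pr(\tau^1 \succ \tau^2) = \frac{\exp(r(\tau^1))}{\exp(r(\tau^1)) + \exp(r(\tau^2))} = \sigma\bigl(r(\tau^1) - r(\tau^2)\bigr).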
Is RLHF more difficult than standard RL? A theoretical perspective
Reinforcement Learning from Human Feedback (RLHF) learns from preference
signals, while standard Reinforcement Learning (RL) directly learns from reward signals …
Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
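The KL-constrained alignment objective this line of work studies is usually written as KL-regularized reward maximization,

  \max_{\pi} \; \mathbb{E}_{x \sim d, \, y \sim \pi(\cdot \mid x)} \bigl[ r(x, y) \bigr] - \beta \, \mathrm{KL}\bigl( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr),

where \pi_{\mathrm{ref}} is the reference (supervised fine-tuned) policy and \beta trades reward against drift from it.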
Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where
instead of receiving a numeric reward at each step, the RL agent only receives preferences …
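To make this feedback model concrete, here is a minimal Python sketch (the function names and the logistic link are our assumptions, not the paper's code) of a simulated comparator that labels a pair of trajectories from their hidden returns, the only signal such an agent ever sees:

  import math
  import random

  def trajectory_return(rewards):
      """Sum of per-step rewards along one trajectory (hidden from the agent)."""
      return sum(rewards)

  def preference_label(rewards_a, rewards_b):
      """Sample a Bradley-Terry preference bit: 1 if trajectory A is
      preferred to trajectory B, 0 otherwise."""
      gap = trajectory_return(rewards_a) - trajectory_return(rewards_b)
      p_a = 1.0 / (1.0 + math.exp(-gap))  # sigmoid of the return gap
      return 1 if random.random() < p_a else 0

  # Trajectory A collects more reward, so it is preferred most of the time.
  print(preference_label([1.0, 0.5, 0.2], [0.1, 0.0, 0.3]))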
DPO meets PPO: Reinforced token optimization for RLHF
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
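For reference, the DPO objective that this paper connects to PPO fits the policy directly on preference pairs (y_w preferred to y_l), with the usual loss

  \mathcal{L}_{\mathrm{DPO}}(\pi) = -\,\mathbb{E}\Bigl[ \log \sigma\Bigl( \beta \log \tfrac{\pi(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \tfrac{\pi(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Bigr) \Bigr];

its implicit reward \beta \log \tfrac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} decomposes into a sum of per-token log-ratios, which is one bridge from sentence-level to token-level rewards.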
Dataset reset policy optimization for RLHF
Reinforcement Learning (RL) from Human Preference-based feedback is a popular
paradigm for fine-tuning generative models, which has produced impressive models such as …
Reinforcement learning with human feedback: Learning dynamic choices via pessimism
In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF) where
we aim to learn the human's underlying reward and the MDP's optimal policy from a set of …
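A generic form of the pessimism principle the title refers to (notation here is illustrative, not the paper's) is to plan against a lower confidence bound on the reward estimated from preferences,

  \hat{r}_{\mathrm{pess}}(\tau) = \hat{r}_{\mathrm{MLE}}(\tau) - \beta \, \Gamma(\tau),

where \Gamma(\tau) is an uncertainty width around the maximum-likelihood estimate, so trajectories poorly covered by the offline data are discounted rather than trusted.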
Convex reinforcement learning in finite trials
Convex Reinforcement Learning (RL) is a recently introduced framework that generalizes
the standard RL objective to any convex (or concave) function of the state distribution …
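Concretely, convex RL replaces the linear objective of standard RL with a general utility (concave, when maximizing) of the occupancy measure d^\pi induced by the policy:

  \max_{\pi} \; F(d^{\pi}), \qquad \text{with standard RL recovered by the linear choice } F(d) = \langle d, r \rangle.

Entropy maximization and imitation via divergence minimization are common instances of a nonlinear F.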
Provable offline reinforcement learning with human feedback
In this paper, we investigate the problem of offline reinforcement learning with human
feedback where feedback is available in the form of preference between trajectory pairs …
Advances in preference-based reinforcement learning: A review
Y Abdelkareem, S Shehata… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
Reinforcement Learning (RL) algorithms suffer from the dependency on accurately
engineered reward functions to properly guide the learning agents to do the required tasks …