Principled reinforcement learning with human feedback from pairwise or k-wise comparisons

B Zhu, M Jordan, J Jiao - International Conference on …, 2023 - proceedings.mlr.press
We provide a theoretical framework for Reinforcement Learning with Human Feedback
(RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry …
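
For reference, the Bradley-Terry comparison model named in this abstract, written here under a linear reward parameterization with an illustrative feature map \phi and parameter \theta, gives the probability that response y_1 is preferred to y_2 for a prompt x as

    P(y_1 \succ y_2 \mid x) = \frac{\exp(\theta^\top \phi(x, y_1))}{\exp(\theta^\top \phi(x, y_1)) + \exp(\theta^\top \phi(x, y_2))} = \sigma\big(\theta^\top \phi(x, y_1) - \theta^\top \phi(x, y_2)\big),

where \sigma is the logistic function; the reward parameters are then typically estimated by maximum likelihood on the observed comparisons.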

Is RLHF more difficult than standard RL? A theoretical perspective

Y Wang, Q Liu, C Jin - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Reinforcement Learning from Human Feedback (RLHF) learns from preference
signals, while standard Reinforcement Learning (RL) directly learns from reward signals …

Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint

W Xiong, H Dong, C Ye, Z Wang, H Zhong… - … on Machine Learning, 2024 - openreview.net
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
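
As background for the KL-constrained setting studied here, the commonly used KL-regularized RLHF objective (notation chosen for illustration) maximizes expected reward while penalizing divergence from a fixed reference policy \pi_{\mathrm{ref}}:

    \max_{\pi} \; \mathbb{E}_{x \sim d}\Big[ \mathbb{E}_{y \sim \pi(\cdot \mid x)}[r(x, y)] - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \Big],

with \beta > 0 controlling how far the aligned policy may drift from the reference model.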

Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation

X Chen, H Zhong, Z Yang, Z Wang… - … on Machine Learning, 2022 - proceedings.mlr.press
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where
instead of receiving a numeric reward at each step, the RL agent only receives preferences …
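
In this trajectory-preference setting, a common modeling assumption (stated here only as an illustration) is that the human compares two length-H trajectories through their cumulative rewards via a Bradley-Terry-style model,

    P(\tau_1 \succ \tau_2) = \sigma\Big( \sum_{h=1}^{H} r(s_h^1, a_h^1) - \sum_{h=1}^{H} r(s_h^2, a_h^2) \Big),

so the agent only ever observes relative, trajectory-level information about the reward.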

DPO meets PPO: Reinforced token optimization for RLHF

H Zhong, G Feng, W Xiong, X Cheng, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal
Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards--a …
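
For context on the DPO side of the title, the standard direct preference optimization loss over a prompt x with preferred and rejected responses (y_w, y_l), written with illustrative notation, is

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\Big[ \log \sigma\Big( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big) \Big],

which sidesteps an explicit reward model, whereas the PPO-based pipeline optimizes a learned reward under a KL penalty.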

Dataset reset policy optimization for RLHF

JD Chang, W Zhan, O Oertell, K Brantley… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning (RL) from Human Preference-based feedback is a popular
paradigm for fine-tuning generative models, which has produced impressive models such as …

Reinforcement learning with human feedback: Learning dynamic choices via pessimism

Z Li, Z Yang, M Wang - arXiv preprint arXiv:2305.18438, 2023 - arxiv.org
In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF) where
we aim to learn the human's underlying reward and the MDP's optimal policy from a set of …

Convex reinforcement learning in finite trials

M Mutti, R De Santi, P De Bartolomeis… - Journal of Machine …, 2023 - jmlr.org
Convex Reinforcement Learning (RL) is a recently introduced framework that generalizes
the standard RL objective to any convex (or concave) function of the state distribution …
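
Concretely, the convex RL objective generalizes the usual linear-in-reward return to an arbitrary concave (or convex, for minimization) function F of the policy's state distribution (notation illustrative),

    \max_{\pi} \; F\big(d^{\pi}\big), \qquad d^{\pi}(s) = \frac{1}{H} \sum_{h=1}^{H} \Pr(s_h = s \mid \pi),

and standard RL is recovered when F is linear, i.e. F(d) = \langle r, d \rangle for some reward vector r.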

Provable offline reinforcement learning with human feedback

W Zhan, M Uehara, N Kallus, JD Lee… - ICML 2023 Workshop …, 2023 - openreview.net
In this paper, we investigate the problem of offline reinforcement learning with human
feedback where feedback is available in the form of preference between trajectory pairs …
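
One standard way pessimism enters offline preference-based RL (sketched here with illustrative notation, not necessarily the exact construction of this paper) is to maximize the worst-case value over a confidence set of reward functions consistent with the preference data \mathcal{D}:

    \hat{\pi} = \arg\max_{\pi} \; \min_{r \in \mathcal{C}(\mathcal{D})} \; \mathbb{E}_{\tau \sim \pi}\big[ r(\tau) \big],

where \mathcal{C}(\mathcal{D}) is, for example, the set of rewards whose log-likelihood on the comparison data is close to that of the maximum-likelihood estimate.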

Advances in preference-based reinforcement learning: A review

Y Abdelkareem, S Shehata… - 2022 IEEE international …, 2022 - ieeexplore.ieee.org
Reinforcement Learning (RL) algorithms suffer from their dependence on accurately
engineered reward functions to properly guide learning agents toward the required tasks …