Principled reinforcement learning with human feedback from pairwise or K-wise comparisons
We provide a theoretical framework for Reinforcement Learning with Human Feedback
(RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry …
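For context on the comparison model this analysis relies on: under Bradley-Terry, the probability that one response is preferred over another is a logistic function of the gap between their scalar rewards. A minimal sketch in Python, with illustrative reward values that are not taken from the paper:

import math

def bradley_terry_prob(r_a: float, r_b: float) -> float:
    # P(A preferred over B) = sigmoid(r_a - r_b) under the Bradley-Terry model.
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

# Example: a reward gap of 0.5 yields a mild preference for A (~0.62).
print(bradley_terry_prob(1.2, 0.7))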
A minimaximalist approach to reinforcement learning from human feedback
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement
learning from human feedback. Our approach is minimalist in that it does not require training …
A survey of reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
Reinforcement learning with human feedback: Learning dynamic choices via pessimism
In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF) where
we aim to learn the human's underlying reward and the MDP's optimal policy from a set of …
Dueling RL: Reinforcement learning with trajectory preferences
We consider the problem of preference-based reinforcement learning (PbRL), where, unlike
traditional reinforcement learning (RL), an agent receives feedback only in terms of 1 bit …
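To make the PbRL feedback model concrete: the learner never observes numeric rewards, only a single bit indicating which of two trajectories a (possibly noisy) annotator preferred. A minimal sketch of such a comparison oracle, assuming a hidden trajectory-level utility and a logistic noise model; all names here are illustrative, not from the paper:

import math
import random

def preference_oracle(utility_a: float, utility_b: float) -> int:
    # 1-bit (0/1) feedback: returns 1 if trajectory A is preferred over B.
    # The comparison is noisy, with win probability given by a logistic
    # (Bradley-Terry-style) link on hidden utilities the learner never sees.
    p_a_wins = 1.0 / (1.0 + math.exp(-(utility_a - utility_b)))
    return 1 if random.random() < p_a_wins else 0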
Versatile dueling bandits: Best-of-both world analyses for learning from relative preferences
A Saha, P Gaillard - International Conference on Machine Learning, 2022 - proceedings.mlr.press
We study the problem of $K$-armed dueling bandit for both stochastic and adversarial
environments, where the goal of the learner is to aggregate information through relative …
Iterative data smoothing: Mitigating reward overfitting and overoptimization in RLHF
Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique that aligns
language models closely with human-centric values. The initial phase of RLHF involves …
Contextual bandits and imitation learning via preference-based active queries
We consider the problem of contextual bandits and imitation learning, where the learner
lacks direct knowledge of the executed action's reward. Instead, the learner can actively …