Principled reinforcement learning with human feedback from pairwise or k-wise comparisons

B Zhu, M Jordan, J Jiao - International Conference on …, 2023 - proceedings.mlr.press
We provide a theoretical framework for Reinforcement Learning with Human Feedback
(RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry …
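
The Bradley-Terry model referenced here assigns a pairwise comparison the win probability sigmoid(r(x_i) - r(x_j)), so the reward parameters can be fit by maximum likelihood. Below is a minimal sketch of that estimator for a linear reward r(x) = theta @ x on synthetic comparisons; all data and names are hypothetical stand-ins, not from the paper:

```python
import numpy as np
from scipy.optimize import minimize

def bt_nll(theta, X_win, X_lose):
    """Negative log-likelihood under Bradley-Terry with linear reward r(x) = theta @ x."""
    margins = (X_win - X_lose) @ theta           # r(winner) - r(loser) per comparison
    return np.sum(np.logaddexp(0.0, -margins))   # sum of -log sigmoid(margin)

# Hypothetical data: 100 comparisons over 5-dimensional feature vectors.
rng = np.random.default_rng(0)
theta_true = rng.normal(size=5)
X_a, X_b = rng.normal(size=(100, 5)), rng.normal(size=(100, 5))
p_a_wins = 1.0 / (1.0 + np.exp(-(X_a - X_b) @ theta_true))
a_wins = rng.random(100) < p_a_wins
X_win = np.where(a_wins[:, None], X_a, X_b)      # winner's features per comparison
X_lose = np.where(a_wins[:, None], X_b, X_a)

theta_hat = minimize(bt_nll, np.zeros(5), args=(X_win, X_lose)).x  # MLE of theta
```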

A minimaximalist approach to reinforcement learning from human feedback

G Swamy, C Dann, R Kidambi, ZS Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement
learning from human feedback. Our approach is minimalist in that it does not require training …
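
SPO's key move, per the abstract, is to skip the reward model and let a single policy play against itself, scoring samples by whether a preference oracle says they win. A toy bandit-level sketch of that self-play idea follows; the oracle, learning rate, and REINFORCE update are my assumptions, not the paper's algorithm:

```python
import numpy as np

def spo_sketch(pref_oracle, n_actions, rounds=5000, lr=0.05, rng=None):
    """Toy self-play loop (bandit case): one softmax policy duels itself and
    each sampled action is reinforced by whether it won the comparison."""
    rng = rng or np.random.default_rng(0)
    logits = np.zeros(n_actions)
    for _ in range(rounds):
        p = np.exp(logits - logits.max()); p /= p.sum()
        a, b = rng.choice(n_actions, size=2, p=p)     # policy plays itself
        win_a = pref_oracle(a, b)                     # 1 bit from the oracle
        for act, r in ((a, win_a), (b, 1 - win_a)):
            grad = -p; grad[act] += 1.0               # REINFORCE: grad of log pi(act)
            logits = logits + lr * (r - 0.5) * grad   # win indicator, 0.5 baseline
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Hypothetical oracle: the higher-indexed action wins with probability 0.9.
rng = np.random.default_rng(1)
oracle = lambda a, b: int(rng.random() < (0.9 if a > b else 0.1 if a < b else 0.5))
policy = spo_sketch(oracle, n_actions=4, rng=rng)     # mass concentrates on action 3
```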

A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
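
The survey's framing reduces to a one-line substitution: the scalar the agent optimizes comes from a model fit to human comparisons instead of a function the designer wrote by hand. A schematic contrast, with both definitions hypothetical:

```python
# Engineered reward: a hand-written function the designer must get right.
def engineered_reward(state: dict) -> float:
    return 1.0 if state.get("goal_reached") else -0.01

# RLHF-style reward: a model fit to human comparison data stands in for it.
class LearnedReward:
    def __init__(self, theta):
        self.theta = theta  # parameters fit on human preference labels

    def __call__(self, features) -> float:
        return sum(t * f for t, f in zip(self.theta, features))
```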

Reinforcement learning with human feedback: Learning dynamic choices via pessimism

Z Li, Z Yang, M Wang - arXiv preprint arXiv:2305.18438, 2023 - arxiv.org
In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF) where
we aim to learn the human's underlying reward and the MDP's optimal policy from a set of …
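
Pessimism in this offline setting means acting on a lower confidence bound of the learned reward, so the policy cannot exploit regions the dataset never covered. A minimal sketch with a linear reward and an elliptical confidence set; the data, estimator, and beta are stand-ins, not the paper's construction:

```python
import numpy as np

def pessimistic_reward(x, theta_hat, Sigma_inv, beta):
    """Lower confidence bound: r_hat(x) minus a width term that grows where
    the offline data has poor coverage of x."""
    bonus = beta * np.sqrt(x @ Sigma_inv @ x)
    return x @ theta_hat - bonus

# Hypothetical offline covariates; Sigma is the regularized design matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Sigma_inv = np.linalg.inv(X.T @ X + 1e-3 * np.eye(4))
theta_hat = rng.normal(size=4)  # stand-in for the reward MLE

candidates = rng.normal(size=(10, 4))
best = max(candidates, key=lambda x: pessimistic_reward(x, theta_hat, Sigma_inv, beta=1.0))
```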

Dueling RL: Reinforcement learning with trajectory preferences

A Saha, A Pacchiano, J Lee - International Conference on …, 2023 - proceedings.mlr.press
We consider the problem of preference-based reinforcement learning (PbRL), where, unlike
traditional reinforcement learning (RL), an agent receives feedback only in terms of 1 bit …
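
The 1-bit feedback channel works roughly as follows: the agent never sees per-step rewards, only which of two rollouts a labeler prefers, commonly modeled as logistic in the gap between the trajectories' hidden total rewards. A toy simulation of that channel, with reward function and trajectories hypothetical:

```python
import numpy as np

def trajectory_preference(traj_a, traj_b, reward_fn, rng):
    """1-bit feedback: 1 if traj_a is preferred, drawn from a logistic model
    over the gap between the trajectories' hidden total rewards."""
    score_a = sum(reward_fn(s, a) for s, a in traj_a)
    score_b = sum(reward_fn(s, a) for s, a in traj_b)
    p_a = 1.0 / (1.0 + np.exp(-(score_a - score_b)))
    return int(rng.random() < p_a)

# Hypothetical rollouts: (state, action) pairs scored by a toy hidden reward.
rng = np.random.default_rng(0)
hidden_reward = lambda s, a: float(s == a)
traj_1 = [(rng.integers(3), rng.integers(3)) for _ in range(10)]
traj_2 = [(rng.integers(3), rng.integers(3)) for _ in range(10)]
bit = trajectory_preference(traj_1, traj_2, hidden_reward, rng)
```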

Versatile dueling bandits: Best-of-both world analyses for learning from relative preferences

A Saha, P Gaillard - International Conference on Machine …, 2022 - proceedings.mlr.press
We study the problem of $K$-armed dueling bandit for both stochastic and adversarial
environments, where the goal of the learner is to aggregate information through relative …
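
In the $K$-armed dueling bandit protocol, the learner pulls a pair of arms each round and observes a single noisy bit indicating the winner; preference estimates must be assembled from those bits alone. A sketch of the protocol with a uniform-exploration learner, purely illustrative and not the paper's best-of-both-worlds algorithm:

```python
import numpy as np

def dueling_bandit(duel, K, rounds, rng):
    """Estimate the pairwise win-rate matrix from binary duel outcomes."""
    wins = np.zeros((K, K))    # wins[i, j]: times arm i beat arm j
    plays = np.zeros((K, K))
    for _ in range(rounds):
        i, j = rng.choice(K, size=2, replace=False)  # uniform exploration
        if duel(i, j):                               # the only observable: 1 bit
            wins[i, j] += 1
        else:
            wins[j, i] += 1
        plays[i, j] += 1
        plays[j, i] += 1
    return wins / np.maximum(plays, 1)               # empirical preference matrix

# Hypothetical environment: true probability that arm i beats arm j.
rng = np.random.default_rng(0)
true_p = np.array([[0.5, 0.7, 0.8], [0.3, 0.5, 0.6], [0.2, 0.4, 0.5]])
duel = lambda i, j: int(rng.random() < true_p[i, j])
P_hat = dueling_bandit(duel, K=3, rounds=2000, rng=rng)
```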

Iterative data smoothing: Mitigating reward overfitting and overoptimization in RLHF

B Zhu, MI Jordan, J Jiao - arXiv preprint arXiv:2401.16335, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique that aligns
language models closely with human-centric values. The initial phase of RLHF involves …
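
The overfitting the title refers to arises in that reward-modeling phase: hard 0/1 preference labels on finite data push the model's logits toward the extremes. Iterative smoothing counters this by repeatedly mixing the labels with the model's own predictions. A heavily simplified sketch of one such update; the mixing rule and loop skeleton are my reading of the idea, not the paper's exact procedure:

```python
import numpy as np

def smooth_labels(labels, model_probs, beta=0.7):
    """One smoothing round: mix hard 0/1 labels with the model's predictions,
    pulling targets away from the extremes the model would overfit to."""
    return beta * labels + (1 - beta) * model_probs

# Skeleton of the iterated loop: fit, predict, soften, repeat.
labels = np.array([1.0, 0.0, 1.0, 1.0])  # 1 = response A preferred
for epoch in range(3):
    # Stand-in for model.fit(...) followed by model.predict_proba(...).
    model_probs = np.clip(labels + np.random.default_rng(epoch).normal(0, 0.1, 4), 0, 1)
    labels = smooth_labels(labels, model_probs)
```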

Contextual bandits and imitation learning with preference-based active queries

A Sekhari, K Sridharan, W Sun… - Advances in Neural …, 2024 - proceedings.neurips.cc
We consider the problem of contextual bandits and imitation learning, where the learner
lacks direct knowledge of the executed action's reward. Instead, the learner can actively …
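
Active queries let the learner ration labeler effort: it asks for a comparison only when its current reward estimate cannot separate the two candidates. A minimal uncertainty-gated query rule; the margin threshold and oracle here are hypothetical, and the paper's query condition is more refined:

```python
import numpy as np

def maybe_query(x_a, x_b, theta_hat, oracle, margin=0.5):
    """Ask the human oracle only when the estimated reward gap is small."""
    gap = (x_a - x_b) @ theta_hat
    if abs(gap) >= margin:
        return int(gap > 0), False   # confident enough: no query issued
    return oracle(x_a, x_b), True    # uncertain: spend a label

# Hypothetical noisy labeler sharing the same underlying linear reward.
rng = np.random.default_rng(0)
theta_hat = rng.normal(size=3)
oracle = lambda a, b: int((a - b) @ theta_hat + rng.normal(0, 0.1) > 0)
pref, queried = maybe_query(rng.normal(size=3), rng.normal(size=3), theta_hat, oracle)
```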