Principled reinforcement learning with human feedback from pairwise or K-wise comparisons

B Zhu, M Jordan, J Jiao - International Conference on …, 2023 - proceedings.mlr.press
We provide a theoretical framework for Reinforcement Learning with Human Feedback
(RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry …
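
For context, the Bradley-Terry model named in the snippet scores a comparison through the difference of item rewards. A minimal sketch, assuming the linear reward r(x) = θᵀφ(x) from the abstract; the function names and feature map are illustrative, not from the paper:

```python
import numpy as np

def bt_prob(theta, phi_a, phi_b):
    # Bradley-Terry: P(a preferred over b) = sigmoid(r(a) - r(b)), with r(x) = theta @ phi(x)
    return 1.0 / (1.0 + np.exp(-(theta @ (phi_a - phi_b))))

def bt_nll(theta, comparisons):
    # Negative log-likelihood over (phi_winner, phi_loser) feature pairs;
    # minimizing this gives the maximum-likelihood reward estimate.
    return -sum(np.log(bt_prob(theta, pw, pl)) for pw, pl in comparisons)
```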

Dueling RL: Reinforcement learning with trajectory preferences

A Saha, A Pacchiano, J Lee - International Conference on …, 2023 - proceedings.mlr.press
We consider the problem of preference-based reinforcement learning (PbRL), where, unlike
traditional reinforcement learning (RL), an agent receives feedback only in terms of 1 bit …

Preference-based online learning with dueling bandits: A survey

V Bengs, R Busa-Fekete, A El Mesaoudi-Paul… - Journal of Machine …, 2021 - jmlr.org
In machine learning, the notion of multi-armed bandits refers to a class of online learning
problems, in which an agent is supposed to simultaneously explore and exploit a given set …
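
As a concrete illustration of the explore/exploit loop over pairwise comparisons that the snippet describes, here is a toy epsilon-greedy duel loop; the logistic preference matrix and the exploration rule are assumptions for illustration, not an algorithm from the survey:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5

# Hypothetical logistic preference matrix: P[i, j] = P(arm i wins a duel against arm j).
strengths = np.linspace(0.0, 1.0, K)
P = 1.0 / (1.0 + np.exp(strengths[None, :] - strengths[:, None]))

wins = np.zeros((K, K))   # wins[i, j]: times arm i beat arm j
plays = np.zeros((K, K))  # plays[i, j]: duels between i and j

for t in range(5000):
    if rng.random() < 0.1:                        # explore: random pair
        i, j = rng.choice(K, size=2, replace=False)
    else:                                         # exploit: duel the two best empirical arms
        totals = plays.sum(axis=1)
        rates = np.divide(wins.sum(axis=1), totals,
                          out=np.full(K, 0.5), where=totals > 0)
        i, j = np.argsort(rates)[-2:]
    outcome = rng.random() < P[i, j]              # 1-bit relative feedback
    plays[i, j] += 1
    plays[j, i] += 1
    wins[i, j] += outcome
    wins[j, i] += 1 - outcome

best = np.argmax(wins.sum(axis=1) / np.maximum(plays.sum(axis=1), 1))
print("empirical winner:", best)
```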

Versatile dueling bandits: Best-of-both world analyses for learning from relative preferences

A Saha, P Gaillard - International Conference on Machine …, 2022 - proceedings.mlr.press
We study the $K$-armed dueling bandit problem for both stochastic and adversarial
environments, where the goal of the learner is to aggregate information through relative …

Optimal algorithms for stochastic contextual preference bandits

A Saha - Advances in Neural Information Processing …, 2021 - proceedings.neurips.cc
We consider the problem of preference bandits in the contextual setting. At each round, the
learner is presented with a context set of $K$ items, chosen randomly from a potentially …

Iterative data smoothing: Mitigating reward overfitting and overoptimization in RLHF

B Zhu, MI Jordan, J Jiao - arXiv preprint arXiv:2401.16335, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique that aligns
language models closely with human-centric values. The initial phase of RLHF involves …
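
The reward-modeling phase the snippet alludes to fits a model to pairwise human comparisons, and hard 0/1 preference labels invite the overfitting the title mentions. A sketch of a single update under an assumed linear reward model; the smooth knob below is plain label smoothing used as a generic stand-in, not the paper's iterative data smoothing scheme:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reward_step(theta, phi_w, phi_l, lr=0.1, smooth=0.0):
    # One gradient step on the pairwise cross-entropy loss for a linear reward model.
    # smooth > 0 softens the hard preference label toward 0.5 -- a generic guard
    # against overfitting, NOT the paper's iterative data smoothing method.
    target = 1.0 - smooth                 # soft label for the preferred response
    p = sigmoid(theta @ (phi_w - phi_l))  # P(winner preferred) under current model
    return theta + lr * (target - p) * (phi_w - phi_l)
```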

Stochastic contextual dueling bandits under linear stochastic transitivity models

V Bengs, A Saha, E Hüllermeier - … Conference on Machine …, 2022 - proceedings.mlr.press
We consider the regret minimization task in a dueling bandits problem with context
information. In every round of the sequential decision problem, the learner makes a context …

Provable benefits of policy learning from human preferences in contextual bandit problems

X Ji, H Wang, M Chen, T Zhao, M Wang - arXiv preprint arXiv:2307.12975, 2023 - arxiv.org
A crucial task in decision-making problems is reward engineering. It is common in practice
that no obvious choice of reward function exists. Thus, a popular approach is to introduce …

Adversarial dueling bandits

A Saha, T Koren, Y Mansour - International Conference on …, 2021 - proceedings.mlr.press
We introduce the problem of regret minimization in Adversarial Dueling Bandits. As in
classic Dueling Bandits, the learner has to repeatedly choose a pair of items and observe …