Principled reinforcement learning with human feedback from pairwise or K-wise comparisons

B Zhu, M Jordan, J Jiao - International Conference on …, 2023 - proceedings.mlr.press
We provide a theoretical framework for Reinforcement Learning with Human Feedback
(RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry …
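
For context, the Bradley-Terry model named in the snippet scores a comparison through the difference of item rewards. A minimal sketch, assuming the linear reward r(x) = θᵀφ(x) from the abstract; the function names and feature map are illustrative, not from the paper:

```python
import numpy as np

def bt_prob(theta, phi_a, phi_b):
    # Bradley-Terry: P(a preferred over b) = sigmoid(r(a) - r(b)), with r(x) = theta @ phi(x)
    return 1.0 / (1.0 + np.exp(-(theta @ (phi_a - phi_b))))

def bt_nll(theta, comparisons):
    # Negative log-likelihood over (phi_winner, phi_loser) feature pairs;
    # minimizing this gives the maximum-likelihood reward estimate.
    return -sum(np.log(bt_prob(theta, pw, pl)) for pw, pl in comparisons)
```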

Dueling RL: Reinforcement learning with trajectory preferences

A Saha, A Pacchiano, J Lee - International Conference on …, 2023 - proceedings.mlr.press
We consider the problem of preference-based reinforcement learning (PbRL), where, unlike
traditional reinforcement learning (RL), an agent receives feedback only in terms of 1 bit …

Preference-based online learning with dueling bandits: A survey

V Bengs, R Busa-Fekete, A El Mesaoudi-Paul… - Journal of Machine …, 2021 - jmlr.org
In machine learning, the notion of multi-armed bandits refers to a class of online learning
problems, in which an agent is supposed to simultaneously explore and exploit a given set …
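
As a concrete illustration of the explore/exploit loop over pairwise comparisons that the snippet describes, here is a toy epsilon-greedy duel loop; the logistic preference matrix and the exploration rule are assumptions for illustration, not an algorithm from the survey:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5

# Hypothetical logistic preference matrix: P[i, j] = P(arm i wins a duel against arm j).
strengths = np.linspace(0.0, 1.0, K)
P = 1.0 / (1.0 + np.exp(strengths[None, :] - strengths[:, None]))

wins = np.zeros((K, K))   # wins[i, j]: times arm i beat arm j
plays = np.zeros((K, K))  # plays[i, j]: duels between i and j

for t in range(5000):
    if rng.random() < 0.1:                        # explore: random pair
        i, j = rng.choice(K, size=2, replace=False)
    else:                                         # exploit: duel the two best empirical arms
        totals = plays.sum(axis=1)
        rates = np.divide(wins.sum(axis=1), totals,
                          out=np.full(K, 0.5), where=totals > 0)
        i, j = np.argsort(rates)[-2:]
    outcome = rng.random() < P[i, j]              # 1-bit relative feedback
    plays[i, j] += 1
    plays[j, i] += 1
    wins[i, j] += outcome
    wins[j, i] += 1 - outcome

best = np.argmax(wins.sum(axis=1) / np.maximum(plays.sum(axis=1), 1))
print("empirical winner:", best)
```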

Versatile dueling bandits: Best-of-both world analyses for learning from relative preferences

A Saha, P Gaillard - International Conference on Machine …, 2022 - proceedings.mlr.press
We study the $K$-armed dueling bandit problem for both stochastic and adversarial
environments, where the goal of the learner is to aggregate information through relative …

Optimal algorithms for stochastic contextual preference bandits

A Saha - Advances in Neural Information Processing …, 2021 - proceedings.neurips.cc
We consider the problem of preference bandits in the contextual setting. At each round, the
learner is presented with a context set of $K$ items, chosen randomly from a potentially …

Iterative data smoothing: Mitigating reward overfitting and overoptimization in RLHF

B Zhu, MI Jordan, J Jiao - arXiv preprint arXiv:2401.16335, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique that aligns
language models closely with human-centric values. The initial phase of RLHF involves …
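
The reward-modeling phase the snippet alludes to fits a model to pairwise human comparisons, and hard 0/1 preference labels invite the overfitting the title mentions. A sketch of a single update under an assumed linear reward model; the smooth knob below is plain label smoothing used as a generic stand-in, not the paper's iterative data smoothing scheme:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reward_step(theta, phi_w, phi_l, lr=0.1, smooth=0.0):
    # One gradient step on the pairwise cross-entropy loss for a linear reward model.
    # smooth > 0 softens the hard preference label toward 0.5 -- a generic guard
    # against overfitting, NOT the paper's iterative data smoothing method.
    target = 1.0 - smooth                 # soft label for the preferred response
    p = sigmoid(theta @ (phi_w - phi_l))  # P(winner preferred) under current model
    return theta + lr * (target - p) * (phi_w - phi_l)
```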

Stochastic contextual dueling bandits under linear stochastic transitivity models

V Bengs, A Saha, E Hüllermeier - … Conference on Machine …, 2022 - proceedings.mlr.press
We consider the regret minimization task in a dueling bandits problem with context
information. In every round of the sequential decision problem, the learner makes a context …

Provable benefits of policy learning from human preferences in contextual bandit problems

X Ji, H Wang, M Chen, T Zhao, M Wang - arXiv preprint arXiv:2307.12975, 2023 - arxiv.org
A crucial task in decision-making problems is reward engineering. It is common in practice
that no obvious choice of reward function exists. Thus, a popular approach is to introduce …

Adversarial dueling bandits

A Saha, T Koren, Y Mansour - International Conference on …, 2021 - proceedings.mlr.press
We introduce the problem of regret minimization in Adversarial Dueling Bandits. As in
classic Dueling Bandits, the learner has to repeatedly choose a pair of items and observe …