Principled reinforcement learning with human feedback from pairwise or k-wise comparisons

B Zhu, M Jordan, J Jiao - International Conference on …, 2023 - proceedings.mlr.press

We provide a theoretical framework for Reinforcement Learning with Human Feedback
(RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry …

Cited by 153 · Related articles · All 8 versions

[PDF] microsoft.com

Towards conversational recommender systems

K Christakopoulou, F Radlinski… - Proceedings of the 22nd …, 2016 - dl.acm.org

People often ask others for restaurant recommendations as a way to discover new dining
experiences. This makes restaurant recommendation an exciting scenario for recommender …

Cited by 488 · Related articles · All 7 versions

[PDF] mlr.press

Dueling RL: Reinforcement learning with trajectory preferences

A Saha, A Pacchiano, J Lee - International Conference on …, 2023 - proceedings.mlr.press

We consider the problem of preference-based reinforcement learning (PbRL), where, unlike
traditional reinforcement learning (RL), an agent receives feedback only in terms of 1 bit …

Cited by 33 · Related articles

[PDF] arxiv.org

Dueling RL: reinforcement learning with trajectory preferences

A Pacchiano, A Saha, J Lee - arXiv preprint arXiv:2111.04850, 2021 - arxiv.org

We consider the problem of preference based reinforcement learning (PbRL), where, unlike
traditional reinforcement learning, an agent receives feedback only in terms of a 1 bit (0/1) …

Cited by 50 · Related articles · All 2 versions

[PDF] jmlr.org

Preference-based online learning with dueling bandits: A survey

V Bengs, R Busa-Fekete, A El Mesaoudi-Paul… - Journal of Machine …, 2021 - jmlr.org

In machine learning, the notion of multi-armed bandits refers to a class of online learning
problems, in which an agent is supposed to simultaneously explore and exploit a given set …

Cited by 116 · Related articles · All 7 versions

[PDF] mlr.press

Efficient and optimal algorithms for contextual dueling bandits under realizability

A Saha, A Krishnamurthy - International Conference on …, 2022 - proceedings.mlr.press

We study the $K$-armed contextual dueling bandit problem, a sequential decision making
setting in which the learner uses contextual information to make two decisions, but only …

Cited by 39 · Related articles · All 3 versions

[PDF] mlr.press

Versatile dueling bandits: Best-of-both world analyses for learning from relative preferences

A Saha, P Gaillard - International Conference on Machine …, 2022 - proceedings.mlr.press

We study the problem of $K$-armed dueling bandit for both stochastic and adversarial
environments, where the goal of the learner is to aggregate information through relative …

Cited by 22 · Related articles · All 2 versions

[PDF] arxiv.org

Iterative data smoothing: Mitigating reward overfitting and overoptimization in RLHF

B Zhu, MI Jordan, J Jiao - arXiv preprint arXiv:2401.16335, 2024 - arxiv.org

Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique that aligns
language models closely with human-centric values. The initial phase of RLHF involves …

Cited by 17 · Related articles · All 3 versions

[PDF] arxiv.org

Multi-dueling bandits with dependent arms

Y Sui, V Zhuang, JW Burdick, Y Yue - arXiv preprint arXiv:1705.00253, 2017 - arxiv.org

The dueling bandits problem is an online learning framework for learning from pairwise
preference feedback, and is particularly well-suited for modeling settings that elicit …

Cited by 89 · Related articles · All 9 versions

[PDF] ijcai.org

[PDF] Advancements in Dueling Bandits.

Y Sui, M Zoghi, K Hofmann, Y Yue - IJCAI, 2018 - ijcai.org

The dueling bandits problem is an online learning framework where learning happens “on-the-fly”
through preference feedback, i.e., from comparisons between a pair of actions. Unlike …

Cited by 74 · Related articles · All 5 versions