Adversarial dueling bandits

A minimaximalist approach to reinforcement learning from human feedback

G Swamy, C Dann, R Kidambi, ZS Wu… - arXiv preprint arXiv …, 2024 - arxiv.org

We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement
learning from human feedback. Our approach is minimalist in that it does not require training …

Cited by 53 - Related articles - All 3 versions

[PDF] mlr.press

Efficient and optimal algorithms for contextual dueling bandits under realizability

A Saha, A Krishnamurthy - International Conference on …, 2022 - proceedings.mlr.press

We study the $K$-armed contextual dueling bandit problem, a sequential decision making
setting in which the learner uses contextual information to make two decisions, but only …

Cited by 39 - Related articles - All 3 versions

[PDF] mlr.press

Versatile dueling bandits: Best-of-both world analyses for learning from relative preferences

A Saha, P Gaillard - International Conference on Machine …, 2022 - proceedings.mlr.press

We study the problem of $K$-armed dueling bandit for both stochastic and adversarial
environments, where the goal of the learner is to aggregate information through relative …

Cited by 23 - Related articles - All 2 versions

[PDF] arxiv.org

Variance-aware regret bounds for stochastic contextual dueling bandits

Q Di, T Jin, Y Wu, H Zhao, F Farnoud, Q Gu - arXiv preprint arXiv …, 2023 - arxiv.org

Dueling bandits is a prominent framework for decision-making involving preferential
feedback, a valuable feature that fits various applications involving human interaction, such …

Cited by 10 - Related articles - All 4 versions

[PDF] mlr.press

Think Before You Duel: Understanding Complexities of Preference Learning under Constrained Resources

R Deb, A Saha, A Banerjee - International Conference on …, 2024 - proceedings.mlr.press

We consider the problem of reward maximization in the dueling bandit setup along with
constraints on resource consumption. As in the classic dueling bandits, at each round the …

Cited by 2 - Related articles - All 3 versions

[PDF] arxiv.org

Borda regret minimization for generalized linear dueling bandits

Y Wu, T Jin, H Lou, F Farnoud, Q Gu - arXiv preprint arXiv:2303.08816, 2023 - arxiv.org

Dueling bandits are widely used to model preferential feedback prevalent in many
applications such as recommendation systems and ranking. In this paper, we study the …

Cited by 9 - Related articles - All 4 versions

[PDF] arxiv.org

Nearly optimal algorithms for contextual dueling bandits from adversarial feedback

Q Di, J He, Q Gu - arXiv preprint arXiv:2404.10776, 2024 - arxiv.org

Learning from human feedback plays an important role in aligning generative models, such
as large language models (LLM). However, the effectiveness of this approach can be …

Cited by 1 - Related articles - All 2 versions

[PDF] mlr.press

Anaconda: An improved dynamic regret algorithm for adaptive non-stationary dueling bandits

TK Buening, A Saha - International Conference on Artificial …, 2023 - proceedings.mlr.press

We study the problem of non-stationary dueling bandits and provide the first adaptive
dynamic regret algorithm for this problem. The only two existing attempts in this line of work …

Cited by 7 - Related articles - All 2 versions

[PDF] mlr.press

One arrow, two kills: A unified framework for achieving optimal regret guarantees in sleeping bandits

P Gaillard, A Saha, S Dan - International Conference on …, 2023 - proceedings.mlr.press

We address the problem of Internal Regret in adversarial Sleeping Bandits and the
relationship between different notions of sleeping regrets in multi-armed bandits. We …

Cited by 3 - Related articles - All 7 versions

[PDF] mlr.press

Optimal and efficient dynamic regret algorithms for non-stationary dueling bandits

A Saha, S Gupta - International Conference on Machine …, 2022 - proceedings.mlr.press

We study the problem of dynamic regret minimization in $K$-armed Dueling Bandits under
non-stationary or time-varying preferences. This is an online learning setup where the agent …

Cited by 8 - Related articles - All 5 versions

A minimaximalist approach to reinforcement learning from human feedback

