A minimaximalist approach to reinforcement learning from human feedback

G Swamy, C Dann, R Kidambi, ZS Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement
learning from human feedback. Our approach is minimalist in that it does not require training …

Efficient and optimal algorithms for contextual dueling bandits under realizability

A Saha, A Krishnamurthy - International Conference on …, 2022 - proceedings.mlr.press
We study the $K$-armed contextual dueling bandit problem, a sequential decision making
setting in which the learner uses contextual information to make two decisions, but only …

Versatile dueling bandits: Best-of-both world analyses for learning from relative preferences

A Saha, P Gaillard - International Conference on Machine …, 2022 - proceedings.mlr.press
We study the problem of $K$-armed dueling bandit for both stochastic and adversarial
environments, where the goal of the learner is to aggregate information through relative …

Variance-aware regret bounds for stochastic contextual dueling bandits

Q Di, T Jin, Y Wu, H Zhao, F Farnoud, Q Gu - arXiv preprint arXiv …, 2023 - arxiv.org
Dueling bandits is a prominent framework for decision-making involving preferential
feedback, a valuable feature that fits various applications involving human interaction, such …

Think Before You Duel: Understanding Complexities of Preference Learning under Constrained Resources

R Deb, A Saha, A Banerjee - International Conference on …, 2024 - proceedings.mlr.press
We consider the problem of reward maximization in the dueling bandit setup along with
constraints on resource consumption. As in the classic dueling bandits, at each round the …

Borda regret minimization for generalized linear dueling bandits

Y Wu, T Jin, H Lou, F Farnoud, Q Gu - arXiv preprint arXiv:2303.08816, 2023 - arxiv.org
Dueling bandits are widely used to model preferential feedback prevalent in many
applications such as recommendation systems and ranking. In this paper, we study the …

Nearly optimal algorithms for contextual dueling bandits from adversarial feedback

Q Di, J He, Q Gu - arXiv preprint arXiv:2404.10776, 2024 - arxiv.org
Learning from human feedback plays an important role in aligning generative models, such
as large language models (LLM). However, the effectiveness of this approach can be …

Anaconda: An improved dynamic regret algorithm for adaptive non-stationary dueling bandits

TK Buening, A Saha - International Conference on Artificial …, 2023 - proceedings.mlr.press
We study the problem of non-stationary dueling bandits and provide the first adaptive
dynamic regret algorithm for this problem. The only two existing attempts in this line of work …

One arrow, two kills: A unified framework for achieving optimal regret guarantees in sleeping bandits

P Gaillard, A Saha, S Dan - International Conference on …, 2023 - proceedings.mlr.press
We address the problem of Internal Regret in adversarial Sleeping Bandits and the
relationship between different notions of sleeping regrets in multi-armed bandits. We …

Optimal and efficient dynamic regret algorithms for non-stationary dueling bandits

A Saha, S Gupta - International Conference on Machine …, 2022 - proceedings.mlr.press
We study the problem of dynamic regret minimization in $K$-armed Dueling Bandits under
non-stationary or time-varying preferences. This is an online learning setup where the agent …