A minimaximalist approach to reinforcement learning from human feedback
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement
learning from human feedback. Our approach is minimalist in that it does not require training …
Efficient and optimal algorithms for contextual dueling bandits under realizability
A Saha, A Krishnamurthy - International Conference on …, 2022 - proceedings.mlr.press
We study the $K$-armed contextual dueling bandit problem, a sequential decision making
setting in which the learner uses contextual information to make two decisions, but only …
Versatile dueling bandits: Best-of-both world analyses for learning from relative preferences
A Saha, P Gaillard - International Conference on Machine …, 2022 - proceedings.mlr.press
We study the problem of $K$-armed dueling bandit for both stochastic and adversarial
environments, where the goal of the learner is to aggregate information through relative …
Variance-aware regret bounds for stochastic contextual dueling bandits
Dueling bandits is a prominent framework for decision-making involving preferential
feedback, a valuable feature that fits various applications involving human interaction, such …
Think Before You Duel: Understanding Complexities of Preference Learning under Constrained Resources
We consider the problem of reward maximization in the dueling bandit setup along with
constraints on resource consumption. As in the classic dueling bandits, at each round the …
Borda regret minimization for generalized linear dueling bandits
Dueling bandits are widely used to model preferential feedback prevalent in many
applications such as recommendation systems and ranking. In this paper, we study the …
Nearly optimal algorithms for contextual dueling bandits from adversarial feedback
Learning from human feedback plays an important role in aligning generative models, such
as large language models (LLMs). However, the effectiveness of this approach can be …
Anaconda: An improved dynamic regret algorithm for adaptive non-stationary dueling bandits
TK Buening, A Saha - International Conference on Artificial …, 2023 - proceedings.mlr.press
We study the problem of non-stationary dueling bandits and provide the first adaptive
dynamic regret algorithm for this problem. The only two existing attempts in this line of work …
One arrow, two kills: A unified framework for achieving optimal regret guarantees in sleeping bandits
We address the problem of Internal Regret in adversarial Sleeping Bandits and the
relationship between different notions of sleeping regrets in multi-armed bandits. We …
Optimal and efficient dynamic regret algorithms for non-stationary dueling bandits
We study the problem of dynamic regret minimization in $K$-armed Dueling Bandits under
non-stationary or time-varying preferences. This is an online learning setup where the agent …
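Most of the entries above study variants of the $K$-armed dueling bandit protocol, in which the learner proposes a pair of arms each round and observes only a noisy binary preference between them, never a numeric reward. As an illustrative aside (not drawn from any of the cited papers; all names below are hypothetical), a minimal Python sketch of that interaction loop, assuming a fixed stochastic preference matrix and a uniformly random baseline learner, could look like this:

    # Sketch of the K-armed dueling bandit interaction protocol described in the
    # entries above (hypothetical names; not any specific paper's algorithm).
    import numpy as np

    def simulate_dueling_bandit(preference_matrix, select_pair, T, rng):
        """Run T rounds; preference_matrix[i, j] = P(arm i beats arm j)."""
        history = []
        for t in range(T):
            i, j = select_pair(history)                      # learner proposes a duel (i, j)
            i_wins = rng.random() < preference_matrix[i, j]  # only relative feedback is observed
            history.append((i, j, bool(i_wins)))
        return history

    # Example: 3 arms, arm 0 is the Condorcet winner; baseline picks pairs uniformly at random.
    P = np.array([[0.5, 0.7, 0.8],
                  [0.3, 0.5, 0.6],
                  [0.2, 0.4, 0.5]])
    rng = np.random.default_rng(0)
    uniform_learner = lambda history: tuple(rng.choice(3, size=2, replace=False))
    outcomes = simulate_dueling_bandit(P, uniform_learner, T=1000, rng=rng)
    wins_for_arm0 = ([w for i, j, w in outcomes if i == 0]
                     + [not w for i, j, w in outcomes if j == 0])
    print(f"Arm 0 won {np.mean(wins_for_arm0):.2f} of its duels")

The contextual, non-stationary, constrained, and Borda-regret settings in the papers above modify this basic loop (e.g., by adding a context each round or letting the preference matrix drift over time) rather than replacing it.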