A survey of reinforcement learning from human feedback

T Kaufmann, P Weng, V Bengs… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …

Tight regret bounds for single-pass streaming multi-armed bandits

C Wang - International Conference on Machine Learning, 2023 - proceedings.mlr.press
Regret minimization in streaming multi-armed bandits (MABs) has been studied extensively,
and recent work has shown that algorithms with $o(K)$ memory have to incur $\Omega …

Power constrained bandits

J Yao, E Brunskill, W Pan, S Murphy… - Machine Learning …, 2021 - proceedings.mlr.press
Contextual bandits often provide simple and effective personalization in decision making
problems, making them popular tools to deliver personalized interventions in mobile health …

Fast and regret optimal best arm identification: fundamental limits and low-complexity algorithms

Q Zhang, L Ying - Advances in Neural Information …, 2024 - proceedings.neurips.cc
This paper considers a stochastic Multi-Armed Bandit (MAB) problem with dual objectives: (i)
quick identification and commitment to the optimal arm, and (ii) reward maximization …

Adaptive experimental design and counterfactual inference

T Fiez, S Gamez, A Chen, H Nassif, L Jain - arXiv preprint arXiv …, 2022 - arxiv.org
Adaptive experimental design methods are increasingly being used in industry as a tool to
boost testing throughput or reduce experimentation cost relative to traditional A/B/N testing …

Online causal inference for advertising in real-time bidding auctions

C Waisman, HS Nair, C Carrion - Marketing Science, 2024 - pubsonline.informs.org
Real-time bidding systems, which utilize auctions to allocate user impressions to competing
advertisers, continue to enjoy success in digital advertising. Assessing the effectiveness of …

Achieving the Pareto frontier of regret minimization and best arm identification in multi-armed bandits

Z Zhong, WC Cheung, VYF Tan - arXiv preprint arXiv:2110.08627, 2021 - arxiv.org
We study the Pareto frontier of two archetypal objectives in multi-armed bandits, namely,
regret minimization (RM) and best arm identification (BAI) with a fixed horizon. It is folklore …

Optimizing Adaptive Experiments: A Unified Approach to Regret Minimization and Best-Arm Identification

C Qin, D Russo - arXiv preprint arXiv:2402.10592, 2024 - arxiv.org
Practitioners conducting adaptive experiments often encounter two competing priorities:
reducing the cost of experimentation by effectively assigning treatments during the …

Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis

Q Zhang, H Wei, L Ying - arXiv preprint arXiv:2406.07455, 2024 - arxiv.org
In this paper, we study reinforcement learning from human feedback (RLHF) under an
episodic Markov decision process with a general trajectory-wise reward model. We …

Offline Contextual Bandit: Theory and Large Scale Applications

O Sakhi - 2023 - theses.hal.science
This thesis presents contributions to the problem of learning from logged interactions using
the offline contextual bandit framework. We are interested in two related topics: (1) offline …