A survey of reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning
(RL) that learns from human feedback instead of relying on an engineered reward function …
Tight regret bounds for single-pass streaming multi-armed bandits
C Wang - International Conference on Machine Learning, 2023 - proceedings.mlr.press
Regret minimization in streaming multi-armed bandits (MABs) has been studied extensively,
and recent work has shown that algorithms with $o(K)$ memory have to incur $\Omega …
Power constrained bandits
Contextual bandits often provide simple and effective personalization in decision making
problems, making them popular tools to deliver personalized interventions in mobile health …
Fast and regret optimal best arm identification: fundamental limits and low-complexity algorithms
This paper considers a stochastic Multi-Armed Bandit (MAB) problem with dual objectives: (i)
quick identification and commitment to the optimal arm, and (ii) reward maximization …
Adaptive experimental design and counterfactual inference
Adaptive experimental design methods are increasingly being used in industry as a tool to
boost testing throughput or reduce experimentation cost relative to traditional A/B/N testing …
Online causal inference for advertising in real-time bidding auctions
Real-time bidding systems, which utilize auctions to allocate user impressions to competing
advertisers, continue to enjoy success in digital advertising. Assessing the effectiveness of …
Achieving the Pareto frontier of regret minimization and best arm identification in multi-armed bandits
We study the Pareto frontier of two archetypal objectives in multi-armed bandits, namely,
regret minimization (RM) and best arm identification (BAI) with a fixed horizon. It is folklore …
Optimizing Adaptive Experiments: A Unified Approach to Regret Minimization and Best-Arm Identification
Practitioners conducting adaptive experiments often encounter two competing priorities:
reducing the cost of experimentation by effectively assigning treatments during the …
Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis
In this paper, we study reinforcement learning from human feedback (RLHF) under an
episodic Markov decision process with a general trajectory-wise reward model. We …
Offline Contextual Bandit: Theory and Large Scale Applications
O Sakhi - 2023 - theses.hal.science
This thesis presents contributions to the problem of learning from logged interactions using
the offline contextual bandit framework. We are interested in two related topics: (1) offline …