Meta-Thompson sampling

B Kveton, M Konobeev, M Zaheer… - International …, 2021 - proceedings.mlr.press
Efficient exploration in bandits is a fundamental online learning problem. We propose a
variant of Thompson sampling that learns to explore better as it interacts with bandit …
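As background for this entry, the following is a minimal Bernoulli Thompson sampling sketch with Beta(1, 1) priors; it is a generic illustration of the baseline algorithm, not the meta-learning variant proposed in the paper, and all names are illustrative:

```python
import random

def thompson_sampling(arms, n_rounds, seed=0):
    """Bernoulli Thompson sampling with Beta(1, 1) priors on each arm.

    `arms` holds the true success probabilities (unknown to the learner);
    returns the total reward collected over `n_rounds` pulls.
    """
    rng = random.Random(seed)
    k = len(arms)
    alpha = [1.0] * k  # posterior successes + 1
    beta = [1.0] * k   # posterior failures + 1
    total = 0
    for _ in range(n_rounds):
        # Sample a mean estimate from each arm's Beta posterior, pull the best.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        i = max(range(k), key=samples.__getitem__)
        reward = 1 if rng.random() < arms[i] else 0
        alpha[i] += reward
        beta[i] += 1 - reward
        total += reward
    return total
```

Over many rounds the posterior samples concentrate on the best arm, so exploration tapers off automatically; the meta-learning work above asks how to improve this behavior across a sequence of related tasks.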

No regrets for learning the prior in bandits

S Basu, B Kveton, M Zaheer… - Advances in neural …, 2021 - proceedings.neurips.cc
We propose AdaTS, a Thompson sampling algorithm that adapts sequentially to
bandit tasks that it interacts with. The key idea in AdaTS is to adapt to an unknown task prior …
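Learning an unknown task prior can be illustrated, under simplifying assumptions, by moment-matching a Beta prior to the empirical mean rewards observed across past tasks; this is a hypothetical stand-in for illustration, not the AdaTS procedure itself:

```python
def fit_beta_prior(task_means):
    """Method-of-moments fit of a Beta(a, b) prior to per-task mean rewards.

    Assumes 0 < variance < m * (1 - m), which holds for non-degenerate
    samples of means in (0, 1).
    """
    n = len(task_means)
    m = sum(task_means) / n
    v = sum((x - m) ** 2 for x in task_means) / n
    # Solve the Beta mean/variance equations for (a, b).
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common
```

A fitted prior like this could then seed the per-arm posteriors of a Thompson sampling agent on the next task, rather than starting from an uninformative Beta(1, 1).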

Differentiable meta-learning of bandit policies

C Boutilier, C Hsu, B Kveton… - Advances in …, 2020 - proceedings.neurips.cc
Exploration policies in Bayesian bandits maximize the average reward over problem
instances drawn from some distribution P. In this work, we learn such policies for an …

Meta-learning for simple regret minimization

J Azizi, B Kveton, M Ghavamzadeh… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
We develop a meta-learning framework for simple regret minimization in bandits. In this
framework, a learning agent interacts with a sequence of bandit tasks, which are sampled iid …

Restless and uncertain: Robust policies for restless bandits via deep multi-agent reinforcement learning

JA Killian, L Xu, A Biswas… - Uncertainty in Artificial …, 2022 - proceedings.mlr.press
We introduce robustness in restless multi-armed bandits (RMABs), a popular model
for constrained resource allocation among independent stochastic processes (arms). Nearly …

AExGym: Benchmarks and Environments for Adaptive Experimentation

J Wang, E Che, DR Jiang, H Namkoong - arXiv preprint arXiv:2408.04531, 2024 - arxiv.org
Innovations across science and industry are evaluated using randomized trials (aka A/B
tests). While simple and robust, such static designs are inefficient or infeasible for testing …

Adaptive Experimentation at Scale: A Computational Framework for Flexible Batches

E Che, H Namkoong - arXiv preprint arXiv:2303.11582, 2023 - arxiv.org
Standard bandit algorithms that assume continual reallocation of measurement effort are
challenging to implement due to delayed feedback and infrastructural/organizational …

Meta-learning bandit policies by gradient ascent

B Kveton, M Mladenov, CW Hsu, M Zaheer… - arXiv preprint arXiv …, 2020 - arxiv.org
Most bandit policies are designed to either minimize regret in any problem instance, making
very few assumptions about the underlying environment, or in a Bayesian sense, assuming …
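The idea of tuning a bandit policy by gradient ascent on its average reward over sampled problem instances can be sketched with a simple finite-difference gradient on the exploration rate of epsilon-greedy; this is a toy illustration under assumed names and hyperparameters, not the paper's differentiable estimator:

```python
import random

def run_eps_greedy(eps, task, n_rounds, rng):
    """Average reward of epsilon-greedy on one Bernoulli bandit task."""
    counts = [0] * len(task)
    values = [0.0] * len(task)
    total = 0.0
    for _ in range(n_rounds):
        if rng.random() < eps:
            i = rng.randrange(len(task))  # explore uniformly
        else:
            i = max(range(len(task)), key=values.__getitem__)  # exploit
        r = 1.0 if rng.random() < task[i] else 0.0
        counts[i] += 1
        values[i] += (r - values[i]) / counts[i]  # running mean update
        total += r
    return total / n_rounds

def meta_tune_eps(task_dist, n_steps=50, lr=0.2, delta=0.05, seed=0):
    """Tune epsilon by stochastic finite-difference gradient ascent on the
    average reward over tasks drawn from `task_dist` (a callable that maps
    an rng to a list of arm means)."""
    rng = random.Random(seed)
    eps = 0.5
    for _ in range(n_steps):
        task = task_dist(rng)
        up = run_eps_greedy(min(eps + delta, 1.0), task, 200, rng)
        down = run_eps_greedy(max(eps - delta, 0.0), task, 200, rng)
        grad = (up - down) / (2 * delta)  # noisy one-sample gradient estimate
        eps = min(max(eps + lr * grad, 0.01), 0.99)
    return eps
```

The papers above replace this crude finite-difference estimate with differentiable or policy-gradient-style estimators, but the objective, average reward over the task distribution, is the same.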

Improving Thompson Sampling via Information Relaxation for Budgeted Multi-armed Bandits

W Jeong, S Min - arXiv preprint arXiv:2408.15535, 2024 - arxiv.org
We consider a Bayesian budgeted multi-armed bandit problem, in which each arm
consumes a different amount of resources when selected and there is a budget constraint on …
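A budgeted Bernoulli bandit can be sketched with a common heuristic: pull the arm with the highest posterior-sampled reward-to-cost ratio until the budget runs out. This is a generic sketch of budgeted Thompson sampling, not the information-relaxation method the paper proposes, and all names are illustrative:

```python
import random

def budgeted_thompson_sampling(means, costs, budget, seed=0):
    """Budgeted Bernoulli bandit: pulling arm i costs costs[i]; play until
    no affordable arm remains, choosing the arm with the highest sampled
    reward-to-cost ratio among those still within budget."""
    rng = random.Random(seed)
    k = len(means)
    alpha = [1.0] * k
    beta = [1.0] * k
    total = 0
    while budget >= min(costs):
        ratios = [rng.betavariate(alpha[i], beta[i]) / costs[i] for i in range(k)]
        feasible = [i for i in range(k) if costs[i] <= budget]
        i = max(feasible, key=ratios.__getitem__)
        r = 1 if rng.random() < means[i] else 0
        alpha[i] += r
        beta[i] += 1 - r
        budget -= costs[i]
        total += r
    return total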

Advertising Media and Target Audience Optimization via High-dimensional Bandits

W Ba, JM Harrison, HS Nair - arXiv preprint arXiv:2209.08403, 2022 - arxiv.org
We present a data-driven algorithm that advertisers can use to automate their digital ad-
campaigns at online publishers. The algorithm enables the advertiser to search across …