Meta-Thompson sampling
Efficient exploration in bandits is a fundamental online learning problem. We propose a
variant of Thompson sampling that learns to explore better as it interacts with bandit …
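A minimal sketch of the idea, assuming Bernoulli arms and a Beta task prior: Thompson sampling runs as usual within each task, and across tasks the agent refines its estimate of the unknown prior. The moment-matching meta-update below is a crude stand-in for the paper's meta-posterior, and all names and constants are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_task(true_means, a, b, horizon=200):
    """Beta-Bernoulli Thompson sampling on one task under prior Beta(a, b)."""
    K = len(true_means)
    succ, fail = np.zeros(K), np.zeros(K)
    for _ in range(horizon):
        theta = rng.beta(a + succ, b + fail)   # one posterior sample per arm
        arm = int(np.argmax(theta))
        r = float(rng.random() < true_means[arm])
        succ[arm] += r
        fail[arm] += 1.0 - r
    return succ, fail

# Tasks arrive i.i.d. from an unknown Beta(a*, b*) prior over arm means.
a_star, b_star = 4.0, 2.0          # unknown to the agent (hypothetical values)
a_hat, b_hat = 1.0, 1.0            # agent's current prior estimate
K, n_tasks = 5, 30
for _ in range(n_tasks):
    means = rng.beta(a_star, b_star, size=K)       # sample a fresh bandit task
    succ, fail = thompson_task(means, a_hat, b_hat)
    # Crude meta-update: moment-match a Beta to the task's posterior arm means.
    p = (succ + 1.0) / (succ + fail + 2.0)
    m, v = p.mean(), max(float(p.var()), 1e-3)
    c = m * (1.0 - m) / v - 1.0
    a_hat, b_hat = max(m * c, 0.5), max((1.0 - m) * c, 0.5)
```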
No regrets for learning the prior in bandits
We propose AdaTS, a Thompson sampling algorithm that adapts sequentially to
bandit tasks that it interacts with. The key idea in AdaTS is to adapt to an unknown task prior …
Differentiable meta-learning of bandit policies
C Boutilier, C Hsu, B Kveton… - Advances in …, 2020 - proceedings.neurips.cc
Exploration policies in Bayesian bandits maximize the average reward over problem
instances drawn from some distribution P. In this work, we learn such policies for an …
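The snippet describes learning an exploration policy by optimizing average reward over instances drawn from P. A generic score-function (REINFORCE) sketch of that loop, with a single learnable softmax temperature standing in for the paper's richer policy class (the policy form, step sizes, and stand-in prior are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def episode(theta, means, horizon=100):
    """Softmax-over-empirical-means policy with inverse temperature theta.
    Returns total reward and the summed score function d/dtheta log pi(a_t)."""
    K = len(means)
    n, s = np.ones(K), np.full(K, 0.5)        # smoothed pulls / successes
    total, score = 0.0, 0.0
    for _ in range(horizon):
        mu_hat = s / n
        logits = theta * mu_hat
        p = np.exp(logits - logits.max()); p /= p.sum()
        arm = rng.choice(K, p=p)
        r = float(rng.random() < means[arm])
        score += mu_hat[arm] - p @ mu_hat     # softmax score w.r.t. theta
        s[arm] += r; n[arm] += 1.0
        total += r
    return total, score

theta, lr = 1.0, 1e-3
for _ in range(500):                   # gradient ascent on E[reward] over P
    means = rng.beta(1.0, 1.0, size=5) # instance drawn from a stand-in prior P
    total, score = episode(theta, means)
    theta += lr * total * score        # REINFORCE estimate (no baseline)
```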
Meta-learning for simple regret minimization
We develop a meta-learning framework for simple regret minimization in bandits. In this
framework, a learning agent interacts with a sequence of bandit tasks, which are sampled iid …
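For reference, simple regret (in our notation) measures the quality of the single arm recommended after exploration, rather than the cumulative losses incurred along the way:

```latex
% Simple regret of the arm \hat{J}_n recommended after n exploration rounds:
r_n \;=\; \mu_* - \mathbb{E}\!\left[\mu_{\hat{J}_n}\right],
\qquad \mu_* = \max_{1 \le i \le K} \mu_i .
```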
Restless and uncertain: Robust policies for restless bandits via deep multi-agent reinforcement learning
We introduce robustness in restless multi-armed bandits (RMABs), a popular model
for constrained resource allocation among independent stochastic processes (arms). Nearly …
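To pin down the model being made robust: in an RMAB, every arm is an independent Markov chain that transitions each round whether or not it is pulled, and the planner may activate at most k arms per round. A minimal simulator with hypothetical two-state dynamics and a placeholder random policy:

```python
import numpy as np

rng = np.random.default_rng(4)

# N two-state arms; each transitions every round, pulled or not; at most k pulls.
N, k, horizon = 8, 2, 50
P_passive = np.array([[0.9, 0.1], [0.3, 0.7]])  # hypothetical passive dynamics
P_active  = np.array([[0.4, 0.6], [0.1, 0.9]])  # hypothetical active dynamics
state = rng.integers(0, 2, size=N)              # reward = 1 while in state 1

total = 0
for _ in range(horizon):
    pulled = rng.choice(N, size=k, replace=False)   # placeholder policy: random
    total += state.sum()                            # collect reward from all arms
    for a in range(N):
        P = P_active if a in pulled else P_passive
        state[a] = rng.choice(2, p=P[state[a]])     # arm evolves regardless
```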
AExGym: Benchmarks and Environments for Adaptive Experimentation
Innovations across science and industry are evaluated using randomized trials (aka A/B
tests). While simple and robust, such static designs are inefficient or infeasible for testing …
Adaptive Experimentation at Scale: A Computational Framework for Flexible Batches
E Che, H Namkoong - arXiv preprint arXiv:2303.11582, 2023 - arxiv.org
Standard bandit algorithms that assume continual reallocation of measurement effort are
challenging to implement due to delayed feedback and infrastructural/organizational …
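The constraint the paper works around is concrete: feedback is delayed, so allocation must be committed a whole batch at a time rather than after every observation. A minimal batched Thompson sampling sketch under that constraint (arm count, batch shape, and conversion rates are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

K, batches, batch_size = 4, 10, 250      # hypothetical experiment shape
true_means = rng.uniform(0.02, 0.08, K)  # e.g., conversion rates
succ, fail = np.ones(K), np.ones(K)      # Beta(1, 1) posterior per arm

for _ in range(batches):
    # Decide the whole batch up front from the current posterior:
    # one posterior draw per unit, tallied into an allocation.
    draws = rng.beta(succ[:, None], fail[:, None], size=(K, batch_size))
    alloc = np.bincount(draws.argmax(axis=0), minlength=K)
    # Feedback arrives only after the batch completes (delayed feedback).
    wins = rng.binomial(alloc, true_means)
    succ += wins
    fail += alloc - wins
```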
Meta-learning bandit policies by gradient ascent
Most bandit policies are designed to either minimize regret in any problem instance, making
very few assumptions about the underlying environment, or in a Bayesian sense, assuming …
Improving Thompson Sampling via Information Relaxation for Budgeted Multi-armed Bandits
We consider a Bayesian budgeted multi-armed bandit problem, in which each arm
consumes a different amount of resources when selected and there is a budget constraint on …
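As a baseline for the setting the paper improves on, here is plain Thompson sampling for a budgeted bandit, pulling the feasible arm with the best sampled reward-to-cost ratio (costs are taken as known and deterministic for simplicity; all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

K = 4
true_means = rng.uniform(0.2, 0.8, K)   # hypothetical Bernoulli reward means
costs = rng.uniform(0.5, 2.0, K)        # known per-pull resource consumption
budget = 200.0
succ, fail = np.ones(K), np.ones(K)     # Beta(1, 1) posteriors

total_reward = 0.0
while True:
    feasible = costs <= budget          # arms we can still afford
    if not feasible.any():
        break
    theta = rng.beta(succ, fail)        # posterior sample per arm
    ratio = np.where(feasible, theta / costs, -np.inf)
    arm = int(np.argmax(ratio))         # best sampled reward per unit cost
    r = float(rng.random() < true_means[arm])
    succ[arm] += r; fail[arm] += 1.0 - r
    budget -= costs[arm]
    total_reward += r
```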
Advertising Media and Target Audience Optimization via High-dimensional Bandits
W Ba, JM Harrison, HS Nair - arXiv preprint arXiv:2209.08403, 2022 - arxiv.org
We present a data-driven algorithm that advertisers can use to automate their digital ad
campaigns at online publishers. The algorithm enables the advertiser to search across …