Mostly exploration-free algorithms for contextual bandits

H Bastani, M Bayati, K Khosravi - Management Science, 2021 - pubsonline.informs.org
The contextual bandit literature has traditionally focused on algorithms that address the
exploration–exploitation tradeoff. In particular, greedy algorithms that exploit current …

Rate-optimal Bayesian simple regret in best arm identification

J Komiyama, K Ariu, M Kato… - Mathematics of Operations …, 2024 - pubsonline.informs.org
We consider best arm identification in the multiarmed bandit problem. Assuming certain
continuity conditions of the prior, we characterize the rate of the Bayesian simple regret …

Finding the optimal exploration-exploitation trade-off online through Bayesian risk estimation and minimization

S Jamieson, JP How, Y Girdhar - Artificial Intelligence, 2024 - Elsevier
We propose endogenous Bayesian risk minimization (EBRM) over policy sets as an
approach to online learning across a wide range of settings. Many real-world online learning …

Information-Directed Sampling - Frequentist Analysis and Applications

J Kirschner - 2021 - research-collection.ethz.ch
Sequential decision-making is an iterative process between a learning agent and an
environment. We study the stochastic setting, where the learner chooses an action in each …

Inference of a Firm's Learning Process from Product Launches

LB Ano, V Martinez-de-Albeniz - 2023 - papers.ssrn.com
In dynamic business environments, firms must make sequential decisions that account for
changes in consumer interests. As consumer interests gradually evolve, firms need to be …

NFSP-PLT: Solving games with a weighted NFSP-PER-based method

H Li, S Qi, J Zhang, D Zhang, L Yao, X Wang, Q Li… - Electronics, 2023 - mdpi.com
A Nash equilibrium strategy is a typical goal when solving two-player imperfect-information
games (IIGs). Neural fictitious self-play (NFSP) is a popular method to find the Nash …

Asymptotic Randomised Control with applications to bandits

SN Cohen, T Treetanthiploet - arXiv preprint arXiv:2010.07252, 2020 - arxiv.org
We consider a general multi-armed bandit problem with correlated (and simple contextual
and restless) elements, as a relaxed control problem. By introducing an entropy …

Correlated bandits for dynamic pricing via the ARC algorithm

SN Cohen, T Treetanthiploet - arXiv preprint arXiv:2102.04263, 2021 - academia.edu
The Asymptotic Randomised Control (ARC) algorithm provides a rigorous
approximation to the optimal strategy for a wide class of Bayesian bandits, while retaining …

On adaptivity and confounding in contextual bandit experiments

C Qin, D Russo - NeurIPS 2021 Workshop on Distribution Shifts …, 2021 - openreview.net
Multi-armed bandit algorithms minimize experimentation costs required to converge on
optimal behavior. They do so by rapidly adapting experimentation effort away from poorly …

Dynamic mean field programming

G Stamatescu - arXiv preprint arXiv:2206.05200, 2022 - arxiv.org
A dynamic mean field theory is developed for finite state and action Bayesian reinforcement
learning in the large state space limit. In an analogy with statistical physics, the Bellman …