Bandit Algorithms for Policy Learning: Methods, Implementation, and Welfare-performance

T Kitagawa, J Rowley - arXiv preprint arXiv:2409.00379, 2024 - arxiv.org
Static supervised learning, in which experimental data serves as a training sample for the
estimation of an optimal treatment assignment policy, is a commonly assumed framework of …

Functional sequential treatment allocation

AB Kock, D Preinerstorfer, B Veliyev - Journal of the American …, 2022 - Taylor & Francis
Consider a setting in which a policy maker assigns subjects to treatments, observing each
outcome before the next subject arrives. Initially, it is unknown which treatment is best, but …

[PDF][PDF] policytree: Policy learning via doubly robust empirical welfare maximization over trees

E Sverdrup, A Kanodia, Z Zhou, S Athey… - Journal of Open Source …, 2020 - joss.theoj.org
The problem of learning treatment assignment policies from randomized or observational
data arises in many fields. For example, in personalized medicine, we seek to map patient …

Decision making with inference and learning methods

MW Hoffman - 2013 - open.library.ubc.ca
In this work we consider probabilistic approaches to sequential decision making. The
ultimate goal is to provide methods by which decision making problems can be attacked by …

Faster rates for policy learning

A Luedtke, A Chambaz - arXiv preprint arXiv:1704.06431, 2017 - arxiv.org
This article improves the existing proven rates of regret decay in optimal policy estimation.
We give a margin-free result showing that the regret decay for estimating a within-class …

Estimation considerations in contextual bandits

M Dimakopoulou, Z Zhou, S Athey… - arXiv preprint arXiv …, 2017 - arxiv.org
Contextual bandit algorithms are sensitive to the estimation method of the outcome model as
well as the exploration method used, particularly in the presence of rich heterogeneity or …

Learning to optimize via posterior sampling

D Russo, B Van Roy - Mathematics of Operations Research, 2014 - pubsonline.informs.org
This paper considers the use of a simple posterior sampling algorithm to balance between
exploration and exploitation when learning to optimize actions such as in multiarmed bandit …
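The posterior-sampling algorithm that Russo and Van Roy analyze is widely known as Thompson sampling. A minimal sketch for the Beta-Bernoulli bandit case follows; the function name, the arm rewards, and the fixed horizon are illustrative choices, not details taken from the paper:

```python
import random

def thompson_sampling(true_probs, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling (illustrative sketch).

    Each round, draw a mean for every arm from its Beta posterior and
    pull the arm with the largest draw; exploration tapers off naturally
    as the posteriors concentrate. Returns the pull count per arm.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    successes = [0] * k  # arm a has posterior Beta(1 + successes, 1 + failures)
    failures = [0] * k
    pulls = [0] * k
    for _ in range(horizon):
        samples = [rng.betavariate(1 + successes[a], 1 + failures[a])
                   for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

On a two-arm instance with a large gap (e.g. success probabilities 0.2 and 0.8), the pull counts concentrate on the better arm over a moderate horizon.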

Distributionally robust batch contextual bandits

N Si, F Zhang, Z Zhou, J Blanchet - Management Science, 2023 - pubsonline.informs.org
Policy learning using historical observational data is an important problem that has
widespread applications. Examples include selecting offers, prices, or advertisements for …

Beyond variance reduction: Understanding the true impact of baselines on policy optimization

W Chung, V Thomas, MC Machado… - … on Machine Learning, 2021 - proceedings.mlr.press
Bandit and reinforcement learning (RL) problems can often be framed as optimization
problems where the goal is to maximize average performance while having access only to …
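The baseline question that Chung et al. study can be seen in miniature on a two-armed bandit with a softmax policy: subtracting a constant baseline from the reward leaves the score-function gradient estimator unbiased but can change its variance. The toy below is a hedged illustration with hypothetical names and deterministic arm rewards of 0 and 1, not the paper's experimental setup:

```python
import math
import random

def grad_estimates(theta, baseline, n, seed=0):
    """Score-function (REINFORCE-style) gradient samples for a 2-armed
    bandit. The policy plays arm 1 with probability sigmoid(theta);
    rewards are deterministic: arm 0 -> 0.0, arm 1 -> 1.0 (toy assumption).
    """
    rng = random.Random(seed)
    p1 = 1.0 / (1.0 + math.exp(-theta))
    rewards = [0.0, 1.0]
    grads = []
    for _ in range(n):
        a = 1 if rng.random() < p1 else 0
        # d/dtheta log pi(a) = (1 - p1) if a == 1 else -p1
        score = (1 - p1) if a == 1 else -p1
        grads.append((rewards[a] - baseline) * score)
    return grads

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)
```

With theta = 0, a baseline of 0.5 collapses the per-sample variance of this estimator while leaving its mean unchanged, which is the classical variance-reduction story the paper then looks beyond.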

Bandits with unobserved confounders: A causal approach

E Bareinboim, A Forney, J Pearl - Advances in Neural …, 2015 - proceedings.neurips.cc
The Multi-Armed Bandit problem constitutes an archetypal setting for sequential
decision-making, permeating multiple domains including engineering, business, and …