Bandit Algorithms for Policy Learning: Methods, Implementation, and Welfare-performance
T Kitagawa, J Rowley - arXiv preprint arXiv:2409.00379, 2024 - arxiv.org
Static supervised learning, in which experimental data serves as a training sample for the
estimation of an optimal treatment assignment policy, is a commonly assumed framework of …
Functional sequential treatment allocation
Consider a setting in which a policy maker assigns subjects to treatments, observing each
outcome before the next subject arrives. Initially, it is unknown which treatment is best, but …
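The setting described above, where the policy maker observes each subject's outcome before the next subject arrives, can be illustrated with a generic sequential assignment rule. The sketch below uses epsilon-greedy allocation, not the functional sequential procedure of the paper itself; the function name and the callable outcome model are illustrative assumptions.

```python
import random

def sequential_allocation(outcomes, n_subjects, epsilon=0.1, seed=0):
    """Assign each arriving subject to a treatment, updating arm estimates
    after every observed outcome (epsilon-greedy; illustrative only).

    outcomes: list of callables, outcomes[a](rng) -> observed outcome of arm a.
    """
    rng = random.Random(seed)
    k = len(outcomes)                 # number of treatments
    counts = [0] * k
    means = [0.0] * k
    history = []
    for _ in range(n_subjects):
        if rng.random() < epsilon or min(counts) == 0:
            arm = rng.randrange(k)    # explore (also until every arm is tried)
        else:
            arm = max(range(k), key=lambda a: means[a])  # exploit best estimate
        y = outcomes[arm](rng)        # outcome observed before the next subject
        counts[arm] += 1
        means[arm] += (y - means[arm]) / counts[arm]     # running mean update
        history.append((arm, y))
    return counts, means, history
```

With two Bernoulli outcome models, the rule concentrates assignments on the better treatment while still exploring occasionally.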
policytree: Policy learning via doubly robust empirical welfare maximization over trees
The problem of learning treatment assignment policies from randomized or observational
data arises in many fields. For example, in personalized medicine, we seek to map patient …
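The policytree package itself is an R library that searches over decision trees, but the two ingredients its title names can be sketched compactly: doubly robust (AIPW) scores, and empirical welfare maximization over a small policy class. The Python sketch below uses depth-1 threshold rules on a scalar covariate as a stand-in policy class; all names and data structures are assumptions, not the package's API.

```python
import numpy as np

def aipw_scores(y, w, mu0, mu1, e):
    """Doubly robust (AIPW) scores for a binary treatment w in {0, 1}.

    mu0, mu1: outcome-model predictions under control/treatment; e: propensities.
    Returns an (n, 2) array of per-unit scores for assigning arm 0 or arm 1."""
    g0 = mu0 + (1 - w) * (y - mu0) / (1 - e)
    g1 = mu1 + w * (y - mu1) / e
    return np.column_stack([g0, g1])

def best_depth1_policy(scores, x):
    """Empirical welfare maximization over depth-1 rules on a scalar covariate:
    treat iff x <= t (or, flipped, iff x > t), scanning observed thresholds."""
    best = (-np.inf, None)
    for t in np.unique(x):
        for flip in (False, True):
            treat = (x > t) if flip else (x <= t)
            welfare = np.where(treat, scores[:, 1], scores[:, 0]).mean()
            if welfare > best[0]:
                best = (welfare, (t, flip))
    return best
```

The exhaustive scan over thresholds mirrors, in miniature, the exact tree search the package performs over deeper trees and multiple covariates.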
Decision making with inference and learning methods
MW Hoffman - 2013 - open.library.ubc.ca
In this work we consider probabilistic approaches to sequential decision making. The
ultimate goal is to provide methods by which decision making problems can be attacked by …
Faster rates for policy learning
This article improves the existing proven rates of regret decay in optimal policy estimation.
We give a margin-free result showing that the regret decay for estimating a within-class …
Estimation considerations in contextual bandits
Contextual bandit algorithms are sensitive to the estimation method of the outcome model as
well as the exploration method used, particularly in the presence of rich heterogeneity or …
Learning to optimize via posterior sampling
This paper considers the use of a simple posterior sampling algorithm to balance between
exploration and exploitation when learning to optimize actions such as in multiarmed bandit …
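For the multiarmed bandit case the posterior sampling idea admits a very short sketch: with Bernoulli arms and Beta(1, 1) priors, sample a mean from each arm's posterior, pull the argmax, and update the posterior counts. This is the standard Thompson-sampling recipe under those assumptions, not code from the paper.

```python
import random

def thompson_bernoulli(reward_fns, horizon, seed=0):
    """Posterior (Thompson) sampling for Bernoulli arms with Beta(1, 1) priors.

    reward_fns: list of callables, reward_fns[a](rng) -> reward in {0, 1}."""
    rng = random.Random(seed)
    k = len(reward_fns)
    succ = [1.0] * k   # Beta alpha: prior + observed successes
    fail = [1.0] * k   # Beta beta:  prior + observed failures
    pulls = [0] * k
    for _ in range(horizon):
        # Sample one plausible mean per arm from its current posterior...
        samples = [rng.betavariate(succ[a], fail[a]) for a in range(k)]
        # ...and act greedily with respect to the sampled means.
        arm = max(range(k), key=lambda a: samples[a])
        r = reward_fns[arm](rng)
        succ[arm] += r
        fail[arm] += 1 - r
        pulls[arm] += 1
    return pulls, succ, fail
```

Exploration arises automatically: an under-sampled arm has a wide posterior, so its sampled mean is occasionally the largest even when its empirical mean is not.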
Distributionally robust batch contextual bandits
Policy learning using historical observational data is an important problem that has
widespread applications. Examples include selecting offers, prices, or advertisements for …
Beyond variance reduction: Understanding the true impact of baselines on policy optimization
Bandit and reinforcement learning (RL) problems can often be framed as optimization
problems where the goal is to maximize average performance while having access only to …
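The paper's point is that baselines affect policy optimization beyond variance reduction; the sketch below only demonstrates the classical fact it builds on: subtracting a baseline leaves the score-function gradient estimator unbiased while changing its variance. The two-armed parametrization and all names here are illustrative assumptions.

```python
import random
import statistics

def score_function_grads(p, r0, r1, b, n, seed=0):
    """Monte Carlo score-function estimates of d/dp E[R] for a two-armed
    bandit with P(arm 1) = p and deterministic rewards r0, r1.

    The true gradient is r1 - r0 for any baseline b; b only moves variance."""
    rng = random.Random(seed)
    grads = []
    for _ in range(n):
        a = 1 if rng.random() < p else 0
        r = r1 if a == 1 else r0
        # d log pi(a; p) / dp is 1/p for arm 1 and -1/(1-p) for arm 0
        score = 1.0 / p if a == 1 else -1.0 / (1.0 - p)
        grads.append((r - b) * score)
    return statistics.mean(grads), statistics.variance(grads)
```

With p = 0.5, r0 = 0, r1 = 1 and baseline b = 0.5, every sampled gradient equals exactly 1, so the variance collapses to zero while the mean stays at the true gradient.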
Bandits with unobserved confounders: A causal approach
The Multi-Armed Bandit problem constitutes an archetypal setting for sequential
decision-making, permeating multiple domains including engineering, business, and …