Bandit Algorithms for Policy Learning: Methods, Implementation, and Welfare-performance

T Kitagawa, J Rowley - arXiv preprint arXiv:2409.00379, 2024 - arxiv.org
Static supervised learning, in which experimental data serves as a training sample for the
estimation of an optimal treatment assignment policy, is a commonly assumed framework of …

Functional sequential treatment allocation

AB Kock, D Preinerstorfer, B Veliyev - Journal of the American …, 2022 - Taylor & Francis
Consider a setting in which a policy maker assigns subjects to treatments, observing each
outcome before the next subject arrives. Initially, it is unknown which treatment is best, but …

[PDF][PDF] policytree: Policy learning via doubly robust empirical welfare maximization over trees

E Sverdrup, A Kanodia, Z Zhou, S Athey… - Journal of Open Source …, 2020 - joss.theoj.org
The problem of learning treatment assignment policies from randomized or observational
data arises in many fields. For example, in personalized medicine, we seek to map patient …

Decision making with inference and learning methods

MW Hoffman - 2013 - open.library.ubc.ca
In this work we consider probabilistic approaches to sequential decision making. The
ultimate goal is to provide methods by which decision making problems can be attacked by …

Faster rates for policy learning

A Luedtke, A Chambaz - arXiv preprint arXiv:1704.06431, 2017 - arxiv.org
This article improves the existing proven rates of regret decay in optimal policy estimation.
We give a margin-free result showing that the regret decay for estimating a within-class …

Estimation considerations in contextual bandits

M Dimakopoulou, Z Zhou, S Athey… - arXiv preprint arXiv …, 2017 - arxiv.org
Contextual bandit algorithms are sensitive to the estimation method of the outcome model as
well as the exploration method used, particularly in the presence of rich heterogeneity or …

Learning to optimize via posterior sampling

D Russo, B Van Roy - Mathematics of Operations Research, 2014 - pubsonline.informs.org
This paper considers the use of a simple posterior sampling algorithm to balance between
exploration and exploitation when learning to optimize actions such as in multiarmed bandit …
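The posterior-sampling algorithm that Russo and Van Roy analyze is widely known as Thompson sampling. A minimal sketch for the Beta-Bernoulli bandit case follows; the function name, the arm rewards, and the fixed horizon are illustrative choices, not details taken from the paper:

```python
import random

def thompson_sampling(true_probs, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling (illustrative sketch).

    Each round, draw a mean for every arm from its Beta posterior and
    pull the arm with the largest draw; exploration tapers off naturally
    as the posteriors concentrate. Returns the pull count per arm.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    successes = [0] * k  # arm a has posterior Beta(1 + successes, 1 + failures)
    failures = [0] * k
    pulls = [0] * k
    for _ in range(horizon):
        samples = [rng.betavariate(1 + successes[a], 1 + failures[a])
                   for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

On a two-arm instance with a large gap (e.g. success probabilities 0.2 and 0.8), the pull counts concentrate on the better arm over a moderate horizon.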

Distributionally robust batch contextual bandits

N Si, F Zhang, Z Zhou, J Blanchet - Management Science, 2023 - pubsonline.informs.org
Policy learning using historical observational data is an important problem that has
widespread applications. Examples include selecting offers, prices, or advertisements for …

Beyond variance reduction: Understanding the true impact of baselines on policy optimization

W Chung, V Thomas, MC Machado… - … on Machine Learning, 2021 - proceedings.mlr.press
Bandit and reinforcement learning (RL) problems can often be framed as optimization
problems where the goal is to maximize average performance while having access only to …
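The baseline question that Chung et al. study can be seen in miniature on a two-armed bandit with a softmax policy: subtracting a constant baseline from the reward leaves the score-function gradient estimator unbiased but can change its variance. The toy below is a hedged illustration with hypothetical names and deterministic arm rewards of 0 and 1, not the paper's experimental setup:

```python
import math
import random

def grad_estimates(theta, baseline, n, seed=0):
    """Score-function (REINFORCE-style) gradient samples for a 2-armed
    bandit. The policy plays arm 1 with probability sigmoid(theta);
    rewards are deterministic: arm 0 -> 0.0, arm 1 -> 1.0 (toy assumption).
    """
    rng = random.Random(seed)
    p1 = 1.0 / (1.0 + math.exp(-theta))
    rewards = [0.0, 1.0]
    grads = []
    for _ in range(n):
        a = 1 if rng.random() < p1 else 0
        # d/dtheta log pi(a) = (1 - p1) if a == 1 else -p1
        score = (1 - p1) if a == 1 else -p1
        grads.append((rewards[a] - baseline) * score)
    return grads

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)
```

With theta = 0, a baseline of 0.5 collapses the per-sample variance of this estimator while leaving its mean unchanged, which is the classical variance-reduction story the paper then looks beyond.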

Bandits with unobserved confounders: A causal approach

E Bareinboim, A Forney, J Pearl - Advances in Neural …, 2015 - proceedings.neurips.cc
The Multi-Armed Bandit problem constitutes an archetypal setting for sequential
decision-making, permeating multiple domains including engineering, business, and …