A review of off-policy evaluation in reinforcement learning

M Uehara, C Shi, N Kallus - arXiv preprint arXiv:2212.06355, 2022 - arxiv.org
Reinforcement learning (RL) is one of the most vibrant research frontiers in machine
learning and has been recently applied to solve a number of challenging problems. In this …

Statistical inference of the value function for reinforcement learning in infinite-horizon settings

C Shi, S Zhang, W Lu, R Song - Journal of the Royal Statistical …, 2022 - academic.oup.com
Reinforcement learning is a general technique that allows an agent to learn an optimal
policy and interact with an environment in sequential decision-making problems. The …

Optimal treatment regimes: a review and empirical comparison

Z Li, J Chen, E Laber, F Liu… - International Statistical …, 2023 - Wiley Online Library
A treatment regime is a sequence of decision rules, one per decision point, that maps
accumulated patient information to a recommended intervention. An optimal treatment …

Deeply-debiased off-policy interval estimation

C Shi, R Wan, V Chernozhukov… - … conference on machine …, 2021 - proceedings.mlr.press
Off-policy evaluation learns a target policy's value with a historical dataset generated by a
different behavior policy. In addition to a point estimate, many applications would benefit …

Estimating and improving dynamic treatment regimes with a time-varying instrumental variable

S Chen, B Zhang - Journal of the Royal Statistical Society Series …, 2023 - academic.oup.com
Estimating dynamic treatment regimes (DTRs) from retrospective observational data is
challenging as some degree of unmeasured confounding is often expected. In this work, we …

Transfer learning for contextual multi-armed bandits

C Cai, TT Cai, H Li - The Annals of Statistics, 2024 - projecteuclid.org
The Annals of Statistics 2024, Vol. 52, No. 1, 207–232
https://doi.org/10.1214/23-AOS2341 © Institute of Mathematical …

A multi-agent reinforcement learning framework for off-policy evaluation in two-sided markets

C Shi, R Wan, G Song, S Luo, R Song… - arXiv preprint arXiv …, 2022 - arxiv.org
The two-sided markets such as ride-sharing companies often involve a group of subjects
who are making sequential decisions across time and/or location. With the rapid …

Deep jump learning for off-policy evaluation in continuous treatment settings

H Cai, C Shi, R Song, W Lu - Advances in Neural …, 2021 - proceedings.neurips.cc
We consider off-policy evaluation (OPE) in continuous treatment settings, such as
personalized dose-finding. In OPE, one aims to estimate the mean outcome under a new …

Statistically efficient advantage learning for offline reinforcement learning in infinite horizons

C Shi, S Luo, Y Le, H Zhu, R Song - Journal of the American …, 2024 - Taylor & Francis
We consider reinforcement learning (RL) methods in offline domains without additional
online data collection, such as mobile health applications. Most of existing policy …

Evaluating dynamic conditional quantile treatment effects with applications in ridesharing

T Li, C Shi, Z Lu, Y Li, H Zhu - Journal of the American Statistical …, 2024 - Taylor & Francis
Many modern tech companies, such as Google, Uber, and Didi, use online experiments
(also known as A/B testing) to evaluate new policies against existing ones. While most …