An instrumental variable approach to confounded off-policy evaluation

Y Xu, J Zhu, C Shi, S Luo… - … Conference on Machine …, 2023 - proceedings.mlr.press
Off-policy evaluation (OPE) aims to estimate the return of a target policy using some
pre-collected observational data generated by a potentially different behavior policy. In many …

Off-policy evaluation for human feedback

Q Gao, G Gao, J Dong, V Tarokh… - Advances in Neural …, 2023 - proceedings.neurips.cc
Off-policy evaluation (OPE) is important for closing the gap between offline training and
evaluation of reinforcement learning (RL), by estimating performance and/or rank of target …

Distributional shift-aware off-policy interval estimation: A unified error quantification framework

W Zhou, Y Li, R Zhu, A Qu - arXiv preprint arXiv:2309.13278, 2023 - arxiv.org
We study high-confidence off-policy evaluation in the context of infinite-horizon Markov
decision processes, where the objective is to establish a confidence interval (CI) for the …

Sample complexity of nonparametric off-policy evaluation on low-dimensional manifolds using deep networks

X Ji, M Chen, M Wang, T Zhao - arXiv preprint arXiv:2206.02887, 2022 - arxiv.org
We consider the off-policy evaluation problem of reinforcement learning using deep
convolutional neural networks. We analyze the deep fitted Q-evaluation method for …

Policy-adaptive estimator selection for off-policy evaluation

T Udagawa, H Kiyohara, Y Narita, Y Saito… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Off-policy evaluation (OPE) aims to accurately evaluate the performance of counterfactual
policies using only offline logged data. Although many estimators have been developed …

Optimal treatment allocation for efficient policy evaluation in sequential decision making

T Li, C Shi, J Wang, F Zhou - Advances in Neural …, 2024 - proceedings.neurips.cc
A/B testing is critical for modern technological companies to evaluate the effectiveness of
newly developed products against standard baselines. This paper studies optimal designs …

Provable benefits of policy learning from human preferences in contextual bandit problems

X Ji, H Wang, M Chen, T Zhao, M Wang - arXiv preprint arXiv:2307.12975, 2023 - arxiv.org
A crucial task in decision-making problems is reward engineering. It is common in practice
that no obvious choice of reward function exists. Thus, a popular approach is to introduce …

A reinforcement learning framework for dynamic mediation analysis

L Ge, J Wang, C Shi, Z Wu… - … Conference on Machine …, 2023 - proceedings.mlr.press
Mediation analysis learns the causal effect transmitted via mediator variables between
treatments and outcomes, and receives increasing attention in various scientific domains to …

Development and validation of a reinforcement learning model for ventilation control during emergence from general anesthesia

H Lee, HK Yoon, J Kim, JS Park, CH Koo, D Won… - npj Digital …, 2023 - nature.com
Ventilation should be assisted without asynchrony or cardiorespiratory instability during
anesthesia emergence until sufficient spontaneous ventilation is recovered. In this …

Did we personalize? Assessing personalization by an online reinforcement learning algorithm using resampling

S Ghosh, R Kim, P Chhabria, R Dwivedi, P Klasnja… - Machine Learning, 2024 - Springer
There is a growing interest in using reinforcement learning (RL) to personalize sequences of
treatments in digital health to support users in adopting healthier behaviors. Such sequential …