A review of off-policy evaluation in reinforcement learning

M Uehara, C Shi, N Kallus - arXiv preprint arXiv:2212.06355, 2022 - arxiv.org
Reinforcement learning (RL) is one of the most vibrant research frontiers in machine
learning and has been recently applied to solve a number of challenging problems. In this …

A survey on causal reinforcement learning

Y Zeng, R Cai, F Sun, L Huang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
While reinforcement learning (RL) achieves tremendous success in sequential decision-
making problems in many domains, it still faces key challenges of data inefficiency and the …

A minimax learning approach to off-policy evaluation in confounded partially observable Markov decision processes

C Shi, M Uehara, J Huang… - … Conference on Machine …, 2022 - proceedings.mlr.press
We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes
(POMDPs), where the evaluation policy depends only on observable variables and the …

An instrumental variable approach to confounded off-policy evaluation

Y Xu, J Zhu, C Shi, S Luo… - … Conference on Machine …, 2023 - proceedings.mlr.press
Off-policy evaluation (OPE) aims to estimate the return of a target policy using some pre-
collected observational data generated by a potentially different behavior policy. In many …

Future-dependent value-based off-policy evaluation in POMDPs

M Uehara, H Kiyohara, A Bennett… - Advances in …, 2024 - proceedings.neurips.cc
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general
function approximation. Existing methods such as sequential importance sampling …

Offline reinforcement learning with instrumental variables in confounded Markov decision processes

Z Fu, Z Qi, Z Wang, Z Yang, Y Xu… - arXiv preprint arXiv …, 2022 - arxiv.org
We study offline reinforcement learning (RL) in the face of unmeasured confounders.
Due to the lack of online interaction with the environment, offline RL is facing the following …

Off-policy evaluation for episodic partially observable Markov decision processes under non-parametric models

R Miao, Z Qi, X Zhang - Advances in Neural Information …, 2022 - proceedings.neurips.cc
We study the problem of off-policy evaluation (OPE) for episodic Partially Observable
Markov Decision Processes (POMDPs) with continuous states. Motivated by the recently …

Optimal treatment allocation for efficient policy evaluation in sequential decision making

T Li, C Shi, J Wang, F Zhou - Advances in Neural …, 2024 - proceedings.neurips.cc
A/B testing is critical for modern technological companies to evaluate the effectiveness of
newly developed products against standard baselines. This paper studies optimal designs …

Estimating and improving dynamic treatment regimes with a time-varying instrumental variable

S Chen, B Zhang - Journal of the Royal Statistical Society Series …, 2023 - academic.oup.com
Estimating dynamic treatment regimes (DTRs) from retrospective observational data is
challenging as some degree of unmeasured confounding is often expected. In this work, we …

A reinforcement learning framework for dynamic mediation analysis

L Ge, J Wang, C Shi, Z Wu… - … Conference on Machine …, 2023 - proceedings.mlr.press
Mediation analysis learns the causal effect transmitted via mediator variables between
treatments and outcomes, and receives increasing attention in various scientific domains to …