A review of off-policy evaluation in reinforcement learning

M Uehara, C Shi, N Kallus - arXiv preprint arXiv:2212.06355, 2022 - arxiv.org
Reinforcement learning (RL) is one of the most vibrant research frontiers in machine
learning and has recently been applied to solve a number of challenging problems. In this …

Batch policy learning in average reward Markov decision processes

P Liao, Z Qi, R Wan, P Klasnja, SA Murphy - Annals of Statistics, 2022 - ncbi.nlm.nih.gov
We consider the batch (offline) policy learning problem in the infinite-horizon Markov
Decision Process. Motivated by mobile health applications, we focus on learning a policy …
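
For context, the average-reward criterion replaces the discounted return with the long-run average reward per step. A standard way to write the objective and the associated differential Bellman equation (notation here is illustrative, not necessarily the paper's own) is:

    \eta^{\pi} = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}\Big[\sum_{t=0}^{T-1} r(S_t, A_t)\Big],
    \qquad
    Q^{\pi}(s, a) = r(s, a) - \eta^{\pi} + \mathbb{E}\big[Q^{\pi}(S', A') \mid S = s, A = a\big],

with S' \sim P(\cdot \mid s, a) and A' \sim \pi(\cdot \mid S'); batch policy learning then seeks the policy maximizing \eta^{\pi} from logged data alone.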

Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency

M Uehara, M Imaizumi, N Jiang, N Kallus… - arXiv preprint arXiv …, 2021 - arxiv.org
We offer a theoretical characterization of off-policy evaluation (OPE) in reinforcement
learning using function approximation for marginal importance weights and $q$-functions …
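
The "marginal importance weights" here are ratios of state-action occupancy measures. A common identity underlying such minimax methods, stated under standard overlap assumptions (notation is ours, for illustration):

    d^{\pi}_{\gamma}(s, a) := (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t} \Pr{}^{\pi}(S_t = s, A_t = a),
    \qquad
    J(\pi) = \frac{1}{1 - \gamma}\, \mathbb{E}_{(S, A) \sim d^{b}}\big[w^{\pi}(S, A)\, r(S, A)\big],
    \quad w^{\pi} := \frac{d^{\pi}_{\gamma}}{d^{b}},

where d^{b} is the behavior data distribution; estimating w^{\pi} and Q^{\pi} jointly is what the minimax formulation targets.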

On well-posedness and minimax optimal rates of nonparametric Q-function estimation in off-policy evaluation

X Chen, Z Qi - International Conference on Machine …, 2022 - proceedings.mlr.press
We study the off-policy evaluation (OPE) problem in an infinite-horizon Markov decision
process with continuous states and actions. We recast the $Q$-function estimation into a …
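
The recasting alluded to is typically into a conditional moment restriction: the Q-function of the target policy \pi satisfies (in illustrative notation)

    \mathbb{E}\big[R + \gamma\, Q^{\pi}(S', A') - Q^{\pi}(S, A) \,\big|\, S = s, A = a\big] = 0,
    \qquad A' \sim \pi(\cdot \mid S'),

for all (s, a). With continuous states and actions this is a nonparametric instrumental-variable-type problem, which is why well-posedness and minimax rates become the central questions.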

Off-policy confidence interval estimation with confounded Markov decision process

C Shi, J Zhu, Y Shen, S Luo, H Zhu… - Journal of the American …, 2024 - Taylor & Francis
This article is concerned with constructing a confidence interval for a target policy's value
offline based on pre-collected observational data in infinite-horizon settings. Most of the …

An instrumental variable approach to confounded off-policy evaluation

Y Xu, J Zhu, C Shi, S Luo… - … Conference on Machine …, 2023 - proceedings.mlr.press
Off-policy evaluation (OPE) aims to estimate the return of a target policy using some pre-
collected observational data generated by a potentially different behavior policy. In many …

Future-dependent value-based off-policy evaluation in POMDPs

M Uehara, H Kiyohara, A Bennett… - Advances in …, 2024 - proceedings.neurips.cc
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general
function approximation. Existing methods such as sequential importance sampling …
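
For reference, sequential importance sampling reweights each observed trajectory by the cumulative likelihood ratio of the target to the behavior policy; in a fully observed MDP its per-decision form reads (illustrative notation)

    \hat{J}_{\mathrm{IS}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^{t}
    \Bigg(\prod_{k=0}^{t} \frac{\pi(a_k^{(i)} \mid s_k^{(i)})}{\pi_b(a_k^{(i)} \mid s_k^{(i)})}\Bigg) r_t^{(i)}.

The cumulative ratios grow multiplicatively in the horizon (the "curse of horizon"), and in a POMDP the states s_k are not even observed, which motivates the value-based alternatives studied here.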

Deeply-debiased off-policy interval estimation

C Shi, R Wan, V Chernozhukov… - … conference on machine …, 2021 - proceedings.mlr.press
Off-policy evaluation learns a target policy's value with a historical dataset generated by a
different behavior policy. In addition to a point estimate, many applications would benefit …
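
Such debiasing schemes typically start from the standard doubly robust construction for infinite-horizon OPE, which combines a Q-function estimate with a marginal density-ratio estimate (illustrative notation; the paper's estimator iterates beyond this baseline):

    \hat{J}_{\mathrm{DR}} = \mathbb{E}_{S_0 \sim \nu}\big[\hat{Q}(S_0, \pi)\big]
    + \frac{1}{1 - \gamma}\, \mathbb{E}_n\Big[\hat{w}(S, A)\big(R + \gamma\, \hat{Q}(S', \pi) - \hat{Q}(S, A)\big)\Big],

where \hat{Q}(s, \pi) := \mathbb{E}_{a \sim \pi(\cdot \mid s)}[\hat{Q}(s, a)]. The estimator stays consistent if either \hat{w} or \hat{Q} is, which is what makes interval estimation on top of it attractive.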

Online bootstrap inference for policy evaluation in reinforcement learning

P Ramprasad, Y Li, Z Yang, Z Wang… - Journal of the …, 2023 - Taylor & Francis
The recent emergence of reinforcement learning (RL) has created a demand for robust
statistical inference methods for the parameter estimates computed using these algorithms …

Bootstrapping fitted Q-evaluation for off-policy inference

B Hao, X Ji, Y Duan, H Lu… - International …, 2021 - proceedings.mlr.press
Bootstrapping provides a flexible and effective approach for assessing the quality of batch
reinforcement learning, yet its theoretical properties are poorly understood. In this paper, we …
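
To make the construction concrete, here is a minimal sketch of bootstrapped fitted Q-evaluation (FQE) on a tabular MDP; the tabular regression step, the transition-level resampling, and all names are our illustrative assumptions rather than the paper's exact procedure:

import numpy as np

def fqe_tabular(transitions, pi, n_states, n_actions, gamma=0.95, n_iters=200):
    # Fitted Q-evaluation: repeatedly regress r + gamma * Q(s', pi(s')) onto (s, a).
    # transitions: array with rows (s, a, r, s'); pi[s] gives the target policy's action.
    s = transitions[:, 0].astype(int)
    a = transitions[:, 1].astype(int)
    r = transitions[:, 2]
    s_next = transitions[:, 3].astype(int)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        targets = r + gamma * Q[s_next, pi[s_next]]
        # Tabular "regression": average the targets within each (s, a) cell.
        sums = np.zeros_like(Q)
        counts = np.zeros_like(Q)
        np.add.at(sums, (s, a), targets)
        np.add.at(counts, (s, a), 1.0)
        # Keep the previous estimate for unvisited (s, a) cells.
        Q = np.where(counts > 0, sums / np.maximum(counts, 1.0), Q)
    return Q

def bootstrap_fqe_ci(transitions, pi, init_states, n_states, n_actions,
                     gamma=0.95, n_boot=200, alpha=0.05, seed=0):
    # Percentile-bootstrap interval for the FQE estimate of the policy value,
    # resampling individual transitions with replacement.
    rng = np.random.default_rng(seed)
    n = len(transitions)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        Q_b = fqe_tabular(transitions[idx], pi, n_states, n_actions, gamma)
        values.append(Q_b[init_states, pi[init_states]].mean())
    return np.quantile(values, [alpha / 2, 1 - alpha / 2])

Resampling whole trajectories rather than single transitions is a natural variant when transitions within an episode are dependent; which scheme yields valid inference is exactly the kind of question the paper's theory addresses.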