Provable benefits of actor-critic methods for offline reinforcement learning

A Zanette, MJ Wainwright… - Advances in neural …, 2021 - proceedings.neurips.cc
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so
well-understood theoretically. We propose a new offline actor-critic algorithm that naturally …

Off-policy confidence interval estimation with confounded Markov decision process

C Shi, J Zhu, Y Shen, S Luo, H Zhu… - Journal of the American …, 2024 - Taylor & Francis
This article is concerned with constructing a confidence interval for a target policy's value
offline based on pre-collected observational data in infinite horizon settings. Most of the …

Online bootstrap inference for policy evaluation in reinforcement learning

P Ramprasad, Y Li, Z Yang, Z Wang… - Journal of the …, 2023 - Taylor & Francis
The recent emergence of reinforcement learning (RL) has created a demand for robust
statistical inference methods for the parameter estimates computed using these algorithms …

HOPE: Human-centric off-policy evaluation for e-learning and healthcare

G Gao, S Ju, MS Ausin, M Chi - arXiv preprint arXiv:2302.09212, 2023 - arxiv.org
Reinforcement learning (RL) has been extensively researched for enhancing human-
environment interactions in various human-centric tasks, including e-learning and …

Dynamic causal effects evaluation in A/B testing with a reinforcement learning framework

C Shi, X Wang, S Luo, H Zhu, J Ye… - Journal of the American …, 2023 - Taylor & Francis
A/B testing, or online experiment, is a standard business strategy to compare a new product
with an old one in pharmaceutical, technological, and traditional industries. Major …

A statistical analysis of Polyak-Ruppert averaged Q-learning

X Li, W Yang, J Liang, Z Zhang… - … Conference on Artificial …, 2023 - proceedings.mlr.press
We study Q-learning with Polyak-Ruppert averaging (a.k.a. averaged Q-learning) in a
discounted Markov decision process in synchronous and tabular settings. Under a Lipschitz …

On trajectory augmentations for off-policy evaluation

G Gao, Q Gao, X Yang, S Ju, M Pajic… - The Twelfth International …, 2024 - openreview.net
In the realm of reinforcement learning (RL), off-policy evaluation (OPE) holds a pivotal
position, especially in high-stakes human-involved scenarios such as e-learning and …

Bellman residual orthogonalization for offline reinforcement learning

A Zanette, MJ Wainwright - Advances in Neural Information …, 2022 - proceedings.neurips.cc
We propose and analyze a reinforcement learning principle that approximates the Bellman
equations by enforcing their validity only along a user-defined space of test functions …

Debiasing samples from online learning using bootstrap

N Chen, X Gao, Y Xiong - International Conference on …, 2022 - proceedings.mlr.press
It has been recently shown in the literature (Nie et al., 2018; Shin et al., 2019a, b) that the
sample averages from online learning experiments are biased when used to estimate the …

Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds

Y Feng, Z Tang, N Zhang, Q Liu - arXiv preprint arXiv:2103.05741, 2021 - arxiv.org
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy
based on offline data previously collected under different policies. Therefore, OPE is a key …