Provable benefits of actor-critic methods for offline reinforcement learning

A Zanette, MJ Wainwright… - Advances in neural …, 2021 - proceedings.neurips.cc
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so
well-understood theoretically. We propose a new offline actor-critic algorithm that naturally …

Off-policy confidence interval estimation with confounded Markov decision process

C Shi, J Zhu, Y Shen, S Luo, H Zhu… - Journal of the American …, 2024 - Taylor & Francis
This article is concerned with constructing a confidence interval for a target policy's value
offline based on pre-collected observational data in infinite horizon settings. Most of the …

Online bootstrap inference for policy evaluation in reinforcement learning

P Ramprasad, Y Li, Z Yang, Z Wang… - Journal of the …, 2023 - Taylor & Francis
The recent emergence of reinforcement learning (RL) has created a demand for robust
statistical inference methods for the parameter estimates computed using these algorithms …

HOPE: Human-centric off-policy evaluation for e-learning and healthcare

G Gao, S Ju, MS Ausin, M Chi - arXiv preprint arXiv:2302.09212, 2023 - arxiv.org
Reinforcement learning (RL) has been extensively researched for enhancing human-
environment interactions in various human-centric tasks, including e-learning and …

Dynamic causal effects evaluation in A/B testing with a reinforcement learning framework

C Shi, X Wang, S Luo, H Zhu, J Ye… - Journal of the American …, 2023 - Taylor & Francis
A/B testing, or online experiment, is a standard business strategy to compare a new product
with an old one in pharmaceutical, technological, and traditional industries. Major …

A statistical analysis of Polyak-Ruppert averaged Q-learning

X Li, W Yang, J Liang, Z Zhang… - … Conference on Artificial …, 2023 - proceedings.mlr.press
We study Q-learning with Polyak-Ruppert averaging (a.k.a. averaged Q-learning) in a
discounted Markov decision process in synchronous and tabular settings. Under a Lipschitz …

On trajectory augmentations for off-policy evaluation

G Gao, Q Gao, X Yang, S Ju, M Pajic… - The Twelfth International …, 2024 - openreview.net
In the realm of reinforcement learning (RL), off-policy evaluation (OPE) holds a pivotal
position, especially in high-stakes human-involved scenarios such as e-learning and …

Bellman residual orthogonalization for offline reinforcement learning

A Zanette, MJ Wainwright - Advances in Neural Information …, 2022 - proceedings.neurips.cc
We propose and analyze a reinforcement learning principle that approximates the Bellman
equations by enforcing their validity only along a user-defined space of test functions …

Debiasing samples from online learning using bootstrap

N Chen, X Gao, Y Xiong - International Conference on …, 2022 - proceedings.mlr.press
It has been recently shown in the literature (Nie et al., 2018; Shin et al., 2019a, b) that the
sample averages from online learning experiments are biased when used to estimate the …

Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds

Y Feng, Z Tang, N Zhang, Q Liu - arXiv preprint arXiv:2103.05741, 2021 - arxiv.org
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy
based on offline data previously collected under different policies. Therefore, OPE is a key …