Provable benefits of actor-critic methods for offline reinforcement learning
A Zanette, MJ Wainwright… - Advances in neural …, 2021 - proceedings.neurips.cc
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so
well-understood theoretically. We propose a new offline actor-critic algorithm that naturally …
Off-policy confidence interval estimation with confounded Markov decision process
This article is concerned with constructing a confidence interval for a target policy's value
offline based on pre-collected observational data in infinite-horizon settings. Most of the …
Online bootstrap inference for policy evaluation in reinforcement learning
The recent emergence of reinforcement learning (RL) has created a demand for robust
statistical inference methods for the parameter estimates computed using these algorithms …
HOPE: Human-centric off-policy evaluation for e-learning and healthcare
Reinforcement learning (RL) has been extensively researched for enhancing human-
environment interactions in various human-centric tasks, including e-learning and …
Dynamic causal effects evaluation in A/B testing with a reinforcement learning framework
A/B testing, or online experimentation, is a standard business strategy to compare a new product
with an old one in pharmaceutical, technological, and traditional industries. Major …
A statistical analysis of Polyak-Ruppert averaged Q-learning
We study Q-learning with Polyak-Ruppert averaging (a.k.a. averaged Q-learning) in a
discounted Markov decision process in synchronous and tabular settings. Under a Lipschitz …
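Since this snippet names a concrete procedure, here is a minimal sketch of synchronous tabular Q-learning with Polyak-Ruppert iterate averaging; the step-size schedule, function name, and input layout are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def averaged_q_learning(P, R, gamma=0.9, n_iters=2000, seed=0):
    """Synchronous tabular Q-learning with Polyak-Ruppert (iterate) averaging.

    P: (S, A, S) transition probabilities; R: (S, A) rewards.
    """
    S, A = R.shape
    rng = np.random.default_rng(seed)
    Q = np.zeros((S, A))
    Q_bar = np.zeros((S, A))              # running Polyak-Ruppert average of the iterates
    for t in range(1, n_iters + 1):
        alpha = 1.0 / t ** 0.7            # assumed polynomial step size
        Q_new = Q.copy()
        for s in range(S):
            for a in range(A):
                s_next = rng.choice(S, p=P[s, a])          # one sampled transition per (s, a)
                target = R[s, a] + gamma * Q[s_next].max()
                Q_new[s, a] = Q[s, a] + alpha * (target - Q[s, a])
        Q = Q_new
        Q_bar += (Q - Q_bar) / t          # average of the first t iterates
    return Q_bar
```

For a quick check, the function can be run on a small random MDP, e.g. `P = rng.dirichlet(np.ones(S), size=(S, A))` with uniform rewards.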
On trajectory augmentations for off-policy evaluation
In the realm of reinforcement learning (RL), off-policy evaluation (OPE) holds a pivotal
position, especially in high-stakes human-involved scenarios such as e-learning and …
Bellman residual orthogonalization for offline reinforcement learning
A Zanette, MJ Wainwright - Advances in Neural Information …, 2022 - proceedings.neurips.cc
We propose and analyze a reinforcement learning principle that approximates the Bellman
equations by enforcing their validity only along a user-defined space of test functions …
Debiasing samples from online learning using bootstrap
It has recently been shown in the literature (Nie et al., 2018; Shin et al., 2019a, b) that the
sample averages from online learning experiments are biased when used to estimate the …
Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy
based on offline data previously collected under different policies. Therefore, OPE is a key …
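To make the OPE task described in this snippet concrete, the following is a minimal sketch of a per-decision importance-sampling value estimate from logged trajectories; the data layout, function names, and the choice of importance sampling as the estimator are illustrative assumptions, not the bounds developed in the paper.

```python
import numpy as np

def is_ope_estimate(trajectories, target_prob, gamma=0.99):
    """Per-decision importance-sampling estimate of the target policy's value.

    trajectories: list of trajectories, each a list of
        (state, action, reward, behavior_prob) tuples, where behavior_prob is
        the probability the logging (behavior) policy assigned to the taken action.
    target_prob: function (state, action) -> probability under the target policy.
    """
    estimates = []
    for traj in trajectories:
        rho, value = 1.0, 0.0
        for t, (s, a, r, b) in enumerate(traj):
            rho *= target_prob(s, a) / b      # cumulative importance ratio up to step t
            value += (gamma ** t) * rho * r   # reweighted, discounted reward
        estimates.append(value)
    return float(np.mean(estimates))
```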