Offline reinforcement learning: Tutorial, review, and perspectives on open problems

S Levine, A Kumar, G Tucker, J Fu - arXiv preprint arXiv:2005.01643, 2020 - arxiv.org
In this tutorial article, we aim to provide the reader with the conceptual tools needed to get
started on research on offline reinforcement learning algorithms: reinforcement learning …
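
For orientation, a minimal sketch of the setting the tutorial covers: a policy must be learned from a fixed dataset of transitions, with no further environment interaction. The tabular fitted Q-iteration loop and the synthetic dataset below are generic illustrations, not an algorithm taken from the tutorial itself.

```python
# Minimal sketch of the offline setting: learn from a fixed dataset of
# (s, a, r, s') transitions, with no further environment interaction.
# Tabular fitted Q-iteration on synthetic toy data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9

# Fixed dataset collected earlier by some unknown behavior policy.
dataset = [(rng.integers(n_states), rng.integers(n_actions),
            rng.normal(), rng.integers(n_states)) for _ in range(1000)]

Q = np.zeros((n_states, n_actions))
for _ in range(100):                       # repeated Bellman backups on the data only
    targets = np.zeros_like(Q)
    counts = np.zeros_like(Q)
    for s, a, r, s_next in dataset:
        targets[s, a] += r + gamma * Q[s_next].max()
        counts[s, a] += 1
    Q = np.where(counts > 0, targets / np.maximum(counts, 1), Q)

print("greedy actions per state:", Q.argmax(axis=1))
```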

GradientDICE: Rethinking generalized offline estimation of stationary values

S Zhang, B Liu, S Whiteson - International Conference on …, 2020 - proceedings.mlr.press
We present GradientDICE for estimating the density ratio between the state distribution of
the target policy and the sampling distribution in off-policy reinforcement learning …
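
For context, the quantity DICE-style estimators target is a density ratio w(s, a) = d_pi(s, a) / d_mu(s, a) that reweights samples drawn from the behavior (sampling) distribution into an estimate of the target policy's value. The sketch below computes the ratio from known toy distributions to show the reweighting identity; GradientDICE itself learns the ratio from samples via an optimization objective not reproduced here.

```python
# The reweighting identity behind DICE-style estimators: with
# w(s, a) = d_pi(s, a) / d_mu(s, a), rewards sampled from the behavior
# distribution d_mu can be reweighted into the target policy's value.
# Here w is computed from known toy distributions; GradientDICE learns it.
import numpy as np

rng = np.random.default_rng(0)
n = 6                                      # toy number of (state, action) pairs
d_mu = rng.dirichlet(np.ones(n))           # sampling (behavior) distribution
d_pi = rng.dirichlet(np.ones(n))           # stationary distribution of the target policy
reward = rng.normal(size=n)

w = d_pi / d_mu                            # density ratio (d_mu assumed strictly positive)

samples = rng.choice(n, size=50_000, p=d_mu)
estimate = np.mean(w[samples] * reward[samples])
print("reweighted estimate:", estimate, "true value:", d_pi @ reward)
```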

Learning and planning in average-reward Markov decision processes

Y Wan, A Naik, RS Sutton - International Conference on …, 2021 - proceedings.mlr.press
We introduce learning and planning algorithms for average-reward MDPs, including 1) the
first general proven-convergent off-policy model-free control algorithm without reference …
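
As a rough illustration of the average-reward setting, the sketch below runs a tabular, differential Q-learning-style update in which a reward-rate estimate is maintained alongside the action values and both are driven by the same differential TD error. The toy MDP, exploration scheme, and step sizes are placeholders, not the paper's setup.

```python
# Tabular differential (average-reward) Q-learning-style sketch: a reward-rate
# estimate is updated alongside the action values using the same differential
# TD error. Toy MDP, epsilon-greedy behavior, and step sizes are placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] over next states
R = rng.normal(size=(n_states, n_actions))                        # expected rewards

Q = np.zeros((n_states, n_actions))
reward_rate = 0.0
alpha, eta = 0.1, 0.1

s = 0
for _ in range(50_000):
    a = rng.integers(n_actions) if rng.random() < 0.1 else Q[s].argmax()
    s_next = rng.choice(n_states, p=P[s, a])
    delta = R[s, a] - reward_rate + Q[s_next].max() - Q[s, a]  # differential TD error
    Q[s, a] += alpha * delta
    reward_rate += eta * alpha * delta                         # reward-rate estimate
    s = s_next

print("estimated reward rate:", reward_rate)
```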

Off-policy evaluation in infinite-horizon reinforcement learning with latent confounders

A Bennett, N Kallus, L Li… - … Conference on Artificial …, 2021 - proceedings.mlr.press
Off-policy evaluation (OPE) in reinforcement learning is an important problem in settings
where experimentation is limited, such as healthcare. But, in these very same settings …

A unified framework for alternating offline model training and policy learning

S Yang, S Zhang, Y Feng… - Advances in Neural …, 2022 - proceedings.neurips.cc
In offline model-based reinforcement learning (offline MBRL), we learn a dynamic model
from historically collected data, and subsequently utilize the learned model and fixed …
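
A minimal sketch of the generic offline MBRL recipe this framework builds on: estimate a dynamics model from the fixed dataset, then learn or plan a policy inside the learned model. Empirical counts and value iteration stand in for the neural models and actor-critic updates used in practice, and the paper's alternating model/policy training scheme is not reproduced.

```python
# Generic offline MBRL sketch: (1) fit a dynamics model to the fixed dataset,
# (2) learn a policy inside the learned model. Empirical counts and value
# iteration replace the neural model and actor-critic used in practice.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.95

# Fixed offline dataset of (s, a, r, s') transitions (synthetic toy data).
dataset = [(rng.integers(n_states), rng.integers(n_actions),
            rng.normal(), rng.integers(n_states)) for _ in range(5000)]

# 1) Model learning: empirical transition probabilities and mean rewards.
counts = np.zeros((n_states, n_actions, n_states))
reward_sum = np.zeros((n_states, n_actions))
for s, a, r, s_next in dataset:
    counts[s, a, s_next] += 1
    reward_sum[s, a] += r
n_sa = counts.sum(axis=-1, keepdims=True)
P_hat = counts / np.maximum(n_sa, 1)
R_hat = reward_sum / np.maximum(n_sa[..., 0], 1)

# 2) Policy learning: value iteration in the learned model.
Q = np.zeros((n_states, n_actions))
for _ in range(200):
    Q = R_hat + gamma * P_hat @ Q.max(axis=1)
print("greedy policy in the learned model:", Q.argmax(axis=1))
```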

Average-reward off-policy policy evaluation with function approximation

S Zhang, Y Wan, RS Sutton… - … conference on machine …, 2021 - proceedings.mlr.press
We consider off-policy policy evaluation with function approximation (FA) in average-reward
MDPs, where the goal is to estimate both the reward rate and the differential value function …
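
To make the two estimands concrete, the sketch below runs an on-policy, tabular differential TD(0) loop that tracks a reward-rate estimate alongside differential state values. The paper's actual setting (off-policy data and linear function approximation, with convergence analysis) is more involved; this only illustrates what is being estimated.

```python
# On-policy tabular differential TD(0) sketch: track a reward-rate estimate and
# differential state values for a fixed Markov chain (the chain induced by the
# target policy). Off-policy data and linear function approximation, as in the
# paper, are not handled here.
import numpy as np

rng = np.random.default_rng(0)
n_states = 4
P = rng.dirichlet(np.ones(n_states), size=n_states)   # chain under the target policy
R = rng.normal(size=n_states)                          # expected reward per state

v = np.zeros(n_states)                                 # differential values
reward_rate = 0.0
alpha, eta = 0.05, 0.05

s = 0
for _ in range(100_000):
    s_next = rng.choice(n_states, p=P[s])
    delta = R[s] - reward_rate + v[s_next] - v[s]      # differential TD error
    v[s] += alpha * delta
    reward_rate += eta * alpha * delta
    s = s_next

vals, vecs = np.linalg.eig(P.T)                        # true rate via stationary distribution
d = np.real(vecs[:, np.argmax(np.real(vals))])
d /= d.sum()
print("estimated reward rate:", reward_rate, "true:", d @ R)
```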

Accountable off-policy evaluation with kernel Bellman statistics

Y Feng, T Ren, Z Tang, Q Liu - International Conference on …, 2020 - proceedings.mlr.press
We consider off-policy evaluation (OPE), which evaluates the performance of a new policy
from observed data collected from previous experiments, without requiring the execution of …

Offline reinforcement learning with soft behavior regularization

H Xu, X Zhan, J Li, H Yin - arXiv preprint arXiv:2110.07395, 2021 - arxiv.org
Most prior approaches to offline reinforcement learning (RL) utilize behavior
regularization, typically augmenting existing off-policy actor critic algorithms with a penalty …
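
As a generic illustration of behavior regularization (not the paper's specific soft scheme): with a KL penalty toward the behavior policy and discrete actions, the regularized greedy policy has the closed form pi(a|s) proportional to beta(a|s) exp(Q(s, a) / alpha). The sketch below computes that policy from toy critic values.

```python
# Generic behavior-regularization sketch: improve a policy against Q while
# penalizing KL divergence from the behavior policy beta. For discrete actions
# the solution is pi(a|s) proportional to beta(a|s) * exp(Q(s, a) / alpha).
# This shows the generic penalty, not the paper's specific soft scheme.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, alpha = 3, 4, 0.5

Q = rng.normal(size=(n_states, n_actions))                 # critic estimates
beta = rng.dirichlet(np.ones(n_actions), size=n_states)    # behavior policy

logits = np.log(beta) + Q / alpha
pi = np.exp(logits - logits.max(axis=1, keepdims=True))
pi /= pi.sum(axis=1, keepdims=True)                        # KL-regularized improved policy

print(np.round(pi, 3))
```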

Importance sampling in reinforcement learning with an estimated behavior policy

JP Hanna, S Niekum, P Stone - Machine Learning, 2021 - Springer
In reinforcement learning, importance sampling is a widely used method for evaluating an
expectation under the distribution of data of one policy when the data has in fact been …
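
A small sketch of where the estimated behavior policy enters the importance-sampling estimator, on a toy bandit: observed rewards are reweighted by the ratio of evaluation-policy to behavior-policy probabilities, with the behavior probabilities either taken as known or replaced by empirical action frequencies estimated from the same data. Horizons longer than one step, which the paper treats, multiply these ratios along the trajectory.

```python
# Importance-sampling OPE on a toy bandit: reweight observed rewards by the
# ratio of evaluation-policy to behavior-policy probabilities. beta_hat shows
# where an *estimated* behavior policy (here, empirical action frequencies)
# replaces the true one.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
behavior = np.array([0.5, 0.3, 0.2])           # true behavior policy
target = np.array([0.1, 0.2, 0.7])             # evaluation (target) policy
mean_reward = np.array([1.0, 0.0, 2.0])

actions = rng.choice(n_actions, size=10_000, p=behavior)
rewards = mean_reward[actions] + rng.normal(size=actions.size)

beta_hat = np.bincount(actions, minlength=n_actions) / actions.size  # estimated behavior policy

is_true = np.mean(target[actions] / behavior[actions] * rewards)
is_est = np.mean(target[actions] / beta_hat[actions] * rewards)
print("IS (true beta):", is_true, "IS (estimated beta):", is_est,
      "true value:", target @ mean_reward)
```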

Mean-variance policy iteration for risk-averse reinforcement learning

S Zhang, B Liu, S Whiteson - Proceedings of the AAAI Conference on …, 2021 - ojs.aaai.org
We present a mean-variance policy iteration (MVPI) framework for risk-averse control in a
discounted infinite horizon MDP optimizing the variance of a per-step reward random …
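
To illustrate the criterion such methods optimize: trade off the long-run average reward against the variance of the per-step reward. The sketch below simply enumerates deterministic policies in a toy MDP and scores each by mean minus lambda times variance under its stationary distribution; MVPI instead optimizes this kind of objective with a policy-iteration scheme rather than enumeration.

```python
# Mean-variance criterion sketch: score each deterministic policy in a toy MDP
# by (average reward) - lambda * (variance of the per-step reward) under its
# stationary distribution, then pick the best by enumeration.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_states, n_actions, lam = 3, 2, 1.0
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] over next states
R = rng.normal(size=(n_states, n_actions))

def stationary(P_pi):
    vals, vecs = np.linalg.eig(P_pi.T)
    d = np.real(vecs[:, np.argmax(np.real(vals))])
    return d / d.sum()

best = None
for pi in product(range(n_actions), repeat=n_states):
    pi = np.array(pi)
    P_pi = P[np.arange(n_states), pi]          # (S, S) transition matrix under pi
    r = R[np.arange(n_states), pi]
    d = stationary(P_pi)
    mean = d @ r
    var = d @ (r - mean) ** 2                  # variance of the per-step reward
    score = mean - lam * var
    if best is None or score > best[0]:
        best = (score, tuple(pi), mean, var)
print("best (score, policy, mean, variance):", best)
```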