Offline reinforcement learning: Tutorial, review, and perspectives on open problems

S Levine, A Kumar, G Tucker, J Fu - arXiv preprint arXiv:2005.01643, 2020 - arxiv.org
In this tutorial article, we aim to provide the reader with the conceptual tools needed to get
started on research on offline reinforcement learning algorithms: reinforcement learning …
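
For orientation, a minimal sketch of the setting the tutorial covers: a policy must be learned from a fixed dataset of transitions, with no further environment interaction. The tabular fitted Q-iteration loop and the synthetic dataset below are generic illustrations, not an algorithm taken from the tutorial itself.

```python
# Minimal sketch of the offline setting: learn from a fixed dataset of
# (s, a, r, s') transitions, with no further environment interaction.
# Tabular fitted Q-iteration on synthetic toy data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9

# Fixed dataset collected earlier by some unknown behavior policy.
dataset = [(rng.integers(n_states), rng.integers(n_actions),
            rng.normal(), rng.integers(n_states)) for _ in range(1000)]

Q = np.zeros((n_states, n_actions))
for _ in range(100):                       # repeated Bellman backups on the data only
    targets = np.zeros_like(Q)
    counts = np.zeros_like(Q)
    for s, a, r, s_next in dataset:
        targets[s, a] += r + gamma * Q[s_next].max()
        counts[s, a] += 1
    Q = np.where(counts > 0, targets / np.maximum(counts, 1), Q)

print("greedy actions per state:", Q.argmax(axis=1))
```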

GradientDICE: Rethinking generalized offline estimation of stationary values

S Zhang, B Liu, S Whiteson - International Conference on …, 2020 - proceedings.mlr.press
We present GradientDICE for estimating the density ratio between the state distribution of
the target policy and the sampling distribution in off-policy reinforcement learning …
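
For context, the quantity DICE-style estimators target is a density ratio w(s, a) = d_pi(s, a) / d_mu(s, a) that reweights samples drawn from the behavior (sampling) distribution into an estimate of the target policy's value. The sketch below computes the ratio from known toy distributions to show the reweighting identity; GradientDICE itself learns the ratio from samples via an optimization objective not reproduced here.

```python
# The reweighting identity behind DICE-style estimators: with
# w(s, a) = d_pi(s, a) / d_mu(s, a), rewards sampled from the behavior
# distribution d_mu can be reweighted into the target policy's value.
# Here w is computed from known toy distributions; GradientDICE learns it.
import numpy as np

rng = np.random.default_rng(0)
n = 6                                      # toy number of (state, action) pairs
d_mu = rng.dirichlet(np.ones(n))           # sampling (behavior) distribution
d_pi = rng.dirichlet(np.ones(n))           # stationary distribution of the target policy
reward = rng.normal(size=n)

w = d_pi / d_mu                            # density ratio (d_mu assumed strictly positive)

samples = rng.choice(n, size=50_000, p=d_mu)
estimate = np.mean(w[samples] * reward[samples])
print("reweighted estimate:", estimate, "true value:", d_pi @ reward)
```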

Learning and planning in average-reward Markov decision processes

Y Wan, A Naik, RS Sutton - International Conference on …, 2021 - proceedings.mlr.press
We introduce learning and planning algorithms for average-reward MDPs, including 1) the
first general proven-convergent off-policy model-free control algorithm without reference …
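
As a rough illustration of the average-reward setting, the sketch below runs a tabular, differential Q-learning-style update in which a reward-rate estimate is maintained alongside the action values and both are driven by the same differential TD error. The toy MDP, exploration scheme, and step sizes are placeholders, not the paper's setup.

```python
# Tabular differential (average-reward) Q-learning-style sketch: a reward-rate
# estimate is updated alongside the action values using the same differential
# TD error. Toy MDP, epsilon-greedy behavior, and step sizes are placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] over next states
R = rng.normal(size=(n_states, n_actions))                        # expected rewards

Q = np.zeros((n_states, n_actions))
reward_rate = 0.0
alpha, eta = 0.1, 0.1

s = 0
for _ in range(50_000):
    a = rng.integers(n_actions) if rng.random() < 0.1 else Q[s].argmax()
    s_next = rng.choice(n_states, p=P[s, a])
    delta = R[s, a] - reward_rate + Q[s_next].max() - Q[s, a]  # differential TD error
    Q[s, a] += alpha * delta
    reward_rate += eta * alpha * delta                         # reward-rate estimate
    s = s_next

print("estimated reward rate:", reward_rate)
```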

Off-policy evaluation in infinite-horizon reinforcement learning with latent confounders

A Bennett, N Kallus, L Li… - … Conference on Artificial …, 2021 - proceedings.mlr.press
Off-policy evaluation (OPE) in reinforcement learning is an important problem in settings
where experimentation is limited, such as healthcare. But, in these very same settings …

A unified framework for alternating offline model training and policy learning

S Yang, S Zhang, Y Feng… - Advances in Neural …, 2022 - proceedings.neurips.cc
In offline model-based reinforcement learning (offline MBRL), we learn a dynamic model
from historically collected data, and subsequently utilize the learned model and fixed …
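
A minimal sketch of the generic offline MBRL recipe this framework builds on: estimate a dynamics model from the fixed dataset, then learn or plan a policy inside the learned model. Empirical counts and value iteration stand in for the neural models and actor-critic updates used in practice, and the paper's alternating model/policy training scheme is not reproduced.

```python
# Generic offline MBRL sketch: (1) fit a dynamics model to the fixed dataset,
# (2) learn a policy inside the learned model. Empirical counts and value
# iteration replace the neural model and actor-critic used in practice.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.95

# Fixed offline dataset of (s, a, r, s') transitions (synthetic toy data).
dataset = [(rng.integers(n_states), rng.integers(n_actions),
            rng.normal(), rng.integers(n_states)) for _ in range(5000)]

# 1) Model learning: empirical transition probabilities and mean rewards.
counts = np.zeros((n_states, n_actions, n_states))
reward_sum = np.zeros((n_states, n_actions))
for s, a, r, s_next in dataset:
    counts[s, a, s_next] += 1
    reward_sum[s, a] += r
n_sa = counts.sum(axis=-1, keepdims=True)
P_hat = counts / np.maximum(n_sa, 1)
R_hat = reward_sum / np.maximum(n_sa[..., 0], 1)

# 2) Policy learning: value iteration in the learned model.
Q = np.zeros((n_states, n_actions))
for _ in range(200):
    Q = R_hat + gamma * P_hat @ Q.max(axis=1)
print("greedy policy in the learned model:", Q.argmax(axis=1))
```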

Average-reward off-policy policy evaluation with function approximation

S Zhang, Y Wan, RS Sutton… - … conference on machine …, 2021 - proceedings.mlr.press
We consider off-policy policy evaluation with function approximation (FA) in average-reward
MDPs, where the goal is to estimate both the reward rate and the differential value function …
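
To make the two estimands concrete, the sketch below runs an on-policy, tabular differential TD(0) loop that tracks a reward-rate estimate alongside differential state values. The paper's actual setting (off-policy data and linear function approximation, with convergence analysis) is more involved; this only illustrates what is being estimated.

```python
# On-policy tabular differential TD(0) sketch: track a reward-rate estimate and
# differential state values for a fixed Markov chain (the chain induced by the
# target policy). Off-policy data and linear function approximation, as in the
# paper, are not handled here.
import numpy as np

rng = np.random.default_rng(0)
n_states = 4
P = rng.dirichlet(np.ones(n_states), size=n_states)   # chain under the target policy
R = rng.normal(size=n_states)                          # expected reward per state

v = np.zeros(n_states)                                 # differential values
reward_rate = 0.0
alpha, eta = 0.05, 0.05

s = 0
for _ in range(100_000):
    s_next = rng.choice(n_states, p=P[s])
    delta = R[s] - reward_rate + v[s_next] - v[s]      # differential TD error
    v[s] += alpha * delta
    reward_rate += eta * alpha * delta
    s = s_next

vals, vecs = np.linalg.eig(P.T)                        # true rate via stationary distribution
d = np.real(vecs[:, np.argmax(np.real(vals))])
d /= d.sum()
print("estimated reward rate:", reward_rate, "true:", d @ R)
```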

Accountable off-policy evaluation with kernel Bellman statistics

Y Feng, T Ren, Z Tang, Q Liu - International Conference on …, 2020 - proceedings.mlr.press
We consider off-policy evaluation (OPE), which evaluates the performance of a new policy
from observed data collected from previous experiments, without requiring the execution of …

Offline reinforcement learning with soft behavior regularization

H Xu, X Zhan, J Li, H Yin - arXiv preprint arXiv:2110.07395, 2021 - arxiv.org
Most prior approaches to offline reinforcement learning (RL) utilize behavior
regularization, typically augmenting existing off-policy actor critic algorithms with a penalty …
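
As a generic illustration of behavior regularization (not the paper's specific soft scheme): with a KL penalty toward the behavior policy and discrete actions, the regularized greedy policy has the closed form pi(a|s) proportional to beta(a|s) exp(Q(s, a) / alpha). The sketch below computes that policy from toy critic values.

```python
# Generic behavior-regularization sketch: improve a policy against Q while
# penalizing KL divergence from the behavior policy beta. For discrete actions
# the solution is pi(a|s) proportional to beta(a|s) * exp(Q(s, a) / alpha).
# This shows the generic penalty, not the paper's specific soft scheme.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, alpha = 3, 4, 0.5

Q = rng.normal(size=(n_states, n_actions))                 # critic estimates
beta = rng.dirichlet(np.ones(n_actions), size=n_states)    # behavior policy

logits = np.log(beta) + Q / alpha
pi = np.exp(logits - logits.max(axis=1, keepdims=True))
pi /= pi.sum(axis=1, keepdims=True)                        # KL-regularized improved policy

print(np.round(pi, 3))
```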

Importance sampling in reinforcement learning with an estimated behavior policy

JP Hanna, S Niekum, P Stone - Machine Learning, 2021 - Springer
In reinforcement learning, importance sampling is a widely used method for evaluating an
expectation under the distribution of data of one policy when the data has in fact been …
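
A small sketch of where the estimated behavior policy enters the importance-sampling estimator, on a toy bandit: observed rewards are reweighted by the ratio of evaluation-policy to behavior-policy probabilities, with the behavior probabilities either taken as known or replaced by empirical action frequencies estimated from the same data. Horizons longer than one step, which the paper treats, multiply these ratios along the trajectory.

```python
# Importance-sampling OPE on a toy bandit: reweight observed rewards by the
# ratio of evaluation-policy to behavior-policy probabilities. beta_hat shows
# where an *estimated* behavior policy (here, empirical action frequencies)
# replaces the true one.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
behavior = np.array([0.5, 0.3, 0.2])           # true behavior policy
target = np.array([0.1, 0.2, 0.7])             # evaluation (target) policy
mean_reward = np.array([1.0, 0.0, 2.0])

actions = rng.choice(n_actions, size=10_000, p=behavior)
rewards = mean_reward[actions] + rng.normal(size=actions.size)

beta_hat = np.bincount(actions, minlength=n_actions) / actions.size  # estimated behavior policy

is_true = np.mean(target[actions] / behavior[actions] * rewards)
is_est = np.mean(target[actions] / beta_hat[actions] * rewards)
print("IS (true beta):", is_true, "IS (estimated beta):", is_est,
      "true value:", target @ mean_reward)
```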

Mean-variance policy iteration for risk-averse reinforcement learning

S Zhang, B Liu, S Whiteson - Proceedings of the AAAI Conference on …, 2021 - ojs.aaai.org
We present a mean-variance policy iteration (MVPI) framework for risk-averse control in a
discounted infinite horizon MDP optimizing the variance of a per-step reward random …
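
To illustrate the criterion such methods optimize: trade off the long-run average reward against the variance of the per-step reward. The sketch below simply enumerates deterministic policies in a toy MDP and scores each by mean minus lambda times variance under its stationary distribution; MVPI instead optimizes this kind of objective with a policy-iteration scheme rather than enumeration.

```python
# Mean-variance criterion sketch: score each deterministic policy in a toy MDP
# by (average reward) - lambda * (variance of the per-step reward) under its
# stationary distribution, then pick the best by enumeration.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_states, n_actions, lam = 3, 2, 1.0
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] over next states
R = rng.normal(size=(n_states, n_actions))

def stationary(P_pi):
    vals, vecs = np.linalg.eig(P_pi.T)
    d = np.real(vecs[:, np.argmax(np.real(vals))])
    return d / d.sum()

best = None
for pi in product(range(n_actions), repeat=n_states):
    pi = np.array(pi)
    P_pi = P[np.arange(n_states), pi]          # (S, S) transition matrix under pi
    r = R[np.arange(n_states), pi]
    d = stationary(P_pi)
    mean = d @ r
    var = d @ (r - mean) ** 2                  # variance of the per-step reward
    score = mean - lam * var
    if best is None or score > best[0]:
        best = (score, tuple(pi), mean, var)
print("best (score, policy, mean, variance):", best)
```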