Offline reinforcement learning: Tutorial, review, and perspectives on open problems
In this tutorial article, we aim to provide the reader with the conceptual tools needed to get
started on research on offline reinforcement learning algorithms: reinforcement learning …
GradientDICE: Rethinking generalized offline estimation of stationary values
We present GradientDICE for estimating the density ratio between the state distribution of
the target policy and the sampling distribution in off-policy reinforcement learning …
Learning and planning in average-reward Markov decision processes
We introduce learning and planning algorithms for average-reward MDPs, including 1) the
first general proven-convergent off-policy model-free control algorithm without reference …
Off-policy evaluation in infinite-horizon reinforcement learning with latent confounders
Off-policy evaluation (OPE) in reinforcement learning is an important problem in settings
where experimentation is limited, such as healthcare. But, in these very same settings …
A unified framework for alternating offline model training and policy learning
In offline model-based reinforcement learning (offline MBRL), we learn a dynamic model
from historically collected data, and subsequently utilize the learned model and fixed …
Average-reward off-policy policy evaluation with function approximation
We consider off-policy policy evaluation with function approximation (FA) in average-reward
MDPs, where the goal is to estimate both the reward rate and the differential value function …
Accountable off-policy evaluation with kernel Bellman statistics
We consider off-policy evaluation (OPE), which evaluates the performance of a new policy
from observed data collected from previous experiments, without requiring the execution of …
Offline reinforcement learning with soft behavior regularization
Most prior approaches to offline reinforcement learning (RL) utilize behavior
regularization, typically augmenting existing off-policy actor-critic algorithms with a penalty …
Importance sampling in reinforcement learning with an estimated behavior policy
In reinforcement learning, importance sampling is a widely used method for evaluating an
expectation under the distribution of data of one policy when the data has in fact been …
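To make the importance-sampling idea concrete, here is a minimal sketch of ordinary (trajectory-wise) importance sampling for off-policy evaluation: returns collected under a behavior policy are reweighted by the product of per-step probability ratios. The function name `ois_estimate` and the policy interfaces are illustrative assumptions, not the estimator from the paper above.

```python
import numpy as np

def ois_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary importance-sampling estimate of the target policy's value.

    trajectories: list of trajectories, each a list of (state, action, reward)
    pi_e, pi_b: callables (state, action) -> action probability under the
                evaluation and behavior policies, respectively
    """
    estimates = []
    for traj in trajectories:
        ratio, ret, discount = 1.0, 0.0, 1.0
        for s, a, r in traj:
            # Accumulate the per-step likelihood ratio and discounted return.
            ratio *= pi_e(s, a) / pi_b(s, a)
            ret += discount * r
            discount *= gamma
        estimates.append(ratio * ret)
    return float(np.mean(estimates))
```

For example, a single one-step trajectory with reward 1.0, where the evaluation policy takes the logged action with probability 1 and the behavior policy with probability 0.5, yields an estimate of 2.0. The cited paper's point is that `pi_b` is often itself estimated from data rather than known.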
Mean-variance policy iteration for risk-averse reinforcement learning
We present a mean-variance policy iteration (MVPI) framework for risk-averse control in a
discounted infinite horizon MDP optimizing the variance of a per-step reward random …
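The risk-averse objective referenced above trades off expected per-step reward against its variance. A minimal sketch of such a mean-variance criterion, estimated from reward samples, is below; the function name `mean_variance_objective` and the penalty weight `lam` are illustrative assumptions, not the MVPI algorithm itself.

```python
import numpy as np

def mean_variance_objective(rewards, lam=0.5):
    """Sample estimate of E[R] - lam * Var[R] for per-step rewards R.

    rewards: iterable of observed per-step rewards
    lam: risk-aversion weight on the variance penalty (assumed parameter)
    """
    r = np.asarray(rewards, dtype=float)
    # Population variance (ddof=0); larger lam penalizes spread more heavily.
    return float(r.mean() - lam * r.var())
```

A policy whose rewards are concentrated (e.g. all 1.0) scores its full mean, while one with the same mean but spread-out rewards is penalized in proportion to the variance.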