Online robust reinforcement learning with model uncertainty
Robust reinforcement learning (RL) aims to find a policy that optimizes the worst-case
performance over an uncertainty set of MDPs. In this paper, we focus on model-free robust …
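The worst-case objective above can be illustrated with a robust Bellman operator that minimizes over a finite uncertainty set of transition kernels. This is a minimal sketch on a hypothetical two-state, two-action MDP with two candidate models; the rewards, kernels, and set size are all illustrative assumptions, not taken from the paper (which studies the model-free setting).

```python
import numpy as np

# Robust value iteration sketch: the uncertainty set is a finite collection
# of transition kernels, and the robust Bellman operator takes the worst
# case (min over models) for each state-action pair before maximizing.
gamma = 0.9
R = np.array([[0.0, 1.0], [1.0, 0.0]])  # reward[s, a] (toy values)
# Two candidate models, each indexed P[a][s, s'] for 2 states, 2 actions.
P1 = np.array([[[0.9, 0.1], [0.1, 0.9]], [[0.5, 0.5], [0.5, 0.5]]])
P2 = np.array([[[0.7, 0.3], [0.3, 0.7]], [[0.6, 0.4], [0.4, 0.6]]])
models = [P1, P2]

V = np.zeros(2)
for _ in range(200):
    # Worst-case expected next value over the uncertainty set, per action.
    Q = np.stack([R[:, a] + gamma * np.min([P[a] @ V for P in models], axis=0)
                  for a in range(2)], axis=1)
    V = Q.max(axis=1)  # greedy improvement against the worst case
```

Since the robust Bellman operator is still a gamma-contraction, the iteration converges to the unique robust value function.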
A finite time analysis of temporal difference learning with linear function approximation
Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value
function corresponding to a given policy in a Markov decision process. Although TD is one of …
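The TD iteration with linear function approximation analyzed in this line of work can be sketched as follows. The two-state chain, one-hot features, and stepsize are hypothetical choices for illustration, not the paper's setting.

```python
import numpy as np

def td0_linear(features, rewards, next_features, alpha=0.1, gamma=0.9):
    """TD(0) with linear value approximation V(s) = theta^T phi(s)."""
    theta = np.zeros(features.shape[1])
    for phi_s, r, phi_next in zip(features, rewards, next_features):
        # TD error: bootstrapped target minus current estimate.
        delta = r + gamma * theta @ phi_next - theta @ phi_s
        theta += alpha * delta * phi_s
    return theta

# Toy two-state chain alternating s0 -> s1 -> s0, constant reward 1,
# one-hot features; the true value of each state is 1/(1-0.9) = 10.
phi = np.eye(2)
feats = np.array([phi[0], phi[1]] * 1000)
next_feats = np.array([phi[1], phi[0]] * 1000)
rews = np.ones(2000)
theta = td0_linear(feats, rews, next_feats)
```

With one-hot features this reduces to tabular TD(0); the same update applies unchanged to any feature map.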
Global convergence of policy gradient methods to (almost) locally optimal policies
Policy gradient (PG) methods have been one of the most essential ingredients of
reinforcement learning, with application in a variety of domains. In spite of the empirical …
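The policy gradient ascent studied in this paper can be illustrated with a minimal REINFORCE-style sketch on a hypothetical two-armed bandit with a softmax policy; the bandit, stepsize, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Two-armed bandit: arm 1 pays 1, arm 0 pays 0. REINFORCE ascends the
# expected reward via the score-function gradient R * grad log pi(a).
theta = np.zeros(2)
alpha = 0.5
for _ in range(500):
    p = softmax(theta)
    a = rng.choice(2, p=p)
    r = 1.0 if a == 1 else 0.0
    grad_logp = -p            # grad of log softmax w.r.t. theta ...
    grad_logp[a] += 1.0       # ... is e_a - pi
    theta += alpha * r * grad_logp

p_final = softmax(theta)      # probability mass shifts to the better arm
```

The objective here is concave enough that the iterates reach the optimum; the paper's contribution concerns the harder question of when such iterates escape to (almost) locally optimal policies in general MDPs.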
Finite-sample analysis for sarsa with linear function approximation
SARSA is an on-policy algorithm to learn a Markov decision process policy in reinforcement
learning. We investigate the SARSA algorithm with linear function approximation under the …
A finite-time analysis of two time-scale actor-critic methods
Actor-critic (AC) methods have exhibited great empirical success compared with other
reinforcement learning algorithms, where the actor uses the policy gradient to improve the …
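The two-timescale structure in the title means the critic runs with an asymptotically larger stepsize than the actor, so the critic "tracks" the current policy. A minimal sketch on a hypothetical two-armed bandit, with a scalar baseline as the critic (the stepsize exponents and reward values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Two-timescale actor-critic: critic stepsize beta_t = t^{-0.6} decays
# more slowly (faster timescale) than actor stepsize alpha_t = t^{-0.9}.
theta = np.zeros(2)   # actor: softmax policy parameters
v = 0.0               # critic: scalar average-reward baseline
for t in range(1, 2001):
    beta = 1.0 / t**0.6
    alpha = 1.0 / t**0.9
    p = softmax(theta)
    a = rng.choice(2, p=p)
    r = 1.0 if a == 1 else 0.2
    v += beta * (r - v)                    # critic: fast tracking update
    grad_logp = -p
    grad_logp[a] += 1.0
    theta += alpha * (r - v) * grad_logp   # actor: advantage-weighted step

p_final = softmax(theta)
```

The stepsize separation is what the finite-time analyses exploit: on the actor's slow clock the critic error looks nearly stationary.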
On finite-time convergence of actor-critic algorithm
Actor-critic algorithms and their extensions have made great achievements in real-world
decision-making problems. In contrast to their empirical success, the theoretical understanding …
A single-timescale method for stochastic bilevel optimization
Stochastic bilevel optimization generalizes the classic stochastic optimization from the
minimization of a single objective to the minimization of an objective function that depends …
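The nested structure can be made concrete with a toy quadratic bilevel problem where the inner solution and hypergradient are known in closed form; this sketch of a single-timescale scheme (both variables updated each step with O(1) stepsizes) is illustrative and not the paper's algorithm.

```python
# Single-timescale sketch on a toy bilevel problem:
#   outer: min_x f(x, y) = 0.5*(y - 1)**2 + 0.5*x**2
#   inner: min_y g(x, y) = 0.5*(y - x)**2   =>  y*(x) = x
# Hypergradient: df/dx = x + (y*(x) - 1) * dy*/dx, with dy*/dx = 1 here.
# Both x and y take one gradient step per iteration (single timescale),
# with the current inner iterate y standing in for y*(x).
x, y = 0.0, 0.0
eta = 0.1
for _ in range(500):
    y -= eta * (y - x)             # one inner gradient step on g
    hypergrad = x + (y - 1.0)      # approximate outer gradient at current y
    x -= eta * hypergrad

# Closed-form bilevel solution for this instance: x* = y* = 0.5.
```

Because the inner iterate is only approximately optimal at each step, the analysis must control the coupled error between y and y*(x), which is the crux of single-timescale convergence proofs.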
On the sample complexity of actor-critic method for reinforcement learning with function approximation
Reinforcement learning, mathematically described by Markov Decision Problems, may be
approached either through dynamic programming or policy search. Actor-critic algorithms …
Finite-time analysis of whittle index based Q-learning for restless multi-armed bandits with neural network function approximation
G Xiong, J Li - Advances in Neural Information Processing …, 2023 - proceedings.neurips.cc
Whittle index policy is a heuristic to the intractable restless multi-armed bandits (RMAB)
problem. Although it is provably asymptotically optimal, finding Whittle indices remains …