Online robust reinforcement learning with model uncertainty

Y Wang, S Zou - Advances in Neural Information Processing …, 2021 - proceedings.neurips.cc
Robust reinforcement learning (RL) seeks a policy that optimizes the worst-case
performance over an uncertainty set of MDPs. In this paper, we focus on model-free robust …
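The robust objective this paper targets can be illustrated with a toy sketch. The following is not the paper's model-free algorithm; it is a minimal model-based robust value iteration over a finite uncertainty set of transition kernels (the set structure, shapes, and function name are illustrative assumptions), showing the worst-case Bellman operator the robust policy optimizes against.

```python
import numpy as np

def robust_value_iteration(kernels, rewards, gamma=0.9, iters=200):
    """Robust value iteration over a finite uncertainty set of MDP models.

    kernels: list of arrays P[a, s, s'] -- candidate transition kernels.
    rewards: array r[a, s].
    The robust Bellman operator evaluates each (s, a) under the
    worst-case kernel in the uncertainty set before maximizing over actions.
    """
    v = np.zeros(rewards.shape[1])
    for _ in range(iters):
        # expected next value under each candidate kernel, then min over the set
        worst_next = np.min([P @ v for P in kernels], axis=0)  # shape (a, s)
        v = np.max(rewards + gamma * worst_next, axis=0)
    return v
```

With a single kernel the uncertainty set is a singleton and this reduces to ordinary value iteration.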

A finite time analysis of temporal difference learning with linear function approximation

J Bhandari, D Russo, R Singal - Conference on learning …, 2018 - proceedings.mlr.press
Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value
function corresponding to a given policy in a Markov decision process. Although TD is one of …
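The TD iteration analyzed in this paper can be sketched briefly. This is a generic semi-gradient TD(0) update with linear function approximation, not the paper's analysis; the function name and data layout are illustrative assumptions.

```python
import numpy as np

def td0_linear(features, rewards, next_features, alpha=0.1, gamma=0.9):
    """TD(0) with linear function approximation: V(s) ~ theta @ phi(s).

    features[t] and next_features[t] are feature vectors phi(s_t), phi(s_{t+1})
    observed along a trajectory generated by the policy being evaluated.
    """
    theta = np.zeros(features.shape[1])
    for phi, r, phi_next in zip(features, rewards, next_features):
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = r + gamma * theta @ phi_next - theta @ phi
        # semi-gradient update: move theta along phi by alpha * delta
        theta += alpha * delta * phi
    return theta
```

Finite-time analyses of this iteration bound the distance of theta to the TD fixed point after a given number of such updates.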

Global convergence of policy gradient methods to (almost) locally optimal policies

K Zhang, A Koppel, H Zhu, T Basar - SIAM Journal on Control and …, 2020 - SIAM
Policy gradient (PG) methods have been one of the most essential ingredients of
reinforcement learning, with application in a variety of domains. In spite of the empirical …

Finite-sample analysis for SARSA with linear function approximation

S Zou, T Xu, Y Liang - Advances in neural information …, 2019 - proceedings.neurips.cc
SARSA is an on-policy algorithm to learn a Markov decision process policy in reinforcement
learning. We investigate the SARSA algorithm with linear function approximation under the …
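The SARSA update with linear function approximation studied here can be sketched as a single step. This is the standard on-policy update, with the function name and feature layout as illustrative assumptions.

```python
import numpy as np

def sarsa_linear_step(theta, phi_sa, r, phi_next_sa, alpha=0.1, gamma=0.9):
    """One SARSA update with linear approximation Q(s, a) ~ theta @ phi(s, a).

    phi_sa and phi_next_sa are feature vectors of the current and next
    (state, action) pairs, with the next action drawn from the current
    (e.g. epsilon-greedy) policy -- which is what makes SARSA on-policy.
    """
    delta = r + gamma * theta @ phi_next_sa - theta @ phi_sa
    return theta + alpha * delta * phi_sa
```

Because the next action comes from the policy being improved, the analysis must handle a drifting behavior policy, unlike off-policy Q-learning.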

A finite-time analysis of two time-scale actor-critic methods

YF Wu, W Zhang, P Xu, Q Gu - Advances in Neural …, 2020 - proceedings.neurips.cc
Actor-critic (AC) methods have exhibited great empirical success compared with other
reinforcement learning algorithms, where the actor uses the policy gradient to improve the …
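The two time-scale structure (a fast critic, a slow actor) can be illustrated on a toy problem. This is a minimal sketch on a single-state two-armed bandit, not the paper's algorithm; the arm means, stepsizes, and function name are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_timescale_ac(n_steps=2000, alpha=0.5, beta=0.05):
    """Two time-scale actor-critic on a toy 2-armed bandit.

    The critic (action-value estimates q) uses the faster stepsize alpha;
    the actor (softmax logits theta) uses the slower stepsize beta,
    mirroring the two time-scale structure. Arm means are 0.0 and 1.0.
    """
    q = np.zeros(2)       # critic: action-value estimates
    theta = np.zeros(2)   # actor: softmax policy parameters
    means = np.array([0.0, 1.0])
    for _ in range(n_steps):
        p = np.exp(theta - theta.max())
        p /= p.sum()
        a = rng.choice(2, p=p)
        r = means[a] + 0.1 * rng.standard_normal()
        q[a] += alpha * (r - q[a])        # fast critic update
        grad = -p
        grad[a] += 1.0                    # grad of log pi(a) for softmax
        theta += beta * q[a] * grad       # slow actor update using the critic
    return p
```

Run as `p = two_timescale_ac()`; the policy concentrates on the higher-mean arm. Finite-time analyses quantify how the two stepsizes must relate for such coupled updates to converge.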

On finite-time convergence of actor-critic algorithm

S Qiu, Z Yang, J Ye, Z Wang - IEEE Journal on Selected Areas …, 2021 - ieeexplore.ieee.org
Actor-critic algorithms and their extensions have achieved great success in real-world
decision-making problems. In contrast to their empirical success, the theoretical understanding …

A single-timescale method for stochastic bilevel optimization

T Chen, Y Sun, Q Xiao, W Yin - International Conference on …, 2022 - proceedings.mlr.press
Stochastic bilevel optimization generalizes the classic stochastic optimization from the
minimization of a single objective to the minimization of an objective function that depends …
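The single-timescale idea, updating the inner and outer variables once each per iteration instead of solving the inner problem to convergence, can be sketched on a toy quadratic bilevel problem. This is a deterministic illustration under assumed objectives (f, g, and the exact hypergradient are chosen so the solution is known), not the paper's stochastic method.

```python
def single_timescale_bilevel(steps=500, alpha=0.1, beta=0.1):
    """Single-timescale sketch for a toy quadratic bilevel problem.

    Upper level:  f(x, y) = 0.5 * (x**2 + (y - 3)**2)
    Lower level:  g(x, y) = 0.5 * (y - x)**2, so y*(x) = x, and the
    overall objective F(x) = f(x, y*(x)) is minimized at x = 1.5.
    Both variables take one gradient step per iteration with comparable
    stepsizes, rather than nesting an inner loop.
    """
    x, y = 0.0, 0.0
    for _ in range(steps):
        y -= beta * (y - x)            # one inner step on g(x, .)
        # approximate hypergradient of F at x, using current y for y*(x);
        # here dy*/dx = 1, so dF/dx ~ df/dx + df/dy = x + (y - 3)
        x -= alpha * (x + (y - 3))
    return x, y
```

Both variables converge jointly to the bilevel solution (x, y) = (1.5, 1.5), which is the behavior the single-timescale analysis establishes in the stochastic setting.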

On the sample complexity of actor-critic method for reinforcement learning with function approximation

H Kumar, A Koppel, A Ribeiro - Machine Learning, 2023 - Springer
Reinforcement learning, mathematically described by Markov Decision Problems, may be
approached either through dynamic programming or policy search. Actor-critic algorithms …

Finite-time analysis of whittle index based Q-learning for restless multi-armed bandits with neural network function approximation

G Xiong, J Li - Advances in Neural Information Processing …, 2023 - proceedings.neurips.cc
The Whittle index policy is a heuristic for the intractable restless multi-armed bandit (RMAB)
problem. Although it is provably asymptotically optimal, finding Whittle indices remains …