Online robust reinforcement learning with model uncertainty

Y Wang, S Zou - Advances in Neural Information Processing …, 2021 - proceedings.neurips.cc
Robust reinforcement learning (RL) seeks a policy that optimizes the worst-case
performance over an uncertainty set of MDPs. In this paper, we focus on model-free robust …
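The robust objective this paper targets can be illustrated with a toy sketch. The following is not the paper's model-free algorithm; it is a minimal model-based robust value iteration over a finite uncertainty set of transition kernels (the set structure, shapes, and function name are illustrative assumptions), showing the worst-case Bellman operator the robust policy optimizes against.

```python
import numpy as np

def robust_value_iteration(kernels, rewards, gamma=0.9, iters=200):
    """Robust value iteration over a finite uncertainty set of MDP models.

    kernels: list of arrays P[a, s, s'] -- candidate transition kernels.
    rewards: array r[a, s].
    The robust Bellman operator evaluates each (s, a) under the
    worst-case kernel in the uncertainty set before maximizing over actions.
    """
    v = np.zeros(rewards.shape[1])
    for _ in range(iters):
        # expected next value under each candidate kernel, then min over the set
        worst_next = np.min([P @ v for P in kernels], axis=0)  # shape (a, s)
        v = np.max(rewards + gamma * worst_next, axis=0)
    return v
```

With a single kernel the uncertainty set is a singleton and this reduces to ordinary value iteration.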

A finite time analysis of temporal difference learning with linear function approximation

J Bhandari, D Russo, R Singal - Conference on learning …, 2018 - proceedings.mlr.press
Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value
function corresponding to a given policy in a Markov decision process. Although TD is one of …
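The TD iteration analyzed in this paper can be sketched briefly. This is a generic semi-gradient TD(0) update with linear function approximation, not the paper's analysis; the function name and data layout are illustrative assumptions.

```python
import numpy as np

def td0_linear(features, rewards, next_features, alpha=0.1, gamma=0.9):
    """TD(0) with linear function approximation: V(s) ~ theta @ phi(s).

    features[t] and next_features[t] are feature vectors phi(s_t), phi(s_{t+1})
    observed along a trajectory generated by the policy being evaluated.
    """
    theta = np.zeros(features.shape[1])
    for phi, r, phi_next in zip(features, rewards, next_features):
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = r + gamma * theta @ phi_next - theta @ phi
        # semi-gradient update: move theta along phi by alpha * delta
        theta += alpha * delta * phi
    return theta
```

Finite-time analyses of this iteration bound the distance of theta to the TD fixed point after a given number of such updates.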

Global convergence of policy gradient methods to (almost) locally optimal policies

K Zhang, A Koppel, H Zhu, T Basar - SIAM Journal on Control and …, 2020 - SIAM
Policy gradient (PG) methods have been one of the most essential ingredients of
reinforcement learning, with application in a variety of domains. In spite of the empirical …

Finite-sample analysis for SARSA with linear function approximation

S Zou, T Xu, Y Liang - Advances in neural information …, 2019 - proceedings.neurips.cc
SARSA is an on-policy algorithm to learn a Markov decision process policy in reinforcement
learning. We investigate the SARSA algorithm with linear function approximation under the …
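The SARSA update with linear function approximation studied here can be sketched as a single step. This is the standard on-policy update, with the function name and feature layout as illustrative assumptions.

```python
import numpy as np

def sarsa_linear_step(theta, phi_sa, r, phi_next_sa, alpha=0.1, gamma=0.9):
    """One SARSA update with linear approximation Q(s, a) ~ theta @ phi(s, a).

    phi_sa and phi_next_sa are feature vectors of the current and next
    (state, action) pairs, with the next action drawn from the current
    (e.g. epsilon-greedy) policy -- which is what makes SARSA on-policy.
    """
    delta = r + gamma * theta @ phi_next_sa - theta @ phi_sa
    return theta + alpha * delta * phi_sa
```

Because the next action comes from the policy being improved, the analysis must handle a drifting behavior policy, unlike off-policy Q-learning.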

A finite-time analysis of two time-scale actor-critic methods

YF Wu, W Zhang, P Xu, Q Gu - Advances in Neural …, 2020 - proceedings.neurips.cc
Actor-critic (AC) methods have exhibited great empirical success compared with other
reinforcement learning algorithms, where the actor uses the policy gradient to improve the …
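The two time-scale structure (a fast critic, a slow actor) can be illustrated on a toy problem. This is a minimal sketch on a single-state two-armed bandit, not the paper's algorithm; the arm means, stepsizes, and function name are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_timescale_ac(n_steps=2000, alpha=0.5, beta=0.05):
    """Two time-scale actor-critic on a toy 2-armed bandit.

    The critic (action-value estimates q) uses the faster stepsize alpha;
    the actor (softmax logits theta) uses the slower stepsize beta,
    mirroring the two time-scale structure. Arm means are 0.0 and 1.0.
    """
    q = np.zeros(2)       # critic: action-value estimates
    theta = np.zeros(2)   # actor: softmax policy parameters
    means = np.array([0.0, 1.0])
    for _ in range(n_steps):
        p = np.exp(theta - theta.max())
        p /= p.sum()
        a = rng.choice(2, p=p)
        r = means[a] + 0.1 * rng.standard_normal()
        q[a] += alpha * (r - q[a])        # fast critic update
        grad = -p
        grad[a] += 1.0                    # grad of log pi(a) for softmax
        theta += beta * q[a] * grad       # slow actor update using the critic
    return p
```

Run as `p = two_timescale_ac()`; the policy concentrates on the higher-mean arm. Finite-time analyses quantify how the two stepsizes must relate for such coupled updates to converge.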

On finite-time convergence of actor-critic algorithm

S Qiu, Z Yang, J Ye, Z Wang - IEEE Journal on Selected Areas …, 2021 - ieeexplore.ieee.org
Actor-critic algorithms and their extensions have achieved great success in real-world
decision-making problems. In contrast to their empirical success, the theoretical understanding …

A single-timescale method for stochastic bilevel optimization

T Chen, Y Sun, Q Xiao, W Yin - International Conference on …, 2022 - proceedings.mlr.press
Stochastic bilevel optimization generalizes the classic stochastic optimization from the
minimization of a single objective to the minimization of an objective function that depends …
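The single-timescale idea, updating the inner and outer variables once each per iteration instead of solving the inner problem to convergence, can be sketched on a toy quadratic bilevel problem. This is a deterministic illustration under assumed objectives (f, g, and the exact hypergradient are chosen so the solution is known), not the paper's stochastic method.

```python
def single_timescale_bilevel(steps=500, alpha=0.1, beta=0.1):
    """Single-timescale sketch for a toy quadratic bilevel problem.

    Upper level:  f(x, y) = 0.5 * (x**2 + (y - 3)**2)
    Lower level:  g(x, y) = 0.5 * (y - x)**2, so y*(x) = x, and the
    overall objective F(x) = f(x, y*(x)) is minimized at x = 1.5.
    Both variables take one gradient step per iteration with comparable
    stepsizes, rather than nesting an inner loop.
    """
    x, y = 0.0, 0.0
    for _ in range(steps):
        y -= beta * (y - x)            # one inner step on g(x, .)
        # approximate hypergradient of F at x, using current y for y*(x);
        # here dy*/dx = 1, so dF/dx ~ df/dx + df/dy = x + (y - 3)
        x -= alpha * (x + (y - 3))
    return x, y
```

Both variables converge jointly to the bilevel solution (x, y) = (1.5, 1.5), which is the behavior the single-timescale analysis establishes in the stochastic setting.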

On the sample complexity of actor-critic method for reinforcement learning with function approximation

H Kumar, A Koppel, A Ribeiro - Machine Learning, 2023 - Springer
Reinforcement learning, mathematically described by Markov Decision Problems, may be
approached either through dynamic programming or policy search. Actor-critic algorithms …

Finite-time analysis of whittle index based Q-learning for restless multi-armed bandits with neural network function approximation

G Xiong, J Li - Advances in Neural Information Processing …, 2023 - proceedings.neurips.cc
The Whittle index policy is a heuristic for the intractable restless multi-armed bandit (RMAB)
problem. Although it is provably asymptotically optimal, finding Whittle indices remains …