Stochastic policy gradient methods: Improved sample complexity for Fisher-non-degenerate policies

I Fatkhullin, A Barakat, A Kireeva… - … Conference on Machine …, 2023 - proceedings.mlr.press
Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed
the development of their theoretical foundations. Despite the huge efforts directed at the …
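
For orientation, the vanilla stochastic policy-gradient estimator that analyses like this build on can be sketched as below; the random MDP, softmax parameterization, and all hyperparameters are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of a REINFORCE-style stochastic policy gradient step for a
# tabular softmax policy; the MDP and all constants here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
S, A, H, gamma, eta = 4, 3, 20, 0.99, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
r = rng.uniform(size=(S, A))                 # reward table
theta = np.zeros((S, A))                     # softmax policy parameters

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def sample_trajectory():
    s, traj = 0, []
    for _ in range(H):
        a = rng.choice(A, p=policy(s))
        traj.append((s, a))
        s = rng.choice(S, p=P[s, a])
    return traj

# One stochastic update: score function weighted by discounted return-to-go.
traj = sample_trajectory()
G, grad = 0.0, np.zeros_like(theta)
for t in reversed(range(len(traj))):
    s, a = traj[t]
    G = r[s, a] + gamma * G
    g = -policy(s)            # d log pi(a|s) / d theta[s, :] for softmax
    g[a] += 1.0
    grad[s] += (gamma ** t) * G * g
theta += eta * grad
```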

Linear convergence of natural policy gradient methods with log-linear policies

R Yuan, SS Du, RM Gower, A Lazaric… - arXiv preprint arXiv …, 2022 - arxiv.org
We consider infinite-horizon discounted Markov decision processes and study the
convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log …
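
For reference, the NPG update and the log-linear policy class studied in this line of work take the standard form below, with F^† the Moore-Penrose pseudoinverse of the Fisher matrix and φ a fixed feature map (notation is the usual convention, not necessarily the paper's).

```latex
% Standard NPG update with Fisher preconditioning, and the log-linear class:
\theta_{t+1} = \theta_t + \eta\, F_\rho(\theta_t)^{\dagger}\, \nabla_\theta V^{\pi_{\theta_t}}(\rho),
\qquad
F_\rho(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\,
                 \nabla_\theta \log \pi_\theta(a \mid s)^{\top}\big],
\qquad
\pi_\theta(a \mid s) = \frac{\exp(\phi(s,a)^{\top}\theta)}{\sum_{a'} \exp(\phi(s,a')^{\top}\theta)}.
```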

Improved sample complexity analysis of natural policy gradient algorithm with general parameterization for infinite horizon discounted reward Markov decision …

WU Mondal, V Aggarwal - International Conference on …, 2024 - proceedings.mlr.press
We consider the problem of designing sample efficient learning algorithms for infinite
horizon discounted reward Markov Decision Process. Specifically, we propose the …
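
A schematic sample-based NPG step underlying such analyses estimates the gradient and Fisher matrix from score vectors and returns, then solves the preconditioning system; the stand-in data and regularization below are assumptions for illustration only.

```python
# Schematic sample-based NPG step: estimate gradient g and Fisher matrix F
# from sampled score vectors, then solve F w = g (all details assumed).
import numpy as np

rng = np.random.default_rng(1)
d, n, eta, lam = 8, 256, 0.05, 1e-3
scores = rng.normal(size=(n, d))     # grad log pi at sampled (s, a) pairs
returns = rng.normal(size=n)         # discounted returns-to-go (stand-in data)
theta = np.zeros(d)

g = (scores * returns[:, None]).mean(axis=0)     # policy gradient estimate
F = scores.T @ scores / n + lam * np.eye(d)      # regularized Fisher estimate
w = np.linalg.solve(F, g)                        # natural gradient direction
theta += eta * w                                 # parameter update
```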

A novel framework for policy mirror descent with general parameterization and linear convergence

C Alfano, R Yuan, P Rebeschini - Advances in Neural …, 2023 - proceedings.neurips.cc
Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe
their success to the use of parameterized policies. However, while theoretical guarantees …
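
The generic policy mirror descent step behind such frameworks is the standard Bregman-regularized update, where D_h is the Bregman divergence induced by a mirror map h (notation assumed, not necessarily the paper's).

```latex
% Generic PMD step: Bregman-regularized greedy improvement w.r.t. Q^{\pi_t}.
\pi_{t+1}(\cdot \mid s) \;=\; \operatorname*{arg\,max}_{p \,\in\, \Delta(A)}
\Big\{ \eta\, \big\langle Q^{\pi_t}(s, \cdot),\, p \big\rangle
       \;-\; D_h\big(p,\, \pi_t(\cdot \mid s)\big) \Big\}.
```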

Optimal convergence rate for exact policy mirror descent in discounted Markov decision processes

E Johnson, C Pike-Burke… - Advances in Neural …, 2023 - proceedings.neurips.cc
Abstract Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide
range of novel and fundamental methods in reinforcement learning. Motivated by the …
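
With the KL divergence as regularizer, exact PMD reduces to a multiplicative-weights update on the policy; a minimal runnable sketch on a small random MDP, with exact policy evaluation and illustrative constants, follows.

```python
# Minimal sketch of exact policy mirror descent with a KL regularizer on a
# small random MDP; exact Q-values via a linear solve (setup is illustrative).
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, eta = 4, 3, 0.9, 1.0
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
r = rng.uniform(size=(S, A))
pi = np.full((S, A), 1.0 / A)

def q_values(pi):
    # Solve (I - gamma P_pi) v = r_pi exactly, then back out Q = r + gamma P v.
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = (pi * r).sum(axis=1)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * P @ v

for _ in range(100):
    Q = q_values(pi)
    pi = pi * np.exp(eta * Q)           # KL mirror step = multiplicative weights
    pi /= pi.sum(axis=1, keepdims=True)
print((pi * q_values(pi)).sum(axis=1))  # per-state value of the final policy
```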

Performance bounds for policy-based average reward reinforcement learning algorithms

Y Murthy, M Moharrami… - Advances in Neural …, 2023 - proceedings.neurips.cc
Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations
of approximate policy iteration (PI), i.e., where policy improvement and policy evaluation are …
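
The approximate policy iteration template referred to here alternates inexact evaluation with greedy improvement; the tolerance ε and greedy rule below are the generic textbook form, not this paper's specific algorithm.

```latex
% Approximate policy iteration template: inexact evaluation, greedy improvement.
\hat{Q}^{\pi_k} \approx Q^{\pi_k}
\quad \text{with} \quad \big\| \hat{Q}^{\pi_k} - Q^{\pi_k} \big\|_\infty \le \varepsilon,
\qquad
\pi_{k+1}(\cdot \mid s) \in \operatorname*{arg\,max}_{a}\, \hat{Q}^{\pi_k}(s, a).
```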

Sample-Efficient Constrained Reinforcement Learning with General Parameterization

WU Mondal, V Aggarwal - arXiv preprint arXiv:2405.10624, 2024 - arxiv.org
We consider a constrained Markov Decision Problem (CMDP) where the goal of an agent is
to maximize the expected discounted sum of rewards over an infinite horizon while ensuring …
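
The underlying CMDP can be written in the standard single-constraint form below, together with the Lagrangian used by primal-dual methods; the sign convention on the constraint and the threshold b are assumptions for illustration.

```latex
% Standard CMDP formulation (constraint direction assumed) and its Lagrangian.
\max_{\pi}\; J_r(\pi) = \mathbb{E}_\pi\Big[\textstyle\sum_{t \ge 0} \gamma^t\, r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
J_c(\pi) = \mathbb{E}_\pi\Big[\textstyle\sum_{t \ge 0} \gamma^t\, c(s_t, a_t)\Big] \ge b,
\qquad
L(\pi, \lambda) = J_r(\pi) + \lambda\, \big(J_c(\pi) - b\big).
```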

On the Convergence of Natural Policy Gradient and Mirror Descent-Like Policy Methods for Average-Reward MDPs

Y Murthy, R Srikant - 2023 62nd IEEE Conference on Decision …, 2023 - ieeexplore.ieee.org
It is now well known that Natural Policy Gradient (NPG) globally converges for discounted-
reward MDPs in the tabular setting, with perfect value function estimates. However, the result …
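
In the average-reward setting, the gain g^π and bias h^π replace the discounted value function, and the tabular NPG/mirror step again takes a multiplicative form; the notation below is the generic one, not necessarily the paper's.

```latex
% Average-reward setting: gain g^\pi and bias h^\pi solve the Poisson equation;
% the tabular NPG/mirror step is again multiplicative (generic notation).
g^{\pi} + h^{\pi}(s) = \sum_a \pi(a \mid s)
  \Big[ r(s, a) + \sum_{s'} P(s' \mid s, a)\, h^{\pi}(s') \Big],
\qquad
\pi_{t+1}(a \mid s) \;\propto\; \pi_t(a \mid s)\, \exp\!\big(\eta\, q^{\pi_t}(s, a)\big).
```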

Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning

E Anand, I Karmarkar, G Qu - arXiv preprint arXiv:2412.00661, 2024 - arxiv.org
Designing efficient algorithms for multi-agent reinforcement learning (MARL) is
fundamentally challenging due to the fact that the size of the joint state and action spaces …
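
A minimal sketch of the mean-field idea such methods exploit: each agent summarizes the others by their empirical state distribution, so its input size is fixed in the number of agents rather than growing with the joint state space. Everything below (state space, sizes) is illustrative.

```python
# Mean-field summary: an agent's input is its own state plus the empirical
# distribution of the other agents' states, a fixed-size vector of length |S|.
import numpy as np

rng = np.random.default_rng(3)
n_agents, S = 100, 5
states = rng.integers(0, S, size=n_agents)

def mean_field_input(i):
    others = np.delete(states, i)
    emp = np.bincount(others, minlength=S) / others.size  # empirical distribution
    return states[i], emp

own, mf = mean_field_input(0)
print(own, mf)  # fixed-size summary regardless of n_agents
```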

Approximate Global Convergence of Independent Learning in Multi-Agent Systems

R Jin, Z Chen, Y Lin, J Song, A Wierman - arXiv preprint arXiv:2405.19811, 2024 - arxiv.org
Independent learning (IL), despite being a popular approach in practice to achieve
scalability in large-scale multi-agent systems, usually lacks global convergence guarantees …
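
Schematically, independent learning has each agent run a single-agent update on its own observations, treating the rest of the system as part of the environment; a minimal tabular Q-learning sketch (all constants illustrative) follows.

```python
# Schematic independent learning: each agent maintains its own tabular
# Q-function and updates it from local transitions, ignoring other agents.
import numpy as np

n_agents, S, A, alpha, gamma = 3, 4, 2, 0.1, 0.95
Q = [np.zeros((S, A)) for _ in range(n_agents)]

def il_update(i, s, a, r, s_next):
    # Agent i treats the other agents as part of the environment dynamics.
    td = r + gamma * Q[i][s_next].max() - Q[i][s, a]
    Q[i][s, a] += alpha * td

il_update(0, s=1, a=0, r=1.0, s_next=2)  # one local transition for agent 0
```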