Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint

W Xiong, H Dong, C Ye, Z Wang, H Zhong… - … on Machine Learning, 2024 - openreview.net
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
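The KL-constrained objective this line of work analyses is, in the standard formulation (notation below is the conventional one, not copied from the paper), reward maximisation regularised toward a reference model $\pi_0$:

```latex
\max_{\pi}\; \mathbb{E}_{x \sim d_0}\Big[\,
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \eta \, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x) \big)
\Big],
```

whose maximiser has the closed form $\pi^{*}(y \mid x) \propto \pi_0(y \mid x)\exp\!\big(r(x,y)/\eta\big)$.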

Stochastic policy gradient methods: Improved sample complexity for Fisher-non-degenerate policies

I Fatkhullin, A Barakat, A Kireeva… - … Conference on Machine …, 2023 - proceedings.mlr.press
Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed
the development of their theoretical foundations. Despite the huge efforts directed at the …
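For concreteness, a minimal sketch of the vanilla score-function (REINFORCE-style) stochastic policy gradient that such sample-complexity analyses refine; the one-step Gaussian-policy bandit, step size, and batch size are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step continuous bandit (illustrative): reward -(a - 3)^2,
# Gaussian policy a ~ N(theta, sigma^2); the optimal mean is theta = 3.
sigma, lr, batch = 1.0, 0.05, 256
theta = 0.0

def reward(a):
    return -(a - 3.0) ** 2

for _ in range(500):
    a = rng.normal(theta, sigma, size=batch)
    # Score function: d/dtheta log N(a; theta, sigma^2) = (a - theta) / sigma^2,
    # giving the unbiased estimator mean[ r(a) * score ].
    grad = np.mean(reward(a) * (a - theta) / sigma**2)
    theta += lr * grad

print(round(theta, 2))  # drifts toward the optimal mean 3.0
```

Variance-reduced and Fisher-non-degenerate analyses improve on exactly this estimator's sample complexity.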

Last-iterate convergent policy gradient primal-dual methods for constrained MDPs

D Ding, CY Wei, K Zhang… - Advances in Neural …, 2024 - proceedings.neurips.cc
We study the problem of computing an optimal policy of an infinite-horizon discounted
constrained Markov decision process (constrained MDP). Despite the popularity of …

A novel framework for policy mirror descent with general parameterization and linear convergence

C Alfano, R Yuan, P Rebeschini - Advances in Neural …, 2024 - proceedings.neurips.cc
Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe
their success to the use of parameterized policies. However, while theoretical guarantees …

Reinforcement learning with general utilities: Simpler variance reduction and large state-action space

A Barakat, I Fatkhullin, N He - International Conference on …, 2023 - proceedings.mlr.press
We consider the reinforcement learning (RL) problem with general utilities which consists in
maximizing a function of the state-action occupancy measure. Beyond the standard …
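The general-utilities objective referred to above is usually written (up to normalisation conventions) as a function of the discounted state-action occupancy measure:

```latex
\max_{\pi}\; F\big(\lambda^{\pi}\big),
\qquad
\lambda^{\pi}(s, a) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t}\,
\mathbb{P}^{\pi}\big( s_t = s,\, a_t = a \big),
```

which recovers standard RL when $F$ is linear, $F(\lambda) = \langle \lambda, r \rangle$.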

Optimal convergence rate for exact policy mirror descent in discounted Markov decision processes

E Johnson, C Pike-Burke… - Advances in Neural …, 2024 - proceedings.neurips.cc
Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide
range of novel and fundamental methods in reinforcement learning. Motivated by the …
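As an illustration of this algorithm family, a minimal sketch of exact tabular PMD with the KL Bregman divergence, which reduces to the multiplicative-weights step $\pi_{k+1}(a \mid s) \propto \pi_k(a \mid s)\exp(\eta\, Q^{\pi_k}(s,a))$; the toy MDP and step size are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Toy 2-state, 2-action discounted MDP (values are illustrative assumptions).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # R[s, a]
              [0.0, 2.0]])
gamma, eta, S, A = 0.9, 1.0, 2, 2

def q_values(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    R_pi = np.einsum('sa,sa->s', pi, R)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
    return R + gamma * P @ V              # Q[s, a]

pi = np.full((S, A), 0.5)                 # uniform initial policy
for _ in range(100):
    Q = q_values(pi)
    # Exact PMD step with KL divergence: multiplicative weights on Q
    # (subtracting the row max only for numerical stability).
    pi = pi * np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))
    pi /= pi.sum(axis=1, keepdims=True)

print(np.round(pi, 3))                    # near-deterministic per state
```

With exact gradients the iterates concentrate on the greedy actions of their own Q-values, the regime in which the linear and optimal rates above are stated.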

Rate-optimal policy optimization for linear Markov decision processes

U Sherman, A Cohen, T Koren, Y Mansour - arXiv preprint arXiv …, 2023 - arxiv.org
We study regret minimization in online episodic linear Markov Decision Processes, and
obtain rate-optimal $\widetilde{O}(\sqrt{K})$ regret, where $K$ denotes the number of …
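The regret being minimised is, in the usual episodic convention (a generic formulation, not copied from the paper), the cumulative gap to the best fixed policy over $K$ episodes:

```latex
\mathrm{Regret}(K) \;=\; \max_{\pi} \sum_{k=1}^{K}
\Big( V_1^{\pi, k}(s_1^{k}) - V_1^{\pi_k, k}(s_1^{k}) \Big).
```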

Linear convergence for natural policy gradient with log-linear policy parametrization

C Alfano, P Rebeschini - arXiv preprint arXiv:2209.15382, 2022 - arxiv.org
We analyze the convergence rate of the unregularized natural policy gradient algorithm with
log-linear policy parametrizations in infinite-horizon discounted Markov decision processes …
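For log-linear policies $\pi_\theta(a \mid s) \propto \exp(\theta^{\top}\phi(s,a))$, the NPG update analysed in this line of work is usually written via compatible function approximation (a standard formulation, not copied from the paper):

```latex
\theta_{k+1} = \theta_k + \eta\, w_k,
\qquad
w_k \in \arg\min_{w}\; \mathbb{E}_{(s,a) \sim d_k}
\Big[ \big( Q^{\pi_k}(s,a) - w^{\top}\phi(s,a) \big)^2 \Big],
```

so the induced policy update is multiplicative, $\pi_{k+1}(a \mid s) \propto \pi_k(a \mid s)\exp(\eta\, w_k^{\top}\phi(s,a))$, mirroring the tabular PMD step.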

A Fisher-Rao gradient flow for entropy-regularised Markov decision processes in Polish spaces

B Kerimkulov, JM Leahy, D Siska, L Szpruch… - arXiv preprint arXiv …, 2023 - arxiv.org
We study the global convergence of a Fisher-Rao policy gradient flow for infinite-horizon
entropy-regularised Markov decision processes with Polish state and action space. The flow …

Sample-efficient multi-agent RL: An optimization perspective

N Xiong, Z Liu, Z Wang, Z Yang - arXiv preprint arXiv:2310.06243, 2023 - arxiv.org
We study multi-agent reinforcement learning (MARL) for general-sum Markov games
(MGs) under general function approximation. In order to find the minimum assumption for …