Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint

W Xiong, H Dong, C Ye, Z Wang, H Zhong… - … on Machine Learning, 2024 - openreview.net
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
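The KL-constrained objective this line of work analyses is, in the standard formulation (notation below is the conventional one, not copied from the paper), reward maximisation regularised toward a reference model $\pi_0$:

```latex
\max_{\pi}\; \mathbb{E}_{x \sim d_0}\Big[\,
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \eta \, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x) \big)
\Big],
```

whose maximiser has the closed form $\pi^{*}(y \mid x) \propto \pi_0(y \mid x)\exp\!\big(r(x,y)/\eta\big)$.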

Stochastic policy gradient methods: Improved sample complexity for Fisher-non-degenerate policies

I Fatkhullin, A Barakat, A Kireeva… - … Conference on Machine …, 2023 - proceedings.mlr.press
Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed
the development of their theoretical foundations. Despite the huge efforts directed at the …
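For concreteness, a minimal sketch of the vanilla score-function (REINFORCE-style) stochastic policy gradient that such sample-complexity analyses refine; the one-step Gaussian-policy bandit, step size, and batch size are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step continuous bandit (illustrative): reward -(a - 3)^2,
# Gaussian policy a ~ N(theta, sigma^2); the optimal mean is theta = 3.
sigma, lr, batch = 1.0, 0.05, 256
theta = 0.0

def reward(a):
    return -(a - 3.0) ** 2

for _ in range(500):
    a = rng.normal(theta, sigma, size=batch)
    # Score function: d/dtheta log N(a; theta, sigma^2) = (a - theta) / sigma^2,
    # giving the unbiased estimator mean[ r(a) * score ].
    grad = np.mean(reward(a) * (a - theta) / sigma**2)
    theta += lr * grad

print(round(theta, 2))  # drifts toward the optimal mean 3.0
```

Variance-reduced and Fisher-non-degenerate analyses improve on exactly this estimator's sample complexity.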

Last-iterate convergent policy gradient primal-dual methods for constrained MDPs

D Ding, CY Wei, K Zhang… - Advances in Neural …, 2024 - proceedings.neurips.cc
We study the problem of computing an optimal policy of an infinite-horizon discounted
constrained Markov decision process (constrained MDP). Despite the popularity of …

A novel framework for policy mirror descent with general parameterization and linear convergence

C Alfano, R Yuan, P Rebeschini - Advances in Neural …, 2024 - proceedings.neurips.cc
Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe
their success to the use of parameterized policies. However, while theoretical guarantees …

Reinforcement learning with general utilities: Simpler variance reduction and large state-action space

A Barakat, I Fatkhullin, N He - International Conference on …, 2023 - proceedings.mlr.press
We consider the reinforcement learning (RL) problem with general utilities which consists in
maximizing a function of the state-action occupancy measure. Beyond the standard …
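The general-utilities objective referred to above is usually written (up to normalisation conventions) as a function of the discounted state-action occupancy measure:

```latex
\max_{\pi}\; F\big(\lambda^{\pi}\big),
\qquad
\lambda^{\pi}(s, a) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t}\,
\mathbb{P}^{\pi}\big( s_t = s,\, a_t = a \big),
```

which recovers standard RL when $F$ is linear, $F(\lambda) = \langle \lambda, r \rangle$.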

Optimal convergence rate for exact policy mirror descent in discounted Markov decision processes

E Johnson, C Pike-Burke… - Advances in Neural …, 2024 - proceedings.neurips.cc
Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide
range of novel and fundamental methods in reinforcement learning. Motivated by the …
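As an illustration of this algorithm family, a minimal sketch of exact tabular PMD with the KL Bregman divergence, which reduces to the multiplicative-weights step $\pi_{k+1}(a \mid s) \propto \pi_k(a \mid s)\exp(\eta\, Q^{\pi_k}(s,a))$; the toy MDP and step size are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Toy 2-state, 2-action discounted MDP (values are illustrative assumptions).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # R[s, a]
              [0.0, 2.0]])
gamma, eta, S, A = 0.9, 1.0, 2, 2

def q_values(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    R_pi = np.einsum('sa,sa->s', pi, R)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
    return R + gamma * P @ V              # Q[s, a]

pi = np.full((S, A), 0.5)                 # uniform initial policy
for _ in range(100):
    Q = q_values(pi)
    # Exact PMD step with KL divergence: multiplicative weights on Q
    # (subtracting the row max only for numerical stability).
    pi = pi * np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))
    pi /= pi.sum(axis=1, keepdims=True)

print(np.round(pi, 3))                    # near-deterministic per state
```

With exact gradients the iterates concentrate on the greedy actions of their own Q-values, the regime in which the linear and optimal rates above are stated.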

Rate-optimal policy optimization for linear Markov decision processes

U Sherman, A Cohen, T Koren, Y Mansour - arXiv preprint arXiv …, 2023 - arxiv.org
We study regret minimization in online episodic linear Markov Decision Processes, and
obtain rate-optimal $\widetilde{O}(\sqrt{K})$ regret, where $K$ denotes the number of …
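The regret being minimised is, in the usual episodic convention (a generic formulation, not copied from the paper), the cumulative gap to the best fixed policy over $K$ episodes:

```latex
\mathrm{Regret}(K) \;=\; \max_{\pi} \sum_{k=1}^{K}
\Big( V_1^{\pi, k}(s_1^{k}) - V_1^{\pi_k, k}(s_1^{k}) \Big).
```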

Linear convergence for natural policy gradient with log-linear policy parametrization

C Alfano, P Rebeschini - arXiv preprint arXiv:2209.15382, 2022 - arxiv.org
We analyze the convergence rate of the unregularized natural policy gradient algorithm with
log-linear policy parametrizations in infinite-horizon discounted Markov decision processes …
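For log-linear policies $\pi_\theta(a \mid s) \propto \exp(\theta^{\top}\phi(s,a))$, the NPG update analysed in this line of work is usually written via compatible function approximation (a standard formulation, not copied from the paper):

```latex
\theta_{k+1} = \theta_k + \eta\, w_k,
\qquad
w_k \in \arg\min_{w}\; \mathbb{E}_{(s,a) \sim d_k}
\Big[ \big( Q^{\pi_k}(s,a) - w^{\top}\phi(s,a) \big)^2 \Big],
```

so the induced policy update is multiplicative, $\pi_{k+1}(a \mid s) \propto \pi_k(a \mid s)\exp(\eta\, w_k^{\top}\phi(s,a))$, mirroring the tabular PMD step.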

A Fisher-Rao gradient flow for entropy-regularised Markov decision processes in Polish spaces

B Kerimkulov, JM Leahy, D Siska, L Szpruch… - arXiv preprint arXiv …, 2023 - arxiv.org
We study the global convergence of a Fisher-Rao policy gradient flow for infinite-horizon
entropy-regularised Markov decision processes with Polish state and action space. The flow …

Sample-efficient multi-agent RL: An optimization perspective

N Xiong, Z Liu, Z Wang, Z Yang - arXiv preprint arXiv:2310.06243, 2023 - arxiv.org
We study multi-agent reinforcement learning (MARL) for general-sum Markov games
(MGs) under general function approximation. In order to find the minimum assumption for …