Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
Stochastic policy gradient methods: Improved sample complexity for Fisher-non-degenerate policies
Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed
the development of their theoretical foundations. Despite the huge efforts directed at the …
Last-iterate convergent policy gradient primal-dual methods for constrained MDPs
We study the problem of computing an optimal policy of an infinite-horizon discounted
constrained Markov decision process (constrained MDP). Despite the popularity of …
A novel framework for policy mirror descent with general parameterization and linear convergence
Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe
their success to the use of parameterized policies. However, while theoretical guarantees …
Reinforcement learning with general utilities: Simpler variance reduction and large state-action space
We consider the reinforcement learning (RL) problem with general utilities which consists in
maximizing a function of the state-action occupancy measure. Beyond the standard …
Optimal convergence rate for exact policy mirror descent in discounted Markov decision processes
E Johnson, C Pike-Burke… - Advances in Neural …, 2024 - proceedings.neurips.cc
Abstract Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide
range of novel and fundamental methods in reinforcement learning. Motivated by the …
Rate-optimal policy optimization for linear Markov decision processes
We study regret minimization in online episodic linear Markov Decision Processes, and
obtain rate-optimal $\widetilde{O}(\sqrt{K})$ regret where $K$ denotes the number of …
Linear convergence for natural policy gradient with log-linear policy parametrization
C Alfano, P Rebeschini - arXiv preprint arXiv:2209.15382, 2022 - arxiv.org
We analyze the convergence rate of the unregularized natural policy gradient algorithm with
log-linear policy parametrizations in infinite-horizon discounted Markov decision processes …
A Fisher-Rao gradient flow for entropy-regularised Markov decision processes in Polish spaces
We study the global convergence of a Fisher-Rao policy gradient flow for infinite-horizon
entropy-regularised Markov decision processes with Polish state and action space. The flow …
Sample-efficient multi-agent RL: An optimization perspective
We study multi-agent reinforcement learning (MARL) for general-sum Markov games
(MGs) under general function approximation. In order to find the minimum assumption for …