Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

T Xie, DJ Foster, A Krishnamurthy, C Rosset… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has emerged as a central tool for
language model alignment. We consider online exploration in RLHF, which exploits …

Oracle-Efficient Reinforcement Learning for Max Value Ensembles

M Hussing, M Kearns, A Roth, SB Sengupta… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, both
theoretically (where worst-case sample and computational complexities must scale with …

Can we hop in general? A discussion of benchmark selection and design using the Hopper environment

CA Voelcker, M Hussing, E Eaton - Finding the Frame: An RLC Workshop … - openreview.net
While using off-the-shelf benchmarks in reinforcement learning (RL) research is a common
practice, this choice is rarely discussed. In this paper, we present a case study on different …

Provable Partially Observable Reinforcement Learning with Privileged Information

Y Cai, X Liu, A Oikonomou, K Zhang - ICML 2024 Workshop: Aligning … - openreview.net
Partial observability of the underlying states generally presents significant challenges for
reinforcement learning (RL). In practice, certain privileged information, e.g., the access to …