Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
Reinforcement learning from human feedback (RLHF) has emerged as a central tool for
language model alignment. We consider online exploration in RLHF, which exploits …
Oracle-Efficient Reinforcement Learning for Max Value Ensembles
Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, both
theoretically (where worst-case sample and computational complexities must scale with …
Can we hop in general? A discussion of benchmark selection and design using the Hopper environment
While using off-the-shelf benchmarks in reinforcement learning (RL) research is a common
practice, this choice is rarely discussed. In this paper, we present a case study on different …
Provable Partially Observable Reinforcement Learning with Privileged Information
Y Cai, X Liu, A Oikonomou, K Zhang - ICML 2024 Workshop: Aligning … - openreview.net
Partial observability of the underlying states generally presents significant challenges for
reinforcement learning (RL). In practice, certain privileged information, e.g., the access to …