Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
Reinforcement learning from human feedback (RLHF) has emerged as a central tool for
language model alignment. We consider online exploration in RLHF, which exploits …
Oracle-Efficient Reinforcement Learning for Max Value Ensembles
Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, both
theoretically (where worst-case sample and computational complexities must scale with …
Can we hop in general? A discussion of benchmark selection and design using the Hopper environment
While using off-the-shelf benchmarks in reinforcement learning (RL) research is a common
practice, this choice is rarely discussed. In this paper, we present a case study on different …
Provable Partially Observable Reinforcement Learning with Privileged Information
Y Cai, X Liu, A Oikonomou, K Zhang - ICML 2024 Workshop: Aligning … - openreview.net
Partial observability of the underlying states generally presents significant challenges for
reinforcement learning (RL). In practice, certain privileged information, e.g., the access to …