A tutorial on Thompson sampling

DJ Russo, B Van Roy, A Kazerouni… - … and Trends® in …, 2018 - nowpublishers.com
Thompson sampling is an algorithm for online decision problems where actions are taken
sequentially in a manner that must balance between exploiting what is known to maximize …
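As a concrete illustration of the explore-exploit balance the snippet describes, here is a minimal Thompson sampling sketch for a Bernoulli bandit with Beta priors; the arm means and horizon below are made-up values for illustration, not taken from the tutorial.

    import random

    def thompson_sampling(true_means, horizon, seed=0):
        # Each arm keeps a Beta(successes + 1, failures + 1) posterior.
        rng = random.Random(seed)
        k = len(true_means)
        successes = [0] * k
        failures = [0] * k
        total_reward = 0
        for _ in range(horizon):
            # Draw one plausible mean per arm from its posterior and act
            # greedily on the draws: exploration comes from posterior
            # uncertainty rather than an explicit schedule.
            draws = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                     for i in range(k)]
            arm = max(range(k), key=lambda i: draws[i])
            reward = 1 if rng.random() < true_means[arm] else 0
            successes[arm] += reward
            failures[arm] += 1 - reward
            total_reward += reward
        return total_reward

    # Hypothetical arm means, purely for illustration.
    print(thompson_sampling([0.3, 0.5, 0.7], horizon=1000))

Acting on a single posterior draw is what makes the method probability-matched: each arm is selected with roughly the posterior probability that it is the best one.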

Is pessimism provably efficient for offline RL?

Y Jin, Z Yang, Z Wang - International Conference on …, 2021 - proceedings.mlr.press
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on
a dataset collected a priori. Due to the lack of further interactions with the environment …
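For orientation, the pessimism principle studied here penalizes estimated values by data-coverage uncertainty before acting greedily. A schematic backward-induction step (notation assumed here for illustration, not quoted from the paper) is:

    \widehat{Q}_h(s,a) = \widehat{r}_h(s,a) + (\widehat{P}_h \widehat{V}_{h+1})(s,a) - \Gamma_h(s,a),
    \qquad \widehat{V}_h(s) = \max_a \widehat{Q}_h(s,a),

where the bonus \Gamma_h(s,a) is large for state-action pairs the offline dataset covers poorly, so the learned policy steers away from them.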

The statistical complexity of interactive decision making

DJ Foster, SM Kakade, J Qian, A Rakhlin - arXiv preprint arXiv:2112.13487, 2021 - arxiv.org
A fundamental challenge in interactive learning and decision making, ranging from bandit
problems to reinforcement learning, is to provide sample-efficient, adaptive learning …

Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms

C Jin, Q Liu, S Miryoosefi - Advances in neural information …, 2021 - proceedings.neurips.cc
Finding the minimal structural assumptions that empower sample-efficient learning is one of
the most important research directions in Reinforcement Learning (RL). This paper …

When is partially observable reinforcement learning not scary?

Q Liu, A Chung, C Szepesvári… - Conference on Learning …, 2022 - proceedings.mlr.press
Partial observability is ubiquitous in applications of Reinforcement Learning (RL), in which
agents learn to make a sequence of decisions despite lacking complete information about …

[Book] Bandit algorithms

T Lattimore, C Szepesvári - 2020 - books.google.com
Decision-making in the face of uncertainty is a significant challenge in machine learning,
and the multi-armed bandit model is a commonly used framework to address it. This …
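A classical index rule covered in this literature is UCB1, which adds an optimism bonus to each arm's empirical mean; in standard notation (a sketch, with \hat{\mu}_a(t) the empirical mean and N_a(t) the pull count of arm a after t rounds):

    A_{t+1} = \arg\max_a \left( \hat{\mu}_a(t) + \sqrt{\frac{2 \log t}{N_a(t)}} \right)

The square-root bonus shrinks as an arm is pulled more often, so under-explored arms are periodically retried.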

Is RLHF more difficult than standard RL? A theoretical perspective

Y Wang, Q Liu, C Jin - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Reinforcement Learning from Human Feedback (RLHF) learns from preference
signals, while standard Reinforcement Learning (RL) directly learns from reward signals …
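One common way such preference signals are modeled in this line of work is a Bradley-Terry comparison over trajectories, written schematically (an illustrative assumption, not necessarily the exact model of the paper):

    \Pr(\tau^1 \succ \tau^2) = \sigma\big( r(\tau^1) - r(\tau^2) \big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},

so the learner observes binary comparisons rather than the underlying reward r itself.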

FLAMBE: Structural complexity and representation learning of low rank MDPs

A Agarwal, S Kakade… - Advances in neural …, 2020 - proceedings.neurips.cc
In order to deal with the curse of dimensionality in reinforcement learning (RL), it is common
practice to make parametric assumptions where values or policies are functions of some low …
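The low-rank structure referenced here factorizes the transition kernel through a d-dimensional embedding, in the notation standard to this literature:

    P(s' \mid s, a) = \langle \phi^\star(s,a), \mu^\star(s') \rangle, \qquad \phi^\star(s,a),\ \mu^\star(s') \in \mathbb{R}^d,

where both feature maps are unknown, so the learner must recover a representation \phi as well as plan with it.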

Contextual decision processes with low Bellman rank are PAC-learnable

N Jiang, A Krishnamurthy, A Agarwal… - International …, 2017 - proceedings.mlr.press
This paper studies systematic exploration for reinforcement learning (RL) with rich
observations and function approximation. We introduce contextual decision processes …

Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint

W Xiong, H Dong, C Ye, Z Wang, H Zhong… - … on Machine Learning, 2024 - openreview.net
This paper studies the theoretical framework of the alignment process of generative models
with Reinforcement Learning from Human Feedback (RLHF). We consider a standard …
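The KL-constraint in the title refers to the standard regularized alignment objective, written schematically (symbols assumed here: prompt distribution d_0, reward r, reference policy \pi_0, regularization weight \beta):

    \max_\pi \ \mathbb{E}_{x \sim d_0,\, y \sim \pi(\cdot \mid x)}\big[ r(x,y) \big] \;-\; \beta\, \mathbb{E}_{x \sim d_0}\Big[ \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x) \big) \Big],

which trades off reward maximization against staying close to the reference model.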