When is agnostic reinforcement learning statistically tractable?

Z Jia, G Li, A Rakhlin, A Sekhari… - Advances in Neural …, 2024 - proceedings.neurips.cc
We study the problem of agnostic PAC reinforcement learning (RL): given a policy class $\Pi$,
how many rounds of interaction with an unknown MDP (with a potentially large state and …

The Value of Reward Lookahead in Reinforcement Learning

N Merlis, D Baudry, V Perchet - arXiv preprint arXiv:2403.11637, 2024 - arxiv.org
In reinforcement learning (RL), agents sequentially interact with changing environments
while aiming to maximize the obtained rewards. Usually, rewards are observed only after …

Towards instance-optimality in online PAC reinforcement learning

A Al-Marjani, A Tirinzoni, E Kaufmann - arXiv preprint arXiv:2311.05638, 2023 - arxiv.org
Several recent works have proposed instance-dependent upper bounds on the number of
episodes needed to identify, with probability $1-\delta$, an $\varepsilon$-optimal policy in …

RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation

J Kwon, S Mannor, C Caramanis, Y Efroni - arXiv preprint arXiv …, 2024 - arxiv.org
In many real-world decision problems there is partially observed, hidden or latent
information that remains fixed throughout an interaction. Such decision problems can be …

Offline Contextual Bandit: Theory and Large Scale Applications

O Sakhi - 2023 - theses.hal.science
This thesis presents contributions to the problem of learning from logged interactions using
the offline contextual bandit framework. We are interested in two related topics: (1) offline …

The impact of data distribution on Q-learning with function approximation

PP Santos, DS Carvalho, A Sardinha, FS Melo - Machine Learning, 2024 - Springer
We study the interplay between the data distribution and Q-learning-based algorithms with
function approximation. We provide a unified theoretical and empirical analysis as to how …

Adaptive Pure Exploration in Markov Decision Processes and Bandits

A Al Marjani - 2023 - theses.hal.science
This thesis studies pure exploration problems in Markov Decision Processes (MDP) and
Multi-Armed Bandits. These problems have mainly been studied in a “worst-case” …