Pessimistic Q-learning for offline reinforcement learning: Towards optimal sample complexity
Offline or batch reinforcement learning seeks to learn a near-optimal policy using historical
data without active exploration of the environment. To counter the insufficient coverage and …
Settling the sample complexity of model-based offline reinforcement learning
Annals of Statistics 2024, Vol. 52, No. 1, 233–260 https://doi.org/10.1214/23-AOS2342 © …
Corruption-robust offline reinforcement learning with general function approximation
We investigate the problem of corruption robustness in offline reinforcement learning (RL)
with general function approximation, where an adversary can corrupt each sample in the …
Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism
Offline reinforcement learning, which seeks to utilize offline/historical data to optimize
sequential decision-making strategies, has gained surging prominence in recent studies …
Nearly minimax optimal offline reinforcement learning with linear function approximation: Single-agent MDP and Markov game
Offline reinforcement learning (RL) aims at learning an optimal strategy using a pre-
collected dataset without further interactions with the environment. While various algorithms …
Pessimistic minimax value iteration: Provably efficient equilibrium learning from offline datasets
We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the
goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset …
The efficacy of pessimism in asynchronous Q-learning
This paper is concerned with the asynchronous form of Q-learning, which applies a
stochastic approximation scheme to Markovian data samples. Motivated by the recent …
Reinforcement learning in low-rank MDPs with density features
MDPs with low-rank transitions (that is, the transition matrix can be factored into the product
of two matrices, left and right) form a highly representative structure that enables tractable …
Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage
We study distributionally robust offline reinforcement learning (RL), which seeks to find an
optimal robust policy purely from an offline dataset that can perform well in perturbed …
When are offline two-player zero-sum Markov games solvable?
We study what dataset assumption permits solving offline two-player zero-sum Markov
games. In stark contrast to the offline single-agent Markov decision process, we show that …