A model selection approach for corruption robust reinforcement learning
We develop a model selection approach to tackle reinforcement learning with adversarial
corruption in both transition and reward. For finite-horizon tabular MDPs, without prior …
Refined regret for adversarial MDPs with linear function approximation
We consider learning in an adversarial Markov Decision Process (MDP) where the loss
functions can change arbitrarily over $K$ episodes and the state space can be arbitrarily …
Improved best-of-both-worlds guarantees for multi-armed bandits: FTRL with general regularizers and multiple optimal arms
We study the problem of designing adaptive multi-armed bandit algorithms that perform
optimally in both the stochastic setting and the adversarial setting simultaneously (often …
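As a point of reference (a standard formalization in this literature, not a quote of this paper's specific bounds), a best-of-both-worlds guarantee for a $K$-armed bandit over $T$ rounds asks for a single algorithm achieving, up to logarithmic factors,
$$\mathrm{Regret}_T = \mathcal{O}\big(\sqrt{KT}\big) \ \text{in the adversarial regime}, \qquad \mathrm{Regret}_T = \mathcal{O}\Big(\sum_{i:\,\Delta_i>0} \tfrac{\log T}{\Delta_i}\Big) \ \text{in the stochastic regime},$$
where $\Delta_i$ denotes the suboptimality gap of arm $i$. The FTRL-based algorithms in this line of work obtain both guarantees without knowing in advance which regime they face.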
Nearly optimal best-of-both-worlds algorithms for online learning with feedback graphs
This study considers online learning with general directed feedback graphs. For this
problem, we present best-of-both-worlds algorithms that achieve nearly tight regret bounds …
A blackbox approach to best of both worlds in bandits and beyond
Best-of-both-worlds algorithms for online learning which achieve near-optimal regret in both
the adversarial and the stochastic regimes have received growing attention recently …
Best of both worlds policy optimization
Policy optimization methods are popular reinforcement learning algorithms in practice and
recent works have built theoretical foundations for them by proving $\sqrt{T}$ regret bounds …
The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition
We consider the best-of-both-worlds problem for learning an episodic Markov Decision
Process through $T$ episodes, with the goal of achieving $\widetilde{\mathcal{O}}(\sqrt …
Stability-penalty-adaptive follow-the-regularized-leader: Sparsity, game-dependency, and best-of-both-worlds
Adaptivity to the difficulties of a problem is a key property in sequential decision-making
problems to broaden the applicability of algorithms. Follow-the-regularized-leader (FTRL) …
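For context, the generic FTRL update (textbook form, not the specific stability-penalty-adaptive regularizer proposed in this paper) picks the next decision by minimizing cumulative estimated losses plus a regularization term:
$$x_{t+1} = \operatorname*{arg\,min}_{x \in \mathcal{X}} \Big\{ \textstyle\sum_{s=1}^{t} \langle \widehat{\ell}_s, x \rangle + \psi_t(x) \Big\},$$
where $\mathcal{X}$ is the decision set (e.g., the probability simplex over arms), $\widehat{\ell}_s$ are loss estimates, and $\psi_t$ is a possibly time-varying regularizer; adaptive variants tune the learning rate inside $\psi_t$ using observed quantities such as stability and penalty terms.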
A best-of-both-worlds algorithm for bandits with delayed feedback
S Masoudian, J Zimmert… - Advances in Neural …, 2022 - proceedings.neurips.cc
We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial
multi-armed bandits with delayed feedback, which in addition to the minimax optimal …
No-regret online reinforcement learning with adversarial losses and transitions
Existing online learning algorithms for adversarial Markov Decision Processes achieve
$\mathcal{O}(\sqrt{T})$ regret after $T$ rounds of interactions even if the loss functions …