Posterior sampling with delayed feedback for reinforcement learning with linear function approximation
Recent studies in reinforcement learning (RL) have made significant progress by leveraging
function approximation to alleviate the sample complexity hurdle for better performance …
Cooperative stochastic bandits with asynchronous agents and constrained feedback
This paper studies a cooperative multi-armed bandit problem with $ M $ agents cooperating
together to solve the same instance of a $ K $-armed stochastic bandit problem with the goal …
Near-optimal regret for adversarial MDP with delayed bandit feedback
The standard assumption in reinforcement learning (RL) is that agents observe feedback for
their actions immediately. However, in practice feedback is often observed in delay. This …
Banker online mirror descent: A universal approach for delayed online bandit learning
We propose Banker Online Mirror Descent (Banker-OMD), a novel framework
generalizing the classical Online Mirror Descent (OMD) technique in the online learning …
Delay-adapted policy optimization and improved regret for adversarial MDP with delayed bandit feedback
T Lancewicki, A Rosenberg… - … Conference on Machine …, 2023 - proceedings.mlr.press
Policy Optimization (PO) is one of the most popular methods in Reinforcement Learning
(RL). Thus, theoretical guarantees for PO algorithms have become especially important to …
A reduction-based framework for sequential decision making with delayed feedback
We study stochastic delayed feedback in general single-agent and multi-agent sequential
decision making, which includes bandits, single-agent Markov decision processes (MDPs) …
Stochastic contextual bandits with long horizon rewards
The growing interest in complex decision-making and language modeling problems
highlights the importance of sample-efficient learning over very long horizons. This work …
Tight first- and second-order regret bounds for adversarial linear bandits
We propose novel algorithms with first- and second-order regret bounds for adversarial
linear bandits. These regret bounds imply that our algorithms perform well when there is an …
Dynamical linear bandits
In many real-world sequential decision-making problems, an action does not immediately
reflect on the feedback and spreads its effects over a long time frame. For instance, in online …
A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs
D van der Hoeven, L Zierahn… - The Thirty Sixth …, 2023 - proceedings.mlr.press
We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with
delayed bandit feedback. By separating the cost of delayed feedback from that of bandit …