Posterior sampling with delayed feedback for reinforcement learning with linear function approximation

NL Kuang, M Yin, M Wang… - Advances in Neural …, 2023 - proceedings.neurips.cc
Recent studies in reinforcement learning (RL) have made significant progress by leveraging
function approximation to alleviate the sample complexity hurdle for better performance …

Cooperative stochastic bandits with asynchronous agents and constrained feedback

L Yang, YZJ Chen, S Pasteris… - Advances in …, 2021 - proceedings.neurips.cc
This paper studies a cooperative multi-armed bandit problem with $M$ agents cooperating
together to solve the same instance of a $K$-armed stochastic bandit problem with the goal …

Near-optimal regret for adversarial MDP with delayed bandit feedback

T Jin, T Lancewicki, H Luo… - Advances in Neural …, 2022 - proceedings.neurips.cc
The standard assumption in reinforcement learning (RL) is that agents observe feedback for
their actions immediately. However, in practice feedback is often observed in delay. This …

Banker online mirror descent: A universal approach for delayed online bandit learning

J Huang, Y Dai, L Huang - International Conference on …, 2023 - proceedings.mlr.press
We propose Banker Online Mirror Descent (Banker-OMD), a novel framework
generalizing the classical Online Mirror Descent (OMD) technique in the online learning …

Delay-adapted policy optimization and improved regret for adversarial MDP with delayed bandit feedback

T Lancewicki, A Rosenberg… - … Conference on Machine …, 2023 - proceedings.mlr.press
Policy Optimization (PO) is one of the most popular methods in Reinforcement Learning
(RL). Thus, theoretical guarantees for PO algorithms have become especially important to …

A reduction-based framework for sequential decision making with delayed feedback

Y Yang, H Zhong, T Wu, B Liu… - Advances in Neural …, 2024 - proceedings.neurips.cc
We study stochastic delayed feedback in general single-agent and multi-agent sequential
decision making, which includes bandits, single-agent Markov decision processes (MDPs) …

Stochastic contextual bandits with long horizon rewards

Y Qin, Y Li, F Pasqualetti, M Fazel… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
The growing interest in complex decision-making and language modeling problems
highlights the importance of sample-efficient learning over very long horizons. This work …

Tight first- and second-order regret bounds for adversarial linear bandits

S Ito, S Hirahara, T Soma… - Advances in Neural …, 2020 - proceedings.neurips.cc
We propose novel algorithms with first- and second-order regret bounds for adversarial
linear bandits. These regret bounds imply that our algorithms perform well when there is an …

Dynamical linear bandits

M Mussi, AM Metelli, M Restelli - … Conference on Machine …, 2023 - proceedings.mlr.press
In many real-world sequential decision-making problems, an action does not immediately
reflect on the feedback and spreads its effects over a long time frame. For instance, in online …

A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs

D van der Hoeven, L Zierahn… - The Thirty Sixth …, 2023 - proceedings.mlr.press
We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with
delayed bandit feedback. By separating the cost of delayed feedback from that of bandit …