Delay and cooperation in nonstochastic linear bandits

Posterior sampling with delayed feedback for reinforcement learning with linear function approximation

NL Kuang, M Yin, M Wang… - Advances in Neural …, 2023 - proceedings.neurips.cc

Recent studies in reinforcement learning (RL) have made significant progress by leveraging
function approximation to alleviate the sample complexity hurdle for better performance …

Cited by 7 · All 5 versions

Cooperative stochastic bandits with asynchronous agents and constrained feedback

L Yang, YZJ Chen, S Pasteris… - Advances in …, 2021 - proceedings.neurips.cc

This paper studies a cooperative multi-armed bandit problem with $M$ agents cooperating
together to solve the same instance of a $K$-armed stochastic bandit problem with the goal …

Cited by 31 · All 13 versions

Near-optimal regret for adversarial MDP with delayed bandit feedback

T Jin, T Lancewicki, H Luo… - Advances in Neural …, 2022 - proceedings.neurips.cc

The standard assumption in reinforcement learning (RL) is that agents observe feedback for
their actions immediately. However, in practice feedback is often observed in delay. This …

Cited by 26 · All 8 versions

Banker online mirror descent: A universal approach for delayed online bandit learning

J Huang, Y Dai, L Huang - International Conference on …, 2023 - proceedings.mlr.press

We propose Banker Online Mirror Descent (Banker-OMD), a novel framework
generalizing the classical Online Mirror Descent (OMD) technique in the online learning …

Cited by 5 · All 6 versions

Delay-adapted policy optimization and improved regret for adversarial MDP with delayed bandit feedback

T Lancewicki, A Rosenberg… - … Conference on Machine …, 2023 - proceedings.mlr.press

Policy Optimization (PO) is one of the most popular methods in Reinforcement Learning
(RL). Thus, theoretical guarantees for PO algorithms have become especially important to …

Cited by 4 · All 7 versions

A reduction-based framework for sequential decision making with delayed feedback

Y Yang, H Zhong, T Wu, B Liu… - Advances in Neural …, 2024 - proceedings.neurips.cc

We study stochastic delayed feedback in general single-agent and multi-agent sequential
decision making, which includes bandits, single-agent Markov decision processes (MDPs) …

Cited by 7 · All 5 versions

Stochastic contextual bandits with long horizon rewards

Y Qin, Y Li, F Pasqualetti, M Fazel… - Proceedings of the AAAI …, 2023 - ojs.aaai.org

The growing interest in complex decision-making and language modeling problems
highlights the importance of sample-efficient learning over very long horizons. This work …

Cited by 4 · All 7 versions

Tight first- and second-order regret bounds for adversarial linear bandits

S Ito, S Hirahara, T Soma… - Advances in Neural …, 2020 - proceedings.neurips.cc

We propose novel algorithms with first- and second-order regret bounds for adversarial
linear bandits. These regret bounds imply that our algorithms perform well when there is an …

Cited by 17 · All 4 versions

Dynamical linear bandits

M Mussi, AM Metelli, M Restelli - … Conference on Machine …, 2023 - proceedings.mlr.press

In many real-world sequential decision-making problems, an action does not immediately
reflect on the feedback and spreads its effects over a long time frame. For instance, in online …

Cited by 6 · All 11 versions

A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs

D van der Hoeven, L Zierahn… - The Thirty Sixth …, 2023 - proceedings.mlr.press

We derive a new analysis of Follow The Regularized Leader (FTRL) for online learning with
delayed bandit feedback. By separating the cost of delayed feedback from that of bandit …

Cited by 3 · All 6 versions