Near-optimal collaborative learning in bandits

C Réda, S Vakili, E Kaufmann - Advances in Neural …, 2022 - proceedings.neurips.cc
This paper introduces a general multi-agent bandit model in which each agent is facing a
finite set of arms and may communicate with other agents through a central controller in …

Active coverage for PAC reinforcement learning

A Al-Marjani, A Tirinzoni… - The Thirty Sixth Annual …, 2023 - proceedings.mlr.press
Collecting and leveraging data with good coverage properties plays a crucial role in different
aspects of reinforcement learning (RL), including reward-free exploration and offline …

Information-directed selection for top-two algorithms

W You, C Qin, Z Wang, S Yang - The Thirty Sixth Annual …, 2023 - proceedings.mlr.press
We consider the best-k-arm identification problem for multi-armed bandits, where the
objective is to select the exact set of k arms with the highest mean rewards by sequentially …
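As a concrete baseline for this objective (not the information-directed selection rule studied in the paper), the following Python sketch pulls every arm equally often and returns the k arms with the highest empirical means; the arm means, Gaussian noise model, and sample budget are illustrative assumptions.

```python
import numpy as np

def best_k_arms_uniform(means, k, samples_per_arm=2000, seed=0):
    """Estimate the top-k arms by pulling every arm equally often.

    A naive illustration of the best-k-arm objective: pull each arm,
    form empirical means, and return the k arms with the largest estimates.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(means)
    # Gaussian rewards with unit variance (an assumption for illustration).
    rewards = rng.normal(loc=np.asarray(means)[:, None], scale=1.0,
                         size=(n_arms, samples_per_arm))
    empirical_means = rewards.mean(axis=1)
    # The answer is the set of k arms with the highest empirical means.
    return set(np.argsort(empirical_means)[-k:])

# Example: identify the 2 best of 5 arms.
print(best_k_arms_uniform([0.1, 0.3, 0.5, 0.7, 0.9], k=2))  # {3, 4}
```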

On elimination strategies for bandit fixed-confidence identification

A Tirinzoni, R Degenne - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Elimination algorithms for bandit identification, which prune the plausible correct answers
sequentially until only one remains, are computationally convenient since they reduce the …
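To illustrate the general elimination template (not the specific strategies compared in the paper), here is a minimal sketch of successive elimination for fixed-confidence best-arm identification; the Gaussian reward model, confidence level, and confidence-width constants are assumptions.

```python
import numpy as np

def successive_elimination(means, delta=0.05, seed=0):
    """Fixed-confidence best-arm identification by elimination.

    Keeps a set of plausible best arms, pulls each survivor once per round,
    and eliminates any arm whose upper confidence bound falls below the best
    lower confidence bound, until a single arm remains.
    """
    rng = np.random.default_rng(seed)
    n = len(means)
    active = list(range(n))
    sums = np.zeros(n)
    counts = np.zeros(n)
    t = 0
    while len(active) > 1:
        t += 1
        for a in active:
            sums[a] += rng.normal(means[a], 1.0)  # unit-variance rewards (assumed)
            counts[a] += 1
        mu_hat = sums[active] / counts[active]
        # Anytime Hoeffding-style width; the exact constant is illustrative.
        width = np.sqrt(2 * np.log(4 * n * t ** 2 / delta) / counts[active])
        best_lcb = np.max(mu_hat - width)
        # Prune arms that are no longer plausibly the best.
        active = [a for a, m, w in zip(active, mu_hat, width) if m + w >= best_lcb]
    return active[0]

print(successive_elimination([0.2, 0.4, 0.8]))  # typically returns 2
```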

Optimal clustering with bandit feedback

J Yang, Z Zhong, VYF Tan - Journal of Machine Learning Research, 2024 - jmlr.org
This paper considers the problem of online clustering with bandit feedback. A set of arms (or
items) can be partitioned into various groups that are unknown. Within each group, the …
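As a toy illustration of this setup (not the optimal algorithm developed in the paper), the sketch below pulls every arm uniformly and groups arms by cutting the sorted empirical means at the largest gaps; the group structure, noise model, and sample budget are assumptions.

```python
import numpy as np

def cluster_arms_by_mean(true_means, n_groups, samples_per_arm=3000, seed=0):
    """Naive clustering of arms from bandit feedback.

    Pulls every arm equally often, then groups arms whose empirical means
    are close by cutting the sorted means at the (n_groups - 1) largest gaps.
    """
    rng = np.random.default_rng(seed)
    true_means = np.asarray(true_means)
    samples = rng.normal(true_means[:, None], 1.0,
                         size=(len(true_means), samples_per_arm))
    mu_hat = samples.mean(axis=1)
    order = np.argsort(mu_hat)
    gaps = np.diff(mu_hat[order])
    # Cut at the largest gaps to form n_groups clusters of arms.
    cut_positions = set(np.argsort(gaps)[-(n_groups - 1):])
    labels = np.empty(len(true_means), dtype=int)
    group = 0
    for rank, arm in enumerate(order):
        labels[arm] = group
        if rank in cut_positions:
            group += 1
    return labels

# Arms 0-2 share one mean, arms 3-5 another (illustrative group structure).
print(cluster_arms_by_mean([0.1, 0.1, 0.1, 1.0, 1.0, 1.0], n_groups=2))
```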

Towards instance-optimality in online PAC reinforcement learning

A Al-Marjani, A Tirinzoni, E Kaufmann - arXiv preprint arXiv:2311.05638, 2023 - arxiv.org
Several recent works have proposed instance-dependent upper bounds on the number of
episodes needed to identify, with probability $1-\delta$, an $\varepsilon$-optimal policy in …

Optimal Regret Bounds for Collaborative Learning in Bandits

A Shidani, S Vakili - International Conference on Algorithmic …, 2024 - proceedings.mlr.press
We consider regret minimization in a general collaborative multi-agent multi-armed bandit
model, in which each agent faces a finite set of arms and may communicate with other …

On the complexity of representation learning in contextual linear bandits

A Tirinzoni, M Pirotta, A Lazaric - … Conference on Artificial …, 2023 - proceedings.mlr.press
In contextual linear bandits, the reward function is assumed to be a linear combination of an
unknown reward vector and a given embedding of context-arm pairs. In practice, the …
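For concreteness, the model described above posits rewards of the form r = <theta, phi(context, arm)> + noise for an unknown vector theta; the sketch below recovers theta by ridge regression from logged feature-reward pairs, with the feature dimension, regularization, and noise level chosen purely for illustration (the paper's representation-learning question is not addressed here).

```python
import numpy as np

def ridge_estimate(features, rewards, reg=1.0):
    """Estimate the unknown reward vector theta in a linear bandit model.

    Assumes observed rewards r_t = <theta, phi(x_t, a_t)> + noise and returns
    the regularized least-squares estimate
    theta_hat = (Phi^T Phi + reg * I)^{-1} Phi^T r.
    """
    phi = np.asarray(features)   # shape (n_samples, d)
    r = np.asarray(rewards)      # shape (n_samples,)
    d = phi.shape[1]
    gram = phi.T @ phi + reg * np.eye(d)
    return np.linalg.solve(gram, phi.T @ r)

# Example with a known theta and random context-arm embeddings (assumptions).
rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2, 0.3])
phi = rng.normal(size=(500, 3))
rewards = phi @ theta + 0.1 * rng.normal(size=500)
print(ridge_estimate(phi, rewards))  # close to theta
```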

Choosing Answers in Epsilon-Best-Answer Identification for Linear Bandits

M Jourdan, R Degenne - International Conference on …, 2022 - proceedings.mlr.press
In pure-exploration problems, information is gathered sequentially to answer a question on
the stochastic environment. While best-arm identification for linear bandits has been …

Dual-Directed Algorithm Design for Efficient Pure Exploration

C Qin, W You - arXiv preprint arXiv:2310.19319, 2023 - arxiv.org
We consider pure-exploration problems in the context of stochastic sequential adaptive
experiments with a finite set of alternative options. The goal of the decision-maker is to …