Learning adversarial Markov decision processes with bandit feedback and unknown transition

C Jin, T Jin, H Luo, S Sra, T Yu - International Conference on …, 2020 - proceedings.mlr.press
We consider the task of learning in episodic finite-horizon Markov decision processes with
an unknown transition function, bandit feedback, and adversarial losses. We propose an …

A simple and provably efficient algorithm for asynchronous federated contextual linear bandits

J He, T Wang, Y Min, Q Gu - Advances in neural information …, 2022 - proceedings.neurips.cc
We study federated contextual linear bandits, where $M$ agents cooperate with each other
to solve a global contextual linear bandit problem with the help of a central server. We …

Linear bandits with limited adaptivity and learning distributional optimal design

Y Ruan, J Yang, Y Zhou - Proceedings of the 53rd Annual ACM SIGACT …, 2021 - dl.acm.org
Motivated by practical needs such as large-scale learning, we study the impact of adaptivity
constraints to linear contextual bandits, a central problem in online learning and decision …

Provably efficient reinforcement learning with linear function approximation under adaptivity constraints

T Wang, D Zhou, Q Gu - Advances in Neural Information …, 2021 - proceedings.neurips.cc
We study reinforcement learning (RL) with linear function approximation under the adaptivity
constraint. We consider two popular limited adaptivity models: the batch learning model and …

Non-stationary online learning with memory and non-stochastic control

P Zhao, YH Yan, YX Wang, ZH Zhou - The Journal of Machine Learning …, 2023 - dl.acm.org
We study the problem of Online Convex Optimization (OCO) with memory, which allows loss
functions to depend on past decisions and thus captures temporal effects of learning …

Recent advances in multiarmed bandits for sequential decision making

S Agrawal - Operations Research & Management Science in …, 2019 - pubsonline.informs.org
Reinforcement learning (RL) is a very general framework for making sequential decisions
when the underlying system dynamics are a priori unknown. RL algorithms use the …

Follow-the-perturbed-leader for adversarial Markov decision processes with bandit feedback

Y Dai, H Luo, L Chen - Advances in Neural Information …, 2022 - proceedings.neurips.cc
We consider regret minimization for Adversarial Markov Decision Processes (AMDPs),
where the loss functions are changing over time and adversarially chosen, and the learner …

Online learning for adversaries with memory: price of past mistakes

O Anava, E Hazan, S Mannor - Advances in Neural …, 2015 - proceedings.neurips.cc
The framework of online learning with memory naturally captures learning problems with
temporal effects, and was previously studied for the experts setting. In this work we extend …

Non-stochastic multi-player multi-armed bandits: Optimal rate with collision information, sublinear without

S Bubeck, Y Li, Y Peres… - Conference on Learning …, 2020 - proceedings.mlr.press
We consider the non-stochastic version of the (cooperative) multi-player multi-armed bandit
problem. The model assumes no communication and no shared randomness at all between …

Online switching control with stability and regret guarantees

Y Li, JA Preiss, N Li, Y Lin… - … for Dynamics and …, 2023 - proceedings.mlr.press
This paper considers online switching control with a finite candidate controller pool, an
unknown dynamical system, and unknown cost functions. The candidate controllers can be …