Learning adversarial Markov decision processes with bandit feedback and unknown transition
We consider the task of learning in episodic finite-horizon Markov decision processes with
an unknown transition function, bandit feedback, and adversarial losses. We propose an …
A simple and provably efficient algorithm for asynchronous federated contextual linear bandits
We study federated contextual linear bandits, where $ M $ agents cooperate with each other
to solve a global contextual linear bandit problem with the help of a central server. We …
Linear bandits with limited adaptivity and learning distributional optimal design
Motivated by practical needs such as large-scale learning, we study the impact of adaptivity
constraints to linear contextual bandits, a central problem in online learning and decision …
Provably efficient reinforcement learning with linear function approximation under adaptivity constraints
We study reinforcement learning (RL) with linear function approximation under the adaptivity
constraint. We consider two popular limited adaptivity models: the batch learning model and …
Non-stationary online learning with memory and non-stochastic control
We study the problem of Online Convex Optimization (OCO) with memory, which allows loss
functions to depend on past decisions and thus captures temporal effects of learning …
Recent advances in multiarmed bandits for sequential decision making
S Agrawal - Operations Research & Management Science in …, 2019 - pubsonline.informs.org
Reinforcement learning (RL) is a very general framework for making sequential decisions
when the underlying system dynamics are a priori unknown. RL algorithms use the …
Follow-the-perturbed-leader for adversarial Markov decision processes with bandit feedback
We consider regret minimization for Adversarial Markov Decision Processes (AMDPs),
where the loss functions are changing over time and adversarially chosen, and the learner …
Online learning for adversaries with memory: price of past mistakes
The framework of online learning with memory naturally captures learning problems with
temporal effects, and was previously studied for the experts setting. In this work we extend …
Non-stochastic multi-player multi-armed bandits: Optimal rate with collision information, sublinear without
We consider the non-stochastic version of the (cooperative) multi-player multi-armed bandit
problem. The model assumes no communication and no shared randomness at all between …
Online switching control with stability and regret guarantees
This paper considers online switching control with a finite candidate controller pool, an
unknown dynamical system, and unknown cost functions. The candidate controllers can be …