[BOOK][B] Bandit algorithms
T Lattimore, C Szepesvári - 2020 - books.google.com
Decision-making in the face of uncertainty is a significant challenge in machine learning,
and the multi-armed bandit model is a commonly used framework to address it. This …
Exploration-exploitation in constrained MDPs
In many sequential decision-making problems, the goal is to optimize a utility function while
satisfying a set of constraints on different utilities. This learning problem is formalized …
Learning policies with zero or bounded constraint violation for constrained MDPs
We address the issue of safety in reinforcement learning. We pose the problem in an
episodic framework of a constrained Markov decision process. Existing results have shown …
Mostly exploration-free algorithms for contextual bandits
The contextual bandit literature has traditionally focused on algorithms that address the
exploration–exploitation tradeoff. In particular, greedy algorithms that exploit current …
Linear stochastic bandits under safety constraints
S Amani, M Alizadeh… - Advances in Neural …, 2019 - proceedings.neurips.cc
Bandit algorithms have various applications in safety-critical systems, where it is important to
respect the system constraints that rely on the bandit's unknown parameters at every round …
Stochastic bandits with linear constraints
A Pacchiano, M Ghavamzadeh… - International …, 2021 - proceedings.mlr.press
We study a constrained contextual linear bandit setting, where the goal of the agent is to
produce a sequence of policies, whose expected cumulative reward over the course of …
Regret minimization with performative feedback
M Jagadeesan, T Zrnic… - … on Machine Learning, 2022 - proceedings.mlr.press
In performative prediction, the deployment of a predictive model triggers a shift in the data
distribution. As these shifts are typically unknown ahead of time, the learner needs to deploy …
On kernelized multi-armed bandits with constraints
We study a stochastic bandit problem with a general unknown reward function and a
general unknown constraint function. Both functions can be non-linear (even non-convex) …
An efficient pessimistic-optimistic algorithm for stochastic linear bandits with general constraints
This paper considers stochastic linear bandits with general nonlinear constraints. The
objective is to maximize the expected cumulative reward over horizon $T$ subject to a set …
Offline contextual bandits with high probability fairness guarantees
We present RobinHood, an offline contextual bandit algorithm designed to satisfy a broad
family of fairness constraints. Our algorithm accepts multiple fairness definitions and allows …