DOPE: Doubly optimistic and pessimistic exploration for safe reinforcement learning
A Bura, A HasanzadeZonuzy… - Advances in neural …, 2022 - proceedings.neurips.cc
Safe reinforcement learning is extremely challenging--not only must the agent explore an
unknown environment, it must do so while ensuring no safety constraint violations. We …
Triple-Q: A model-free algorithm for constrained reinforcement learning with sublinear regret and zero constraint violation
This paper presents the first model-free, simulator-free reinforcement learning algorithm for
Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint …
A provably-efficient model-free algorithm for infinite-horizon average-reward constrained Markov decision processes
This paper presents a model-free reinforcement learning (RL) algorithm for infinite-horizon
average-reward Constrained Markov Decision Processes (CMDPs). Considering a learning …
On kernelized multi-armed bandits with constraints
We study a stochastic bandit problem with a general unknown reward function and a
general unknown constraint function. Both functions can be non-linear (even non-convex) …
Combinatorial bandits with linear constraints: Beyond knapsacks and fairness
This paper proposes and studies for the first time the problem of combinatorial multi-armed
bandits with linear long-term constraints. Our model generalizes and unifies several …
Learning to schedule tasks with deadline and throughput constraints
We consider the task scheduling scenario where the controller activates one from K task
types at each time. Each task induces a random completion time, and a reward is obtained …
Learning while scheduling in multi-server systems with unknown statistics: MaxWeight with discounted UCB
Multi-server queueing systems are widely used models for job scheduling in machine
learning, wireless networks, and crowdsourcing. This paper considers a multi-server system …
Learning-based scheduling for information gathering with QoS constraints
The problem of scheduling packets from multiple sources over unreliable channels has
attracted much attention due to its great practicability in the Internet of things systems. Most …
Safe learning in tree-form sequential decision making: Handling hard and soft constraints
M Bernasconi, F Cacciamani… - International …, 2022 - proceedings.mlr.press
We study decision making problems in which an agent sequentially interacts with a
stochastic environment defined by means of a tree structure. The agent repeatedly faces the …
Optimization of offloading policies for accuracy-delay tradeoffs in hierarchical inference
We consider a hierarchical inference system with multiple clients connected to a server via a
shared communication resource. When necessary, clients with low-accuracy machine …