Offline primal-dual reinforcement learning for linear MDPs
Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed
dataset of transitions collected by another policy. This problem has attracted a lot of attention …
Importance-weighted offline learning done right
We study the problem of offline policy optimization in stochastic contextual bandit problems,
where the goal is to learn a near-optimal policy based on a dataset of decision data …
Pure Exploration under Mediators' Feedback
Stochastic multi-armed bandits are a sequential-decision-making framework, where, at each
interaction step, the learner selects an arm and observes a stochastic reward. Within the …
Online Learning with Off-Policy Feedback in Adversarial MDPs
In this paper, we face the challenge of online learning in adversarial Markov decision
processes with off-policy feedback. In this setting, the learner chooses a policy, but …