Offline primal-dual reinforcement learning for linear MDPs

G Gabbianelli, G Neu, M Papini… - … Conference on Artificial …, 2024 - proceedings.mlr.press
Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed
dataset of transitions collected by another policy. This problem has attracted a lot of attention …

Importance-weighted offline learning done right

G Gabbianelli, G Neu, M Papini - … Conference on Algorithmic …, 2024 - proceedings.mlr.press
We study the problem of offline policy optimization in stochastic contextual bandit problems,
where the goal is to learn a near-optimal policy based on a dataset of decision data …

Pure Exploration under Mediators' Feedback

R Poiani, AM Metelli, M Restelli - arXiv preprint arXiv:2308.15552, 2023 - arxiv.org
Stochastic multi-armed bandits are a sequential-decision-making framework, where, at each
interaction step, the learner selects an arm and observes a stochastic reward. Within the …

Online Learning with Off-Policy Feedback in Adversarial MDPs

In this paper, we face the challenge of online learning in adversarial Markov decision
processes with off-policy feedback. In this setting, the learner chooses a policy, but …