Doubly robust policy evaluation and optimization

M Dudík, D Erhan, J Langford, L Li - 2014 - projecteuclid.org
We study sequential decision making in environments where rewards are only partially
observed, but can be modeled as a function of observed contexts and the chosen action by …
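
For reference, the doubly robust estimator studied in this line of work combines a regression (direct-method) estimate with an importance-weighted correction. A minimal sketch for a deterministic target policy, with illustrative function names (not code from the paper):

    import numpy as np

    def dr_value(contexts, actions, rewards, propensities, target_policy, reward_model):
        """Doubly robust off-policy value estimate for a deterministic target policy."""
        pi_actions = np.array([target_policy(x) for x in contexts])
        # Direct-method term: the reward model evaluated at the target policy's actions.
        dm = np.array([reward_model(x, a) for x, a in zip(contexts, pi_actions)])
        # Importance-weighted correction, applied only where the logged action
        # matches the target action; propensities are p(a_i | x_i) under logging.
        match = (np.asarray(actions) == pi_actions).astype(float)
        model_at_logged = np.array([reward_model(x, a) for x, a in zip(contexts, actions)])
        correction = match / np.asarray(propensities) * (np.asarray(rewards) - model_at_logged)
        return float(np.mean(dm + correction))

The estimate is consistent if either the reward model or the propensities are accurate, which is the "doubly robust" property.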

A practical guide of off-policy evaluation for bandit problems

M Kato, K Abe, K Ariu, S Yasui - arXiv preprint arXiv:2010.12470, 2020 - arxiv.org
Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from
samples obtained via different policies. Recently, applying OPE methods for bandit …
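
The baseline such guides typically start from is inverse propensity scoring (IPS). A generic sketch, assuming known behavior propensities (illustrative, not code from the paper):

    import numpy as np

    def ips_value(rewards, behavior_probs, target_probs):
        # Reweight each logged reward by the ratio of the target policy's
        # probability to the behavior policy's probability for the logged action.
        w = np.asarray(target_probs) / np.asarray(behavior_probs)
        return float(np.mean(w * np.asarray(rewards)))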

A novel evaluation methodology for assessing off-policy learning methods in contextual bandits

N Hassanpour, R Greiner - Canadian Conference on Artificial Intelligence, Canadian AI 2018, 2018 - Springer
We propose a novel evaluation methodology for assessing off-policy learning methods in
contextual bandits. In particular, we provide a way to use data from any given Randomized …

More robust doubly robust off-policy evaluation

M Farajtabar, Y Chow… - International Conference on Machine Learning, 2018 - proceedings.mlr.press
We study the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where
the goal is to estimate the performance of a policy from the data generated by another policy …

Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning

O Sakhi, I Aouali, P Alquier, N Chopin - arXiv preprint arXiv:2405.14335, 2024 - arxiv.org
This work investigates the offline formulation of the contextual bandit problem, where the
goal is to leverage past interactions collected under a behavior policy to evaluate, select …
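
As a rough illustration of what "pessimistic" means here, a common device is to shrink large importance weights so the estimate becomes a conservative lower bound when rewards are nonnegative. The sketch below uses simple weight clipping, a cruder stand-in for the paper's logarithmic smoothing:

    import numpy as np

    def clipped_ips_value(rewards, behavior_probs, target_probs, clip=10.0):
        # Capping the weights can only lower the estimate when rewards are
        # nonnegative, trading variance for a pessimistic (downward) bias.
        w = np.minimum(np.asarray(target_probs) / np.asarray(behavior_probs), clip)
        return float(np.mean(w * np.asarray(rewards)))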

Behaviour policy estimation in off-policy policy evaluation: Calibration matters

A Raghu, O Gottesman, Y Liu, M Komorowski… - arXiv preprint arXiv …, 2018 - arxiv.org
In this work, we consider the problem of estimating a behaviour policy for use in Off-Policy
Policy Evaluation (OPE) when the true behaviour policy is unknown. Via a series of …
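
The setting is the common one where propensities must themselves be fit from the log before importance weighting. A minimal sketch of that pipeline using scikit-learn (illustrative; the paper's point is that the calibration of these estimates matters for downstream OPE):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def estimated_propensities(contexts, actions):
        # Fit a model of p(a | x) on the logged data, then read off the
        # predicted probability of each action that was actually taken.
        clf = LogisticRegression(max_iter=1000).fit(contexts, actions)
        probs = clf.predict_proba(contexts)
        cols = np.searchsorted(clf.classes_, actions)
        return probs[np.arange(len(actions)), cols]

These plug-in propensities then replace the true behavior probabilities in an estimator such as IPS.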

Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits

MF Taufiq, A Doucet, R Cornish… - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing new
policies using existing data without costly experimentation. However, current OPE methods …

Safe policy learning through extrapolation: Application to pre-trial risk assessment

E Ben-Michael, DJ Greiner, K Imai, Z Jiang - arXiv preprint arXiv …, 2021 - arxiv.org
Algorithmic recommendations and decisions have become ubiquitous in today's society.
Many of these and other data-driven policies, especially in the realm of public policy, are …

Off-policy evaluation of bandit algorithm from dependent samples under batch update policy

M Kato, Y Kaneko - arXiv preprint arXiv:2010.13554, 2020 - arxiv.org
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data
obtained via a behavior policy. However, because the contextual bandit algorithm updates …

Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning

N Kallus, M Uehara - Advances in Neural Information Processing Systems, 2019 - proceedings.neurips.cc
Off-policy evaluation (OPE) in both contextual bandits and reinforcement learning allows
one to evaluate novel decision policies without needing to conduct exploration, which is …
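
A classic example of the boundedness property named in the title is the self-normalized (weighted) importance sampling estimator, sketched below as an illustration of the property rather than as the authors' proposal:

    import numpy as np

    def snips_value(rewards, behavior_probs, target_probs):
        # Dividing by the sum of weights keeps the estimate within the range
        # of the observed rewards, unlike plain IPS, which can blow up.
        w = np.asarray(target_probs) / np.asarray(behavior_probs)
        return float(np.sum(w * np.asarray(rewards)) / np.sum(w))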