Doubly robust policy evaluation and optimization
We study sequential decision making in environments where rewards are only partially
observed, but can be modeled as a function of observed contexts and the chosen action by …
A practical guide of off-policy evaluation for bandit problems
Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from
samples obtained via different policies. Recently, applying OPE methods for bandit …
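The basic OPE setup the snippet describes — estimating a target policy's value from data logged under a different policy — is most simply handled by inverse propensity scoring (IPS). A minimal sketch (the function name and the toy data are illustrative, not from any of the listed papers):

```python
import numpy as np

def ips_estimate(rewards, target_probs, behavior_probs):
    """Inverse propensity scoring (IPS) estimate of a target policy's value.

    rewards[i]        -- reward observed for the logged action a_i
    target_probs[i]   -- pi_target(a_i | x_i), target policy's prob. of that action
    behavior_probs[i] -- pi_b(a_i | x_i), logging policy's prob. of that action
    """
    weights = np.asarray(target_probs) / np.asarray(behavior_probs)
    return float(np.mean(weights * np.asarray(rewards)))

# Toy logged data: three rounds of a two-armed bandit.
rewards = [1.0, 0.0, 1.0]
behavior_probs = [0.5, 0.5, 0.5]  # uniform logging policy
target_probs = [0.9, 0.1, 0.9]   # target policy favors the rewarded arm
print(ips_estimate(rewards, target_probs, behavior_probs))  # → 1.2
```

The reweighting makes the logged sample look as if it had been drawn under the target policy; its high variance when the policies disagree is what motivates the doubly robust and smoothed variants in the other entries.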
A novel evaluation methodology for assessing off-policy learning methods in contextual bandits
N Hassanpour, R Greiner - … on Artificial Intelligence, Canadian AI 2018 …, 2018 - Springer
We propose a novel evaluation methodology for assessing off-policy learning methods in
contextual bandits. In particular, we provide a way to use data from any given Randomized …
More robust doubly robust off-policy evaluation
M Farajtabar, Y Chow… - … on Machine Learning, 2018 - proceedings.mlr.press
We study the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where
the goal is to estimate the performance of a policy from the data generated by another policy …
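The doubly robust estimator combines a reward model with an IPS correction, so the estimate stays consistent if either the reward model or the propensities are accurate. A hedged sketch of the contextual-bandit version (all names and numbers are illustrative; `q_hat` stands for a fitted reward model):

```python
import numpy as np

def dr_estimate(rewards, logged_q, baseline_values, target_probs, behavior_probs):
    """Doubly robust (DR) estimate of a target policy's value.

    logged_q[i]        -- q_hat(x_i, a_i), reward model's prediction for the logged action
    baseline_values[i] -- E_{a ~ pi_target}[q_hat(x_i, a)], model-based value at x_i
    The importance-weighted residual debiases the model term; the estimate is
    consistent if either q_hat or the propensities are correct.
    """
    w = np.asarray(target_probs) / np.asarray(behavior_probs)
    correction = w * (np.asarray(rewards) - np.asarray(logged_q))
    return float(np.mean(np.asarray(baseline_values) + correction))

# Two logged rounds with a crude reward model.
value = dr_estimate(rewards=[1.0, 0.0],
                    logged_q=[0.8, 0.2],
                    baseline_values=[0.7, 0.3],
                    target_probs=[1.0, 0.25],
                    behavior_probs=[0.5, 0.5])
print(value)  # → 0.65
```

The "more robust" variants in the entry above tune how the two terms are combined to reduce variance further.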
Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning
This work investigates the offline formulation of the contextual bandit problem, where the
goal is to leverage past interactions collected under a behavior policy to evaluate, select …
Behaviour policy estimation in off-policy policy evaluation: Calibration matters
In this work, we consider the problem of estimating a behaviour policy for use in Off-Policy
Policy Evaluation (OPE) when the true behaviour policy is unknown. Via a series of …
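When the true behavior policy is unknown, a common plug-in approach is to estimate it from the logged (context, action) pairs and feed the estimated propensities into an IPS-style estimator; the calibration of those estimated probabilities is exactly what this entry examines. A minimal empirical-frequency sketch (discrete contexts assumed for illustration):

```python
from collections import Counter, defaultdict

def estimate_behavior_policy(contexts, actions):
    """Empirical plug-in estimate of pi_b(a | x) from logged pairs.

    Returns {context: {action: relative frequency}}. These probabilities
    would sit in the IPS denominator, so miscalibration here propagates
    directly into the value estimate.
    """
    counts = defaultdict(Counter)
    for x, a in zip(contexts, actions):
        counts[x][a] += 1
    return {x: {a: c / sum(cnt.values()) for a, c in cnt.items()}
            for x, cnt in counts.items()}

contexts = ["u1", "u1", "u2", "u1"]
actions = [0, 1, 0, 0]
pi_b_hat = estimate_behavior_policy(contexts, actions)
print(pi_b_hat["u1"][0])  # → 0.666...
```

With continuous contexts one would fit a probabilistic classifier instead; the entry's point is that such models should be judged by calibration, not accuracy, for OPE use.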
Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits
Abstract Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing new
policies using existing data without costly experimentation. However, current OPE methods …
Safe policy learning through extrapolation: Application to pre-trial risk assessment
Algorithmic recommendations and decisions have become ubiquitous in today's society.
Many of these and other data-driven policies, especially in the realm of public policy, are …
Off-policy evaluation of bandit algorithm from dependent samples under batch update policy
M Kato, Y Kaneko - arXiv preprint arXiv:2010.13554, 2020 - arxiv.org
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data
obtained via a behavior policy. However, because the contextual bandit algorithm updates …
Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning
Off-policy evaluation (OPE) in both contextual bandits and reinforcement learning allows
one to evaluate novel decision policies without needing to conduct exploration, which is …