Off-policy evaluation of bandit algorithm from dependent samples under batch update policy

M Kato, Y Kaneko - arXiv preprint arXiv:2010.13554, 2020 - arxiv.org
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data
obtained via a behavior policy. However, because the contextual bandit algorithm updates …
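The estimator at the heart of this setup is inverse probability weighting (IPW), which reweights logged rewards by the ratio of target to behavior propensities. A minimal self-contained sketch on synthetic logs (all names and data are illustrative, not the paper's code; the paper's point is that logs from a batch-updated bandit are dependent, which breaks the i.i.d. assumption behind this plain version):

```python
import numpy as np

def ipw_estimate(rewards, behavior_probs, target_probs):
    """IPW estimate of a target policy's value from logged bandit data.

    behavior_probs: pi_b(a_t | x_t), the logging policy's propensities
    target_probs:   pi_e(a_t | x_t), the target policy's probabilities
    """
    return np.mean((target_probs / behavior_probs) * rewards)

# Toy logs: 3 actions, uniform behavior policy, action 2 is best.
rng = np.random.default_rng(0)
n, k = 10_000, 3
actions = rng.integers(0, k, size=n)
rewards = rng.binomial(1, 0.2 + 0.2 * actions)   # mean reward grows with action id
behavior_probs = np.full(n, 1.0 / k)             # uniform logging policy
target = np.array([0.1, 0.1, 0.8])               # target policy favors action 2
print(ipw_estimate(rewards, behavior_probs, target[actions]))  # ~0.54
```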

A practical guide of off-policy evaluation for bandit problems

M Kato, K Abe, K Ariu, S Yasui - arXiv preprint arXiv:2010.12470, 2020 - arxiv.org
Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from
samples obtained via different policies. Recently, applying OPE methods for bandit …

Off-policy evaluation via adaptive weighting with data from contextual bandits

R Zhan, V Hadad, DA Hirshberg, S Athey - Proceedings of the 27th ACM …, 2021 - dl.acm.org
It has become increasingly common for data to be collected adaptively, for example using
contextual bandits. Historical data of this type can be used to evaluate other treatment …
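Zhan et al. replace the plain average of augmented IPW (AIPW) scores with an adaptively weighted one, since propensities that shrink as the bandit adapts blow up the variance. A hedged sketch of the idea (the sqrt-propensity weights are one common variance-stabilizing choice, not necessarily the paper's exact scheme):

```python
import numpy as np

def adaptively_weighted_aipw(rewards, iw, q_logged, q_target, propensities):
    """Adaptively weighted AIPW estimate for adaptively collected data.

    Per-sample AIPW score: q_target + iw * (rewards - q_logged), with
    iw = pi_e(a_t|x_t) / pi_b(a_t|x_t). Rather than a plain mean, each
    score is weighted by h_t = sqrt(pi_b(a_t|x_t)) to damp samples whose
    shrinking propensities would otherwise dominate the variance
    (illustrative weighting; the paper derives its own schemes).
    """
    scores = q_target + iw * (rewards - q_logged)
    h = np.sqrt(propensities)
    return np.sum(h * scores) / np.sum(h)
```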

Open bandit dataset and pipeline: Towards realistic and reproducible off-policy evaluation

Y Saito, S Aihara, M Matsutani, Y Narita - arXiv preprint arXiv:2008.07146, 2020 - arxiv.org
Off-policy evaluation (OPE) aims to estimate the performance of hypothetical policies using
data generated by a different policy. Because of its huge potential impact in practice, there …
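The dataset ships with the `obp` Python package, whose pipeline reduces the evaluation loop to a few lines. A hedged quickstart sketch based on the package's published examples (exact signatures and defaults may differ across versions):

```python
import numpy as np
from obp.dataset import OpenBanditDataset
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting

# Without a data path this loads the small sample bundled with the package.
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
feedback = dataset.obtain_batch_bandit_feedback()

# Evaluate a uniform-random target policy; obp expects action_dist with
# shape (n_rounds, n_actions, len_list).
action_dist = np.full(
    (feedback["n_rounds"], dataset.n_actions, dataset.len_list),
    1.0 / dataset.n_actions,
)

ope = OffPolicyEvaluation(
    bandit_feedback=feedback,
    ope_estimators=[InverseProbabilityWeighting()],
)
print(ope.estimate_policy_values(action_dist=action_dist))
```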

Adaptive estimator selection for off-policy evaluation

Y Su, P Srinath… - … Conference on Machine …, 2020 - proceedings.mlr.press
We develop a generic data-driven method for estimator selection in off-policy policy
evaluation settings. We establish a strong performance guarantee for the method, showing …
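Their selector (SLOPE) is Lepski-style: order the candidate estimators so confidence widths shrink while potential bias grows, then keep intersecting their intervals until they stop overlapping. A sketch of that selection rule (the idea only, not the paper's exact constants or validity conditions):

```python
import numpy as np

def slope_style_select(estimates, widths):
    """Pick an OPE estimator by confidence-interval intersection.

    estimates[i], widths[i]: an estimate and confidence half-width,
    ordered so widths shrink (and potential bias grows). Returns the
    index of the last estimator whose interval still intersects the
    running intersection of all earlier ones.
    """
    lo, hi = -np.inf, np.inf
    chosen = 0
    for i, (e, w) in enumerate(zip(estimates, widths)):
        lo, hi = max(lo, e - w), min(hi, e + w)
        if lo > hi:          # intervals no longer overlap: stop here
            break
        chosen = i
    return chosen

ests = [0.52, 0.50, 0.47, 0.30]
wids = [0.20, 0.10, 0.05, 0.02]
print(slope_style_select(ests, wids))  # -> 2; the last interval drifts away
```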

Asymptotically Unbiased Off-Policy Policy Evaluation when Reusing Old Data in Nonstationary Environments

V Liu, Y Chandak, P Thomas… - … Conference on Artificial …, 2023 - proceedings.mlr.press
In this work, we consider the off-policy policy evaluation problem for contextual bandits and
finite horizon reinforcement learning in the nonstationary setting. Reusing old data is critical …
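A common baseline for reusing old data under drift is to down-weight stale samples, trading bias against variance. A sketch of that heuristic for contrast (the paper's contribution is an asymptotically unbiased correction, not this exponential discounting):

```python
import numpy as np

def discounted_ipw(rewards, iw, gamma=0.99):
    """Exponentially discount older samples when averaging IPW terms.

    iw: per-sample importance weights pi_e(a_t|x_t) / pi_b(a_t|x_t),
    ordered oldest to newest. gamma < 1 shrinks the influence of stale
    data; gamma = 1 recovers the ordinary IPW average.
    """
    n = len(rewards)
    decay = gamma ** np.arange(n - 1, -1, -1)   # newest sample gets weight 1
    return np.sum(decay * iw * rewards) / np.sum(decay)
```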

Large-scale open dataset, pipeline, and benchmark for bandit algorithms

Y Saito, S Aihara, M Matsutani… - arXiv preprint arXiv …, 2020 - dynamicdecisions.github.io
We build and publicize the Open Bandit Dataset to facilitate scalable and reproducible
research on bandit algorithms. It is especially suitable for off-policy evaluation (OPE), which …

Improved estimator selection for off-policy evaluation

G Tucker, J Lee - … on Reinforcement Learning Theory at the 38th …, 2021 - jonathannlee.com
Off-policy policy evaluation is a fundamental problem in reinforcement learning. As a result,
many estimators with different tradeoffs have been developed; however, selecting the best …

Non-stationary off-policy optimization

J Hong, B Kveton, M Zaheer… - International …, 2021 - proceedings.mlr.press
Off-policy learning is a framework for evaluating and optimizing policies without deploying
them, from data collected by another policy. Real-world environments are typically non …

Doubly Robust Estimator for Off-Policy Evaluation with Large Action Spaces

T Shimizu, L Forastiere - 2023 IEEE Symposium Series on …, 2023 - ieeexplore.ieee.org
We study Off-Policy Evaluation (OPE) in contextual bandit settings with large action spaces.
The benchmark estimators suffer from severe bias and variance tradeoffs. Parametric …
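The doubly robust (DR) estimator these large-action-space variants build on combines a reward model with an IPW correction and stays consistent if either component is well specified. A minimal sketch of the standard DR form (the paper's parametric refinements are not reproduced here):

```python
import numpy as np

def doubly_robust_estimate(rewards, iw, q_logged, q_target):
    """Standard doubly robust OPE estimate.

    rewards:  observed rewards r_t
    iw:       importance weights pi_e(a_t|x_t) / pi_b(a_t|x_t)
    q_logged: reward-model prediction for the logged action, q(x_t, a_t)
    q_target: model value under the target policy,
              sum_a pi_e(a|x_t) * q(x_t, a)
    Consistent if either the propensities or the reward model are
    correct, hence "doubly robust"; with many actions the iw term is
    what drives the variance the paper targets.
    """
    return np.mean(q_target + iw * (rewards - q_logged))
```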