Off-policy evaluation of bandit algorithm from dependent samples under batch update policy
M Kato, Y Kaneko - arXiv preprint arXiv:2010.13554, 2020 - arxiv.org
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data
obtained via a behavior policy. However, because the contextual bandit algorithm updates …
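The baseline that this line of work starts from is the inverse probability weighting (IPW/IPS) estimator. A minimal sketch of the plain i.i.d. version follows; all function and variable names are illustrative, not taken from the paper, whose point is precisely that batch-updated behavior policies make the logged samples dependent:

```python
import numpy as np

def ips_estimate(rewards, actions, behavior_probs, target_probs):
    """Plain IPS/IPW estimate of a target policy's value.

    rewards[i]        : observed reward in logged round i
    actions[i]        : action chosen by the behavior policy
    behavior_probs[i] : probability the behavior policy assigned to actions[i]
    target_probs[i]   : probability the target policy assigns to actions[i]

    Assumes i.i.d. logged data; the paper above studies the dependent-sample
    case where the behavior policy is updated in batches.
    """
    weights = target_probs / behavior_probs  # per-round importance weights
    return np.mean(weights * rewards)
```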
A practical guide of off-policy evaluation for bandit problems
Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from
samples obtained via different policies. Recently, applying OPE methods for bandit …
Off-policy evaluation via adaptive weighting with data from contextual bandits
It has become increasingly common for data to be collected adaptively, for example using
contextual bandits. Historical data of this type can be used to evaluate other treatment …
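A rough sketch of the idea of adaptive weighting, under stated assumptions: instead of averaging the importance-weighted terms uniformly, each round t gets a weight h_t, and one variance-stabilizing choice discussed in this line of work is h_t proportional to the square root of the behavior propensity. This is illustrative only; the paper's actual estimator is built on augmented (AIPW) terms and chooses the weights more carefully:

```python
import numpy as np

def adaptively_weighted_ips(rewards, behavior_probs, target_probs):
    """Adaptively weighted IPS, a rough sketch (not the paper's estimator).

    Each round's importance-weighted term is scaled by an adaptive weight
    h_t = sqrt(behavior propensity), then normalized by the sum of the h_t,
    which damps the variance contributed by rounds with tiny propensities.
    """
    iw = target_probs / behavior_probs   # per-round importance weights
    h = np.sqrt(behavior_probs)          # adaptive weights h_t
    return np.sum(h * iw * rewards) / np.sum(h)
```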
Open bandit dataset and pipeline: Towards realistic and reproducible off-policy evaluation
Off-policy evaluation (OPE) aims to estimate the performance of hypothetical policies using
data generated by a different policy. Because of its huge potential impact in practice, there …
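The accompanying software is released as the `obp` Python package, which ships a small sample of the dataset so the workflow runs out of the box. A sketch based on the package's documented interface (argument names may differ across versions); the uniform `action_dist` below is a toy stand-in for a learned target policy:

```python
import numpy as np
from obp.dataset import OpenBanditDataset
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting

# logged feedback collected on ZOZOTOWN by the "random" behavior policy
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# toy target policy: uniform over actions, with the required shape
# (n_rounds, n_actions, len_list); a real evaluation would compute this
# by running the candidate policy over the logged contexts
n_rounds = bandit_feedback["n_rounds"]
action_dist = np.full(
    (n_rounds, dataset.n_actions, dataset.len_list),
    1.0 / dataset.n_actions,
)

ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[InverseProbabilityWeighting()],
)
estimated_values = ope.estimate_policy_values(action_dist=action_dist)
print(estimated_values)  # dict: estimator name -> estimated policy value
```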
Adaptive estimator selection for off-policy evaluation
Y Su, P Srinath… - … Conference on Machine …, 2020 - proceedings.mlr.press
We develop a generic data-driven method for estimator selection in off-policy policy
evaluation settings. We establish a strong performance guarantee for the method, showing …
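The method is based on Lepski's principle: order the candidate estimators by decreasing confidence-interval width (decreasing variance, increasing bias) and select the last one whose interval still intersects the intervals of all estimators before it. A simplified sketch of that selection rule, with illustrative names and without the paper's interval-inflation details:

```python
def lepski_select(estimates, widths):
    """Lepski-style estimator selection for OPE, simplified.

    estimates[i] : i-th estimator's point estimate
    widths[i]    : a valid confidence-interval half-width for estimator i,
                   with estimators ordered by decreasing width
    Returns the index of the last estimator whose interval overlaps the
    running intersection of all earlier intervals.
    """
    lo, hi = estimates[0] - widths[0], estimates[0] + widths[0]
    chosen = 0
    for i in range(1, len(estimates)):
        l, h = estimates[i] - widths[i], estimates[i] + widths[i]
        lo, hi = max(lo, l), min(hi, h)  # running intersection of intervals
        if lo > hi:                      # intersection became empty: stop
            break
        chosen = i
    return chosen
```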
Asymptotically Unbiased Off-Policy Policy Evaluation when Reusing Old Data in Nonstationary Environments
In this work, we consider the off-policy policy evaluation problem for contextual bandits and
finite horizon reinforcement learning in the nonstationary setting. Reusing old data is critical …
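For context, a common baseline for reusing old data under nonstationarity is to down-weight older rounds geometrically before importance weighting. The sketch below shows that generic baseline, not the bias-corrected estimator this paper proposes; all names are illustrative:

```python
import numpy as np

def decayed_ips(rewards, behavior_probs, target_probs, decay=0.9):
    """Geometrically decayed IPS: a generic nonstationary baseline.

    Older rounds receive weight decay**age (age = 0 for the newest round),
    trading the variance reduction of reusing old data against the bias
    those stale samples introduce.
    """
    n = len(rewards)
    age = np.arange(n)[::-1]              # 0 for the most recent round
    w_time = decay ** age                 # geometric decay with age
    iw = target_probs / behavior_probs    # per-round importance weights
    return np.sum(w_time * iw * rewards) / np.sum(w_time)
```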
Large-scale open dataset, pipeline, and benchmark for bandit algorithms
Y Saito, S Aihara, M Matsutani… - arXiv preprint arXiv …, 2020 - dynamicdecisions.github.io
We build and publicize the Open Bandit Dataset to facilitate scalable and reproducible
research on bandit algorithms. It is especially suitable for off-policy evaluation (OPE), which …
Improved estimator selection for off-policy evaluation
Off-policy policy evaluation is a fundamental problem in reinforcement learning. As a result,
many estimators with different tradeoffs have been developed; however, selecting the best …
Non-stationary off-policy optimization
Off-policy learning is a framework for evaluating and optimizing policies without deploying
them, from data collected by another policy. Real-world environments are typically non …
Doubly Robust Estimator for Off-Policy Evaluation with Large Action Spaces
T Shimizu, L Forastiere - 2023 IEEE Symposium Series on …, 2023 - ieeexplore.ieee.org
We study Off-Policy Evaluation (OPE) in contextual bandit settings with large action spaces.
The benchmark estimators suffer from severe bias and variance tradeoffs. Parametric …
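The doubly robust template this paper refines combines a fitted reward model with an IPS correction on the logged action; the estimate is consistent if either component is accurate. A minimal sketch of the standard contextual-bandit DR estimator, with illustrative names (the large-action-space issue is that the importance weights `iw` below blow up as the number of actions grows):

```python
import numpy as np

def dr_estimate(rewards, actions, behavior_probs, target_dist, q_hat):
    """Standard doubly robust (DR) estimate of a target policy's value.

    rewards[i]        : observed reward in round i
    actions[i]        : logged action in round i
    behavior_probs[i] : behavior propensity of the logged action
    target_dist[i, a] : target policy's probability of action a in round i
    q_hat[i, a]       : fitted reward model's prediction for (i, a)
    """
    n = len(rewards)
    idx = np.arange(n)
    direct = np.sum(target_dist * q_hat, axis=1)         # model-based term
    iw = target_dist[idx, actions] / behavior_probs      # importance weights
    correction = iw * (rewards - q_hat[idx, actions])    # IPS correction
    return np.mean(direct + correction)
```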