Off-policy evaluation of bandit algorithm from dependent samples under batch update policy
M Kato, Y Kaneko - arXiv preprint arXiv:2010.13554, 2020 - arxiv.org
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data
obtained via a behavior policy. However, because the contextual bandit algorithm updates …
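The baseline that this line of work starts from is the inverse probability weighting (IPW/IPS) estimator. A minimal sketch of the plain i.i.d. version follows; all function and variable names are illustrative, not taken from the paper, whose point is precisely that batch-updated behavior policies make the logged samples dependent:

```python
import numpy as np

def ips_estimate(rewards, actions, behavior_probs, target_probs):
    """Plain IPS/IPW estimate of a target policy's value.

    rewards[i]        : observed reward in logged round i
    actions[i]        : action chosen by the behavior policy
    behavior_probs[i] : probability the behavior policy assigned to actions[i]
    target_probs[i]   : probability the target policy assigns to actions[i]

    Assumes i.i.d. logged data; the paper above studies the dependent-sample
    case where the behavior policy is updated in batches.
    """
    weights = target_probs / behavior_probs  # per-round importance weights
    return np.mean(weights * rewards)
```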
A practical guide of off-policy evaluation for bandit problems
Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from
samples obtained via different policies. Recently, applying OPE methods for bandit …
Off-policy evaluation via adaptive weighting with data from contextual bandits
It has become increasingly common for data to be collected adaptively, for example using
contextual bandits. Historical data of this type can be used to evaluate other treatment …
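A rough sketch of the idea of adaptive weighting, under stated assumptions: instead of averaging the importance-weighted terms uniformly, each round t gets a weight h_t, and one variance-stabilizing choice discussed in this line of work is h_t proportional to the square root of the behavior propensity. This is illustrative only; the paper's actual estimator is built on augmented (AIPW) terms and chooses the weights more carefully:

```python
import numpy as np

def adaptively_weighted_ips(rewards, behavior_probs, target_probs):
    """Adaptively weighted IPS, a rough sketch (not the paper's estimator).

    Each round's importance-weighted term is scaled by an adaptive weight
    h_t = sqrt(behavior propensity), then normalized by the sum of the h_t,
    which damps the variance contributed by rounds with tiny propensities.
    """
    iw = target_probs / behavior_probs   # per-round importance weights
    h = np.sqrt(behavior_probs)          # adaptive weights h_t
    return np.sum(h * iw * rewards) / np.sum(h)
```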
Open bandit dataset and pipeline: Towards realistic and reproducible off-policy evaluation
Off-policy evaluation (OPE) aims to estimate the performance of hypothetical policies using
data generated by a different policy. Because of its huge potential impact in practice, there …
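The accompanying software is released as the `obp` Python package, which ships a small sample of the dataset so the workflow runs out of the box. A sketch based on the package's documented interface (argument names may differ across versions); the uniform `action_dist` below is a toy stand-in for a learned target policy:

```python
import numpy as np
from obp.dataset import OpenBanditDataset
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting

# logged feedback collected on ZOZOTOWN by the "random" behavior policy
dataset = OpenBanditDataset(behavior_policy="random", campaign="all")
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# toy target policy: uniform over actions, with the required shape
# (n_rounds, n_actions, len_list); a real evaluation would compute this
# by running the candidate policy over the logged contexts
n_rounds = bandit_feedback["n_rounds"]
action_dist = np.full(
    (n_rounds, dataset.n_actions, dataset.len_list),
    1.0 / dataset.n_actions,
)

ope = OffPolicyEvaluation(
    bandit_feedback=bandit_feedback,
    ope_estimators=[InverseProbabilityWeighting()],
)
estimated_values = ope.estimate_policy_values(action_dist=action_dist)
print(estimated_values)  # dict: estimator name -> estimated policy value
```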
Adaptive estimator selection for off-policy evaluation
Y Su, P Srinath… - … Conference on Machine …, 2020 - proceedings.mlr.press
We develop a generic data-driven method for estimator selection in off-policy policy
evaluation settings. We establish a strong performance guarantee for the method, showing …
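The method is based on Lepski's principle: order the candidate estimators by decreasing confidence-interval width (decreasing variance, increasing bias) and select the last one whose interval still intersects the intervals of all estimators before it. A simplified sketch of that selection rule, with illustrative names and without the paper's interval-inflation details:

```python
def lepski_select(estimates, widths):
    """Lepski-style estimator selection for OPE, simplified.

    estimates[i] : i-th estimator's point estimate
    widths[i]    : a valid confidence-interval half-width for estimator i,
                   with estimators ordered by decreasing width
    Returns the index of the last estimator whose interval overlaps the
    running intersection of all earlier intervals.
    """
    lo, hi = estimates[0] - widths[0], estimates[0] + widths[0]
    chosen = 0
    for i in range(1, len(estimates)):
        l, h = estimates[i] - widths[i], estimates[i] + widths[i]
        lo, hi = max(lo, l), min(hi, h)  # running intersection of intervals
        if lo > hi:                      # intersection became empty: stop
            break
        chosen = i
    return chosen
```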
Asymptotically Unbiased Off-Policy Policy Evaluation when Reusing Old Data in Nonstationary Environments
In this work, we consider the off-policy policy evaluation problem for contextual bandits and
finite horizon reinforcement learning in the nonstationary setting. Reusing old data is critical …
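For context, a common baseline for reusing old data under nonstationarity is to down-weight older rounds geometrically before importance weighting. The sketch below shows that generic baseline, not the bias-corrected estimator this paper proposes; all names are illustrative:

```python
import numpy as np

def decayed_ips(rewards, behavior_probs, target_probs, decay=0.9):
    """Geometrically decayed IPS: a generic nonstationary baseline.

    Older rounds receive weight decay**age (age = 0 for the newest round),
    trading the variance reduction of reusing old data against the bias
    those stale samples introduce.
    """
    n = len(rewards)
    age = np.arange(n)[::-1]              # 0 for the most recent round
    w_time = decay ** age                 # geometric decay with age
    iw = target_probs / behavior_probs    # per-round importance weights
    return np.sum(w_time * iw * rewards) / np.sum(w_time)
```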
Large-scale open dataset, pipeline, and benchmark for bandit algorithms
Y Saito, S Aihara, M Matsutani… - arXiv preprint arXiv …, 2020 - dynamicdecisions.github.io
We build and publicize the Open Bandit Dataset to facilitate scalable and reproducible
research on bandit algorithms. It is especially suitable for off-policy evaluation (OPE), which …
Improved estimator selection for off-policy evaluation
Off-policy policy evaluation is a fundamental problem in reinforcement learning. As a result,
many estimators with different tradeoffs have been developed; however, selecting the best …
Non-stationary off-policy optimization
Off-policy learning is a framework for evaluating and optimizing policies without deploying
them, from data collected by another policy. Real-world environments are typically non …
Doubly Robust Estimator for Off-Policy Evaluation with Large Action Spaces
T Shimizu, L Forastiere - 2023 IEEE Symposium Series on …, 2023 - ieeexplore.ieee.org
We study Off-Policy Evaluation (OPE) in contextual bandit settings with large action spaces.
The benchmark estimators suffer from severe bias and variance tradeoffs. Parametric …
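The doubly robust template this paper refines combines a fitted reward model with an IPS correction on the logged action; the estimate is consistent if either component is accurate. A minimal sketch of the standard contextual-bandit DR estimator, with illustrative names (the large-action-space issue is that the importance weights `iw` below blow up as the number of actions grows):

```python
import numpy as np

def dr_estimate(rewards, actions, behavior_probs, target_dist, q_hat):
    """Standard doubly robust (DR) estimate of a target policy's value.

    rewards[i]        : observed reward in round i
    actions[i]        : logged action in round i
    behavior_probs[i] : behavior propensity of the logged action
    target_dist[i, a] : target policy's probability of action a in round i
    q_hat[i, a]       : fitted reward model's prediction for (i, a)
    """
    n = len(rewards)
    idx = np.arange(n)
    direct = np.sum(target_dist * q_hat, axis=1)         # model-based term
    iw = target_dist[idx, actions] / behavior_probs      # importance weights
    correction = iw * (rewards - q_hat[idx, actions])    # IPS correction
    return np.mean(direct + correction)
```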