Doubly robust policy evaluation and optimization

M Dudík, D Erhan, J Langford, L Li - 2014 - projecteuclid.org
We study sequential decision making in environments where rewards are only partially
observed, but can be modeled as a function of observed contexts and the chosen action by …
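
For reference, the doubly robust estimator studied in this line of work combines a regression (direct-method) estimate with an importance-weighted correction. A minimal sketch for a deterministic target policy, with illustrative function names (not code from the paper):

    import numpy as np

    def dr_value(contexts, actions, rewards, propensities, target_policy, reward_model):
        """Doubly robust off-policy value estimate for a deterministic target policy."""
        pi_actions = np.array([target_policy(x) for x in contexts])
        # Direct-method term: the reward model evaluated at the target policy's actions.
        dm = np.array([reward_model(x, a) for x, a in zip(contexts, pi_actions)])
        # Importance-weighted correction, applied only where the logged action
        # matches the target action; propensities are p(a_i | x_i) under logging.
        match = (np.asarray(actions) == pi_actions).astype(float)
        model_at_logged = np.array([reward_model(x, a) for x, a in zip(contexts, actions)])
        correction = match / np.asarray(propensities) * (np.asarray(rewards) - model_at_logged)
        return float(np.mean(dm + correction))

The estimate is consistent if either the reward model or the propensities are accurate, which is the "doubly robust" property.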

A practical guide of off-policy evaluation for bandit problems

M Kato, K Abe, K Ariu, S Yasui - arXiv preprint arXiv:2010.12470, 2020 - arxiv.org
Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from
samples obtained via different policies. Recently, applying OPE methods for bandit …
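
The baseline such guides typically start from is inverse propensity scoring (IPS). A generic sketch, assuming known behavior propensities (illustrative, not code from the paper):

    import numpy as np

    def ips_value(rewards, behavior_probs, target_probs):
        # Reweight each logged reward by the ratio of the target policy's
        # probability to the behavior policy's probability for the logged action.
        w = np.asarray(target_probs) / np.asarray(behavior_probs)
        return float(np.mean(w * np.asarray(rewards)))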

A novel evaluation methodology for assessing off-policy learning methods in contextual bandits

N Hassanpour, R Greiner - Canadian Conference on Artificial Intelligence, Canadian AI 2018, 2018 - Springer
We propose a novel evaluation methodology for assessing off-policy learning methods in
contextual bandits. In particular, we provide a way to use data from any given Randomized …

More robust doubly robust off-policy evaluation

M Farajtabar, Y Chow… - International Conference on Machine Learning, 2018 - proceedings.mlr.press
We study the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where
the goal is to estimate the performance of a policy from the data generated by another policy …

Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning

O Sakhi, I Aouali, P Alquier, N Chopin - arXiv preprint arXiv:2405.14335, 2024 - arxiv.org
This work investigates the offline formulation of the contextual bandit problem, where the
goal is to leverage past interactions collected under a behavior policy to evaluate, select …
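
As a rough illustration of what "pessimistic" means here, a common device is to shrink large importance weights so the estimate becomes a conservative lower bound when rewards are nonnegative. The sketch below uses simple weight clipping, a cruder stand-in for the paper's logarithmic smoothing:

    import numpy as np

    def clipped_ips_value(rewards, behavior_probs, target_probs, clip=10.0):
        # Capping the weights can only lower the estimate when rewards are
        # nonnegative, trading variance for a pessimistic (downward) bias.
        w = np.minimum(np.asarray(target_probs) / np.asarray(behavior_probs), clip)
        return float(np.mean(w * np.asarray(rewards)))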

Behaviour policy estimation in off-policy policy evaluation: Calibration matters

A Raghu, O Gottesman, Y Liu, M Komorowski… - arXiv preprint arXiv …, 2018 - arxiv.org
In this work, we consider the problem of estimating a behaviour policy for use in Off-Policy
Policy Evaluation (OPE) when the true behaviour policy is unknown. Via a series of …
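
The setting is the common one where propensities must themselves be fit from the log before importance weighting. A minimal sketch of that pipeline using scikit-learn (illustrative; the paper's point is that the calibration of these estimates matters for downstream OPE):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def estimated_propensities(contexts, actions):
        # Fit a model of p(a | x) on the logged data, then read off the
        # predicted probability of each action that was actually taken.
        clf = LogisticRegression(max_iter=1000).fit(contexts, actions)
        probs = clf.predict_proba(contexts)
        cols = np.searchsorted(clf.classes_, actions)
        return probs[np.arange(len(actions)), cols]

These plug-in propensities then replace the true behavior probabilities in an estimator such as IPS.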

Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits

MF Taufiq, A Doucet, R Cornish… - Advances in Neural Information Processing Systems, 2024 - proceedings.neurips.cc
Off-Policy Evaluation (OPE) in contextual bandits is crucial for assessing new
policies using existing data without costly experimentation. However, current OPE methods …

Safe policy learning through extrapolation: Application to pre-trial risk assessment

E Ben-Michael, DJ Greiner, K Imai, Z Jiang - arXiv preprint arXiv …, 2021 - arxiv.org
Algorithmic recommendations and decisions have become ubiquitous in today's society.
Many of these and other data-driven policies, especially in the realm of public policy, are …

Off-policy evaluation of bandit algorithm from dependent samples under batch update policy

M Kato, Y Kaneko - arXiv preprint arXiv:2010.13554, 2020 - arxiv.org
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data
obtained via a behavior policy. However, because the contextual bandit algorithm updates …

Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning

N Kallus, M Uehara - Advances in Neural Information Processing Systems, 2019 - proceedings.neurips.cc
Off-policy evaluation (OPE) in both contextual bandits and reinforcement learning allows
one to evaluate novel decision policies without needing to conduct exploration, which is …
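
A classic example of the boundedness property named in the title is the self-normalized (weighted) importance sampling estimator, sketched below as an illustration of the property rather than as the authors' proposal:

    import numpy as np

    def snips_value(rewards, behavior_probs, target_probs):
        # Dividing by the sum of weights keeps the estimate within the range
        # of the observed rewards, unlike plain IPS, which can blow up.
        w = np.asarray(target_probs) / np.asarray(behavior_probs)
        return float(np.sum(w * np.asarray(rewards)) / np.sum(w))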