Off-policy evaluation for large action spaces via conjunct effect modeling
We study off-policy evaluation (OPE) of contextual bandit policies for large discrete action
spaces where conventional importance-weighting approaches suffer from excessive …
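The degradation this abstract refers to — importance weighting becoming high-variance as the action space grows — can be illustrated with a minimal sketch of a vanilla inverse-propensity-scoring (IPS) estimator on synthetic logs. This is a toy setup for intuition, not the paper's estimator; the function names and the bandit environment are invented for illustration:

```python
import random

def ips_estimate(logs, target_prob):
    """Vanilla IPS estimate of a target policy's value from logged
    (action, logging_propensity, reward) tuples."""
    return sum(target_prob(a) / p * r for a, p, r in logs) / len(logs)

def simulate(n_actions, n_logs=1000, seed=0):
    rng = random.Random(seed)
    # Uniform logging policy; reward 1 only for action 0.
    logs = []
    for _ in range(n_logs):
        a = rng.randrange(n_actions)
        logs.append((a, 1.0 / n_actions, 1.0 if a == 0 else 0.0))
    # Deterministic target policy that always plays action 0, so its
    # true value is exactly 1.
    target = lambda a: 1.0 if a == 0 else 0.0
    return ips_estimate(logs, target)

# As n_actions grows, the logging policy almost never plays action 0,
# so the estimate rests on a handful of huge weights (= n_actions each)
# and its variance blows up, even though it stays unbiased.
for k in (10, 1000):
    print(k, simulate(k))
```

With one action the importance weight is always 1 and the estimate is exact; with many actions the same estimator swings wildly between runs, which is the failure mode the conjunct-effect-modeling approach targets.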
Off-policy evaluation of slate bandit policies via optimizing abstraction
We study off-policy evaluation (OPE) in the problem of slate contextual bandits where a
policy selects multi-dimensional actions known as slates. This problem is widespread in …
Safe optimal design with applications in off-policy learning
Motivated by practical needs in online experimentation and off-policy learning, we study the
problem of safe optimal design, where we develop a data logging policy that efficiently …
Exploiting correlated auxiliary feedback in parameterized bandits
We study a novel variant of the parameterized bandits problem in which the learner can
observe additional auxiliary feedback that is correlated with the observed reward. The …
Distributional Off-Policy Evaluation for Slate Recommendations
Recommendation strategies are typically evaluated by using previously logged data,
employing off-policy evaluation methods to estimate their expected performance. However …
Safe data collection for offline and online policy learning
Motivated by practical needs of experimentation and policy learning in online platforms, we
study the problem of safe data collection. Specifically, our goal is to develop a logging policy …
Data Efficient Deep Reinforcement Learning With Action-Ranked Temporal Difference Learning
In value-based deep reinforcement learning (RL), value function approximation errors lead
to suboptimal policies. Temporal difference (TD) learning is one of the most important …
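The TD learning this last abstract builds on can be sketched in its simplest tabular form. This is plain TD(0) on a toy two-state chain for intuition only — it is not the paper's action-ranked variant, and the state names and step sizes are invented for illustration:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular TD(0) step: move V[s] toward the bootstrapped
    target r + gamma * V[s_next] (terminal states contribute 0)."""
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

# Toy chain: s0 -> s1 (reward 0), s1 -> terminal (reward 1).
# The fixed point is V[s1] = 1 and V[s0] = gamma * V[s1] = 0.99.
V = {}
for _ in range(500):
    td0_update(V, "s0", 0.0, "s1")
    td0_update(V, "s1", 1.0, None)  # None marks the terminal state
print(V)
```

Because the update bootstraps off the current estimate `V[s_next]`, any approximation error in that estimate propagates backward through the chain — the error-accumulation problem that value-based deep RL methods, including the action-ranked TD approach above, try to mitigate.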