Off-policy evaluation for large action spaces via conjunct effect modeling

Y Saito, Q Ren, T Joachims - International Conference on …, 2023 - proceedings.mlr.press
We study off-policy evaluation (OPE) of contextual bandit policies for large discrete action
spaces where conventional importance-weighting approaches suffer from excessive …

Off-policy evaluation of slate bandit policies via optimizing abstraction

H Kiyohara, M Nomura, Y Saito - Proceedings of the ACM on Web …, 2024 - dl.acm.org
We study off-policy evaluation (OPE) in the problem of slate contextual bandits where a
policy selects multi-dimensional actions known as slates. This problem is widespread in …

Safe optimal design with applications in off-policy learning

R Zhu, B Kveton - International Conference on Artificial …, 2022 - proceedings.mlr.press
Motivated by practical needs in online experimentation and off-policy learning, we study the
problem of safe optimal design, where we develop a data logging policy that efficiently …

Exploiting correlated auxiliary feedback in parameterized bandits

A Verma, Z Dai, Y Shu… - Advances in Neural …, 2024 - proceedings.neurips.cc
We study a novel variant of the parameterized bandits problem in which the learner can
observe additional auxiliary feedback that is correlated with the observed reward. The …

Distributional Off-Policy Evaluation for Slate Recommendations

S Chaudhari, D Arbour, G Theocharous… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Recommendation strategies are typically evaluated by using previously logged data,
employing off-policy evaluation methods to estimate their expected performance. However …

Safe data collection for offline and online policy learning

R Zhu, B Kveton - arXiv preprint arXiv:2111.04835, 2021 - arxiv.org
Motivated by practical needs of experimentation and policy learning in online platforms, we
study the problem of safe data collection. Specifically, our goal is to develop a logging policy …

Data Efficient Deep Reinforcement Learning With Action-Ranked Temporal Difference Learning

Q Liu, Y Li, Y Liu, K Lin, J Gao… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
In value-based deep reinforcement learning (RL), value function approximation errors lead
to suboptimal policies. Temporal difference (TD) learning is one of the most important …