Off-policy evaluation for large action spaces via conjunct effect modeling
We study off-policy evaluation (OPE) of contextual bandit policies for large discrete action
spaces where conventional importance-weighting approaches suffer from excessive …
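The degradation this abstract refers to — importance weighting becoming high-variance as the action space grows — can be illustrated with a minimal sketch of a vanilla inverse-propensity-scoring (IPS) estimator on synthetic logs. This is a toy setup for intuition, not the paper's estimator; the function names and the bandit environment are invented for illustration:

```python
import random

def ips_estimate(logs, target_prob):
    """Vanilla IPS estimate of a target policy's value from logged
    (action, logging_propensity, reward) tuples."""
    return sum(target_prob(a) / p * r for a, p, r in logs) / len(logs)

def simulate(n_actions, n_logs=1000, seed=0):
    rng = random.Random(seed)
    # Uniform logging policy; reward 1 only for action 0.
    logs = []
    for _ in range(n_logs):
        a = rng.randrange(n_actions)
        logs.append((a, 1.0 / n_actions, 1.0 if a == 0 else 0.0))
    # Deterministic target policy that always plays action 0, so its
    # true value is exactly 1.
    target = lambda a: 1.0 if a == 0 else 0.0
    return ips_estimate(logs, target)

# As n_actions grows, the logging policy almost never plays action 0,
# so the estimate rests on a handful of huge weights (= n_actions each)
# and its variance blows up, even though it stays unbiased.
for k in (10, 1000):
    print(k, simulate(k))
```

With one action the importance weight is always 1 and the estimate is exact; with many actions the same estimator swings wildly between runs, which is the failure mode the conjunct-effect-modeling approach targets.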
Off-policy evaluation of slate bandit policies via optimizing abstraction
We study off-policy evaluation (OPE) in the problem of slate contextual bandits where a
policy selects multi-dimensional actions known as slates. This problem is widespread in …
Safe optimal design with applications in off-policy learning
Motivated by practical needs in online experimentation and off-policy learning, we study the
problem of safe optimal design, where we develop a data logging policy that efficiently …
Exploiting correlated auxiliary feedback in parameterized bandits
We study a novel variant of the parameterized bandits problem in which the learner can
observe additional auxiliary feedback that is correlated with the observed reward. The …
Distributional Off-Policy Evaluation for Slate Recommendations
Recommendation strategies are typically evaluated by using previously logged data,
employing off-policy evaluation methods to estimate their expected performance. However …
Safe data collection for offline and online policy learning
Motivated by practical needs of experimentation and policy learning in online platforms, we
study the problem of safe data collection. Specifically, our goal is to develop a logging policy …
Data Efficient Deep Reinforcement Learning With Action-Ranked Temporal Difference Learning
In value-based deep reinforcement learning (RL), value function approximation errors lead
to suboptimal policies. Temporal difference (TD) learning is one of the most important …
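The TD learning this last abstract builds on can be sketched in its simplest tabular form. This is plain TD(0) on a toy two-state chain for intuition only — it is not the paper's action-ranked variant, and the state names and step sizes are invented for illustration:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular TD(0) step: move V[s] toward the bootstrapped
    target r + gamma * V[s_next] (terminal states contribute 0)."""
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

# Toy chain: s0 -> s1 (reward 0), s1 -> terminal (reward 1).
# The fixed point is V[s1] = 1 and V[s0] = gamma * V[s1] = 0.99.
V = {}
for _ in range(500):
    td0_update(V, "s0", 0.0, "s1")
    td0_update(V, "s1", 1.0, None)  # None marks the terminal state
print(V)
```

Because the update bootstraps off the current estimate `V[s_next]`, any approximation error in that estimate propagates backward through the chain — the error-accumulation problem that value-based deep RL methods, including the action-ranked TD approach above, try to mitigate.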