Off-policy evaluation via adaptive weighting with data from contextual bandits - Academic Resource Search

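A survey of deep reinforcement learning

Wang Haonan, Liu Ning, Zhang Yiyun, Feng Dawei, Huang Feng… - Frontiers of Information Technology & Electronic Engineering …, 2022 - fitee.zjujournals.com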

… This not only ensures the stability of optimization but also makes full use of off-policy data to
improve … Second, the output layers of the model are adapted to the target domain. Finally, the …

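[PDF] A survey of deep reinforcement learning: with a discussion of the development of computer Go

Zhao Dongbin, Shao Kun, Zhu Yuanheng, Li Dong, Chen Yaran, Wang Haitao… - Control Theory and …, 2016 - researchgate.net

… Monte Carlo methods can also be combined with the off-policy idea, yielding off-policy Monte Carlo learning, which can
… One current research trend is to use offline estimation to handle the contextual bandit problem. For example, Microsoft Research…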

Cited by 38 · Related articles · All 4 versions