Optimism in reinforcement learning and Kullback-Leibler divergence
2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2010
We consider model-based reinforcement learning in finite Markov Decision Processes (MDPs), focusing on so-called optimistic strategies. In MDPs, optimism can be implemented by carrying out extended value iterations under a constraint of consistency with the estimated model transition probabilities. The UCRL2 algorithm by Auer, Jaksch and Ortner (2009), which follows this strategy, has recently been shown to guarantee near-optimal regret bounds. In this paper, we strongly argue in favor of using the Kullback-Leibler (KL) divergence for this purpose. By studying the linear maximization problem under KL constraints, we provide an efficient algorithm, termed KL-UCRL, for solving KL-optimistic extended value iteration. Using recent deviation bounds on the KL divergence, we prove that KL-UCRL provides the same guarantees as UCRL2 in terms of regret. However, numerical experiments on classical benchmarks show significantly improved behavior, particularly when the MDP has reduced connectivity. To support this observation, we provide elements of comparison between the two algorithms based on geometric considerations.
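
To make the inner step concrete, below is a minimal numerical sketch (in Python, not from the paper) of the KL-constrained linear maximization that each round of KL-optimistic extended value iteration must solve: maximize q · V over probability vectors q subject to KL(p̂, q) ≤ ε, where p̂ is the empirical transition estimate for a given state-action pair. The function names (kl, kl_optimistic_q), the generic SLSQP solver, and the example numbers are illustrative assumptions; the paper itself derives a much faster specialized procedure that exploits the structure of this problem.

    import numpy as np
    from scipy.optimize import minimize

    def kl(p, q):
        """KL divergence KL(p || q) between distributions on a finite set."""
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-12))))

    def kl_optimistic_q(p_hat, v, radius):
        """Inner step of KL-optimistic extended value iteration:
            maximize   q . v
            subject to KL(p_hat || q) <= radius,  q in the probability simplex.
        Naive generic-solver sketch; KL-UCRL uses a specialized Newton-type
        method for this problem instead.
        """
        n = len(p_hat)
        x0 = (p_hat + 1e-3) / np.sum(p_hat + 1e-3)   # feasible starting point
        cons = [
            {"type": "eq",   "fun": lambda q: np.sum(q) - 1.0},     # sums to 1
            {"type": "ineq", "fun": lambda q: radius - kl(p_hat, q)},  # KL ball
        ]
        res = minimize(lambda q: -np.dot(q, v), x0,
                       bounds=[(1e-9, 1.0)] * n,
                       constraints=cons, method="SLSQP")
        return res.x

    # Example (hypothetical numbers): optimism shifts probability mass
    # toward high-value successor states, within the KL confidence ball.
    p_hat = np.array([0.5, 0.3, 0.2])   # empirical transition estimate
    v     = np.array([0.0, 1.0, 2.0])   # current value vector
    q_opt = kl_optimistic_q(p_hat, v, radius=0.05)
    print(q_opt, q_opt @ v)             # q_opt . v  >=  p_hat . v

The resulting q places more mass on high-value successors than p̂ does while staying inside the KL confidence ball, which is exactly the "optimism under a consistency constraint" the abstract describes; because the KL ball respects the geometry of the simplex, the optimistic model remains a sensible distribution even when some transitions have not been observed.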