Politex: Regret bounds for policy iteration using expert prediction
International Conference on Machine Learning, 2019•proceedings.mlr.press
Abstract We present POLITEX (POLicy ITeration with EXpert advice), a variant of policy
iteration where each policy is a Boltzmann distribution over the sum of action-value function
estimates of the previous policies, and analyze its regret in continuing RL problems. We
assume that the value function error after running a policy for $\tau$ time steps scales as
$\epsilon(\tau) = \epsilon_0 + O(\sqrt{d/\tau})$, where $\epsilon_0$ is the worst-case
approximation error and $d$ is the number of features in a compressed representation of …
Abstract
We present POLITEX (POLicy ITeration with EXpert advice), a variant of policy iteration where each policy is a Boltzmann distribution over the sum of action-value function estimates of the previous policies, and analyze its regret in continuing RL problems. We assume that the value function error after running a policy for $\tau$ time steps scales as $\epsilon(\tau) = \epsilon_0 + O(\sqrt{d/\tau})$, where $\epsilon_0$ is the worst-case approximation error and $d$ is the number of features in a compressed representation of the state-action space. We establish that this condition is satisfied by the LSPE algorithm under certain assumptions on the MDP and policies. Under the error assumption, we show that the regret of POLITEX in uniformly mixing MDPs scales as $\tilde{O}(d^{1/2} T^{3/4} + \epsilon_0 T)$, where $\tilde{O}(\cdot)$ hides logarithmic terms and problem-dependent constants. Thus, we provide the first regret bound for a fully practical model-free method which scales only in the number of features, and not in the size of the underlying MDP. Experiments on a queuing problem confirm that POLITEX is competitive with some of its alternatives, while preliminary results on Ms Pacman (one of the standard Atari benchmark problems) confirm the viability of POLITEX beyond linear function approximation.
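The policy update described in the abstract can be illustrated with a minimal tabular sketch. It is not the authors' implementation: the names `boltzmann_policy`, `politex_sketch`, `estimate_q` and the temperature parameter `eta` are hypothetical, the state-action space is assumed to be small and finite, and action values are treated as rewards (higher is better), so a cost-based formulation would flip the sign inside the exponent. The value estimator is left as a caller-supplied function; the abstract names LSPE as one estimator that satisfies the error condition.

```python
import numpy as np

def boltzmann_policy(q_sum, eta):
    """Row-wise softmax of eta * q_sum, where q_sum has shape (n_states, n_actions)."""
    logits = eta * q_sum
    # Subtract the row maximum so the exponentials cannot overflow.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def politex_sketch(estimate_q, n_states, n_actions, n_phases, eta):
    """Run n_phases of: evaluate current policy, add the estimate to a running sum, re-softmax.

    estimate_q(policy) is a hypothetical callback returning an
    (n_states, n_actions) array of action-value estimates for `policy`.
    """
    q_sum = np.zeros((n_states, n_actions))
    policy = boltzmann_policy(q_sum, eta)  # all-zero sum gives a uniform policy
    for _ in range(n_phases):
        q_hat = estimate_q(policy)   # assumption: rewards, not costs
        q_sum += q_hat               # sum over estimates of all previous policies
        policy = boltzmann_policy(q_sum, eta)
    return policy
```

The running sum is what distinguishes this update from a softmax over only the most recent estimate: every earlier value estimate keeps contributing to the current policy, which is the connection to prediction with expert advice that the method's name refers to.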