On index policies for restless bandit problems

S. Guha, K. Munagala, P. Shi. arXiv preprint arXiv:0711.3861, 2007.
Abstract
In this paper, we consider the restless bandit problem, which is one of the most well-studied generalizations of the celebrated stochastic multi-armed bandit problem in decision theory. In its ultimate generality, the restless bandit problem is known to be PSPACE-Hard to approximate to any non-trivial factor, and little progress has been made on this problem despite its significance in modeling activity allocation under uncertainty. We make progress on this fundamental problem by showing that for an interesting and general subclass that we term RECOVERING bandits, a surprisingly simple and intuitive greedy policy yields a factor 2 approximation. Such greedy policies are termed index policies, and are popular due to their simplicity and their optimality for the stochastic multi-armed bandit problem. We develop several novel techniques in the design and analysis of the index policy. Our algorithm proceeds by modifying the constraints to the dual of a well-known LP relaxation to the restless bandit problem. This is followed by a structural characterization of the optimal solution by using both the exact primal as well as dual complementary slackness conditions. This yields an interpretation of the dual variables as potential functions from which we derive the index policy and the associated analysis. We believe such techniques will be of independent interest in related stochastic scheduling problems. The RECOVERING bandit problem strictly generalizes the stochastic multi-armed bandit problem, and naturally models multi-project scheduling where the state of a project becomes increasingly uncertain when the project is not scheduled. Such a notion is similar to (but not quite the same as) the notion of a Partially Observable Markov Decision Process (POMDP).
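As a purely illustrative aid (not the index derived in the paper), the sketch below shows the general shape of a greedy index policy for a restless bandit: each arm carries a scalar index computed from its current state, the policy plays the arm with the largest index, and the states of all arms, played or not, evolve every step. The arm dynamics, the recovery_rate parameter, and the index_of function here are hypothetical placeholders chosen only to make the loop concrete; in the paper the index comes from dual variables of an LP relaxation.

```python
# Illustrative sketch of a generic index policy for a restless bandit.
# NOTE: the "recovering" arm dynamics and the index used here are placeholders
# for exposition only; they are NOT the model or the index from the paper.

import random


class Arm:
    """A toy arm whose expected reward drifts back toward a base value the
    longer it sits idle, so idle arms become more attractive over time
    (hypothetical dynamics chosen only for illustration)."""

    def __init__(self, base_reward, recovery_rate):
        self.base_reward = base_reward
        self.recovery_rate = recovery_rate
        self.idle_time = 0  # steps since this arm was last played

    def expected_reward(self):
        # Reward "recovers" with idle time, approaching the base value.
        return self.base_reward * (1.0 - (1.0 - self.recovery_rate) ** (self.idle_time + 1))

    def play(self):
        # Playing the arm collects a noisy reward and resets its idle time.
        reward = random.gauss(self.expected_reward(), 0.05)
        self.idle_time = 0
        return reward

    def rest(self):
        # Restless: the state keeps evolving even when the arm is not played.
        self.idle_time += 1


def index_of(arm):
    # Placeholder index: the arm's current expected reward. A real index
    # policy would derive this quantity from the problem structure
    # (in the paper, from the dual of an LP relaxation).
    return arm.expected_reward()


def run_index_policy(arms, horizon):
    total = 0.0
    for _ in range(horizon):
        # Greedy step: play the arm whose index is currently largest.
        chosen = max(arms, key=index_of)
        total += chosen.play()
        for arm in arms:
            if arm is not chosen:
                arm.rest()
    return total


if __name__ == "__main__":
    random.seed(0)
    arms = [Arm(base_reward=b, recovery_rate=r)
            for b, r in [(1.0, 0.3), (0.8, 0.6), (0.6, 0.9)]]
    print("total reward over 100 steps:", round(run_index_policy(arms, 100), 2))
```

The point of the sketch is the control structure: an index policy never plans over joint states of all arms; it reduces each decision to a per-arm score and a single argmax, which is what makes such policies simple to run even when the exact restless bandit problem is intractable.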