Thompson sampling with less exploration is fast and optimal
Abstract
We propose $\epsilon$-Exploring Thompson Sampling ($\epsilon$-TS), a modified version of the Thompson Sampling (TS) algorithm for multi-armed bandits. In $\epsilon$-TS, arms are selected greedily based on empirical mean rewards with probability $1-\epsilon$, and based on posterior samples obtained from TS with probability $\epsilon$. Here, $\epsilon \in (0, 1)$ is a user-defined constant. By reducing exploration, $\epsilon$-TS improves computational efficiency compared to TS while achieving better regret bounds. We establish that $\epsilon$-TS is both minimax optimal and asymptotically optimal for various popular reward distributions, including Gaussian, Bernoulli, Poisson, and Gamma. A key technical advancement in our analysis is the relaxation of the requirement for a stringent anti-concentration bound of the posterior distribution, which was necessary in recent analyses that achieved similar bounds. As a result, $\epsilon$-TS maintains the posterior update structure of TS while minimizing alterations, such as clipping the sampling distribution or solving the inverse of the Kullback-Leibler (KL) divergence between reward distributions, as done in previous work. Furthermore, our algorithm is as easy to implement as TS, but operates significantly faster due to reduced exploration. Empirical evaluations confirm the efficiency and optimality of $\epsilon$-TS.
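The selection rule described in the abstract can be illustrated with a minimal Python sketch for Bernoulli rewards. This is not the authors' implementation. The function name `epsilon_ts_bernoulli`, the uniform Beta(1, 1) prior, the initial round-robin pull of each arm, and the choice to flip a single coin per round (rather than one per arm) are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def epsilon_ts_bernoulli(true_means, horizon, epsilon, seed=0):
    """Illustrative epsilon-exploring Thompson Sampling on Bernoulli arms.

    Assumptions (not taken from the source): Beta(1, 1) prior, each arm
    pulled once at the start, and one coin flip per round.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    successes = np.zeros(k)          # number of observed 1-rewards per arm
    pulls = np.zeros(k, dtype=int)   # number of times each arm was played
    total_reward = 0.0

    for t in range(horizon):
        if t < k:
            # Play each arm once so every empirical mean is defined.
            arm = t
        elif rng.random() < epsilon:
            # With probability epsilon: standard TS step, sample from each
            # arm's Beta posterior and play the largest sample.
            samples = rng.beta(1 + successes, 1 + pulls - successes)
            arm = int(np.argmax(samples))
        else:
            # With probability 1 - epsilon: greedy on empirical means,
            # which needs no posterior sampling.
            arm = int(np.argmax(successes / pulls))

        reward = float(rng.random() < true_means[arm])
        successes[arm] += reward
        pulls[arm] += 1
        total_reward += reward

    return pulls, total_reward


if __name__ == "__main__":
    pulls, total = epsilon_ts_bernoulli([0.2, 0.5, 0.8], horizon=5000, epsilon=0.1)
    print("pulls per arm:", pulls)
    print("total reward:", total)
```

In this sketch the greedy branch skips posterior sampling entirely, which is where the computational saving mentioned in the abstract would come from: only about an $\epsilon$ fraction of rounds draw from the posterior.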
proceedings.mlr.press