Service interruption on Monday 11 July from 12:30 to 13:00: all the sites of the CCSD (HAL, Epiciences, SciencesConf, AureHAL) will be inaccessible (network hardware connection).
Abstract : The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933.
In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches
the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical
comparison with other optimal policies, experiments that have been lacking in the literature until now for the Bernoulli case.
Emilie Kaufmann, Nathaniel Korda, Rémi Munos. Thompson Sampling : an asymptotically optimal finite time analysis. International Conference on Algorithmic Learning Theory, Nov 2012, Lyon, France. pp.199-213. ⟨hal-02286442⟩