Can we make Thompson sampling work for mechanism design in stochastic multi-armed bandit problems?
In this work we consider mechanism design for stochastic multi-armed bandit problems and explore, for the first time, a Bayesian approach to the problem. In particular, we propose a Thompson sampling based algorithm to learn the rewards of the strategic agents. The Thompson sampling algorithm achieves strong theoretical regret guarantees; however, its randomized nature introduces unique, non-trivial challenges for mechanism design. Our main contribution is the TSM-D mechanism, which achieves the same social welfare regret as the Thompson sampling algorithm. We further prove the following properties of TSM-D: (a) the variance in agent utilities asymptotically tends to zero; (b) the mechanism is ex-post individually rational with high probability; and (c) the mechanism is within-period dominant strategy incentive compatible with high probability.
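For readers unfamiliar with the underlying learning algorithm, the following is a minimal sketch of standard Thompson sampling for Bernoulli bandits with Beta(1,1) priors. This is background only: it shows the allocation step that the abstract refers to, not the TSM-D mechanism itself, whose payment rules and incentive properties are developed in the paper. All names and parameter values here are illustrative assumptions.

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Standard Thompson sampling for Bernoulli bandits (Beta(1,1) priors).

    Illustrative sketch only: the TSM-D mechanism in the paper builds a
    payment rule on top of this randomized allocation step; payments are
    not shown here.
    """
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1] * k  # Beta posterior: successes + 1
    beta = [1] * k   # Beta posterior: failures + 1
    total_reward = 0
    for _ in range(horizon):
        # Sample a mean estimate from each arm's Beta posterior
        # and play the arm with the highest sample.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward, alpha, beta

# Example run: with enough rounds, the best arm (mean 0.8) is played most.
reward, alpha, beta = thompson_sampling([0.3, 0.5, 0.8], horizon=2000)
```

The randomized arm selection above is exactly the source of the mechanism-design difficulty the abstract mentions: because allocations are stochastic, agent utilities vary across runs, which motivates the variance and high-probability guarantees proved for TSM-D.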
Reference:
Ganesh Ghalme, Shweta Jain, Sujit Gujar, and Y. Narahari. Thompson sampling based mechanisms for stochastic multi-armed bandit problems. In Proceedings of AAMAS 2017 (16th International Conference on Autonomous Agents and Multiagent Systems), São Paulo, Brazil, 2017, pp. 87-95.