Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game
The two-armed bandit problem is a classical optimization problem in which a decision maker sequentially pulls one of two arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, so one must balance exploiting existing knowledge about the arms against obtaining new information. Bandit problems are particularly fascinating because a large class of real-world problems, including routing, Quality of Service (QoS) control, game playing, and resource allocation, can be solved in a decentralized manner when modeled as a system of interacting gambling machines.
Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel scheme for decentralized decision making based on the Goore Game in which each decision maker is inherently Bayesian in nature, yet avoids computational intractability by relying solely on updating the hyperparameters of sibling conjugate priors and on random sampling from the resulting posteriors. We further report theoretical results on the variance of the random rewards experienced by each individual decision maker. Based on these results, each decision maker is able to accelerate its own learning by taking advantage of the increasingly reliable feedback that is obtained as exploration gradually turns into exploitation in bandit-problem-based learning.
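The conjugate-prior updating and posterior sampling described above can be illustrated with a minimal sketch, assuming Bernoulli rewards with Beta priors (i.e., Thompson-style sampling); the class name and its methods are hypothetical, and the paper's actual decision makers and acceleration mechanism may differ in detail:

```python
import random


class BayesianTwoArmedPlayer:
    """Hypothetical sketch: a two-armed Bernoulli bandit player that
    avoids intractable Bayesian computation by only incrementing the
    hyperparameters of sibling Beta conjugate priors and sampling
    from the resulting posteriors."""

    def __init__(self):
        # Beta(1, 1) hyperparameters (uniform priors) for the two arms.
        self.alpha = [1, 1]
        self.beta = [1, 1]

    def select_arm(self):
        # Draw one sample from each arm's posterior and pull the arm
        # whose sampled reward probability is largest.
        samples = [random.betavariate(self.alpha[i], self.beta[i])
                   for i in range(2)]
        return 0 if samples[0] >= samples[1] else 1

    def update(self, arm, reward):
        # Conjugate update: a success increments alpha, a failure
        # increments beta; no other computation is required.
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1
```

As the posteriors concentrate, the sampled values increasingly favor the empirically better arm, so exploration gradually turns into exploitation without any explicit exploration schedule.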
Extensive experiments, involving QoS control in simulated wireless sensor networks, demonstrate that the accelerated learning allows us to combine the high accuracy of conservative learning with the fast convergence of hurried learning. In this manner, our scheme outperforms recently proposed Goore Game solution schemes, in which accuracy must be traded off against speed. As an additional benefit, performance also becomes more stable. We thus believe that our methodology opens avenues for improved performance in a number of applications of bandit-based decentralized decision making.
Keywords: Bandit problems · Goore Game · Bayesian learning · Decentralized decision making · Quality of service control · Wireless sensor networks