Abstract
The two-armed bandit problem is a classical optimization problem in which a decision maker sequentially pulls one of two arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, so one must balance exploiting existing knowledge about the arms against obtaining new information. Bandit problems are particularly fascinating because a large class of real-world problems, including routing, Quality of Service (QoS) control, game playing, and resource allocation, can be solved in a decentralized manner when modeled as a system of interacting gambling machines.
Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel scheme for decentralized decision making based on the Goore Game, in which each decision maker is inherently Bayesian in nature yet avoids computational intractability by relying simply on updating the hyperparameters of sibling conjugate priors and on random sampling from the resulting posteriors. We further report theoretical results on the variance of the random rewards experienced by each individual decision maker. Based on these theoretical results, each decision maker is able to accelerate its own learning by taking advantage of the increasingly reliable feedback that is obtained as exploration gradually turns into exploitation in bandit-based learning.
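The core mechanism described above, updating conjugate-prior hyperparameters and sampling from the posteriors, can be illustrated with a minimal sketch. It assumes Bernoulli (reward/penalty) feedback with one Beta posterior per arm, which the abstract does not state explicitly. The class name `TwoArmedBayesianLearner`, the function `run`, and the reward probabilities are hypothetical choices made for illustration, and the variance-based acceleration reported in the paper is not reproduced here.

```python
import random


class TwoArmedBayesianLearner:
    """Illustrative Beta-Bernoulli sampler for two arms (assumed setup)."""

    def __init__(self, rng=None):
        # One Beta(alpha, beta) posterior per arm; Beta(1, 1) is the uniform prior.
        self.alpha = [1, 1]
        self.beta = [1, 1]
        self.rng = rng or random.Random()

    def select_arm(self):
        # Draw one sample from each posterior and pull the arm with the larger draw.
        draws = [
            self.rng.betavariate(self.alpha[i], self.beta[i]) for i in range(2)
        ]
        return 0 if draws[0] > draws[1] else 1

    def update(self, arm, reward):
        # Conjugate update: a reward increments alpha, a penalty increments beta.
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1


def run(reward_probs=(0.8, 0.2), rounds=2000, seed=0):
    """Simulate the learner on two Bernoulli arms; return pull counts per arm."""
    rng = random.Random(seed)
    learner = TwoArmedBayesianLearner(rng)
    pulls = [0, 0]
    for _ in range(rounds):
        arm = learner.select_arm()
        reward = rng.random() < reward_probs[arm]  # hypothetical environment
        learner.update(arm, reward)
        pulls[arm] += 1
    return pulls
```

Calling `run()` returns how often each arm was pulled. Because only two counters per arm are updated at each step, the per-round cost stays constant, which is the sense in which such a scheme sidesteps the intractability of exact Bayesian planning.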
Extensive experiments involving QoS control in simulated wireless sensor networks demonstrate that the accelerated learning allows us to combine the high accuracy of conservative learning with the fast convergence of hurried learning. In this manner, our scheme outperforms recently proposed Goore Game solution schemes, where one has to trade off accuracy against speed. As an additional benefit, performance also becomes more stable. We thus believe that our methodology opens avenues for improved performance in a number of applications of bandit-based decentralized decision making.
Notes
By this we mean that P is not a fixed function. Rather, it denotes the probability function for a random variable, given as an argument to P.
Additional information
A preliminary version of this paper was presented at IEA/AIE’11, the 2011 International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Syracuse, NY, USA, in June 2011.
Cite this article
Granmo, OC., Glimsdal, S. Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game. Appl Intell 38, 479–488 (2013). https://doi.org/10.1007/s10489-012-0346-z
DOI: https://doi.org/10.1007/s10489-012-0346-z