Applied Intelligence, Volume 38, Issue 4, pp 479–488

Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game

Abstract

The two-armed bandit problem is a classical optimization problem in which a decision maker sequentially pulls one of two arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, so one must balance exploiting existing knowledge about the arms against obtaining new information. Bandit problems are particularly fascinating because a large class of real-world problems, including routing, Quality of Service (QoS) control, game playing, and resource allocation, can be solved in a decentralized manner when modeled as a system of interacting gambling machines.

Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel scheme for decentralized decision making based on the Goore Game in which each decision maker is inherently Bayesian in nature, yet avoids computational intractability by relying simply on updating the hyperparameters of sibling conjugate priors, and on random sampling from the resulting posteriors. We further report theoretical results on the variance of the random rewards experienced by each individual decision maker. Based on these theoretical results, each decision maker is able to accelerate its own learning by taking advantage of the increasingly reliable feedback that is obtained as exploration gradually turns into exploitation in bandit-based learning.
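The idea of updating conjugate-prior hyperparameters and sampling from the posteriors can be illustrated for a single decision maker. The following is a minimal sketch, not the scheme of the paper: it assumes Bernoulli rewards, a uniform Beta(1, 1) prior on each arm, and it omits both the Goore Game referee and the variance-based acceleration. The class name `BayesianTwoArmedLearner`, the function `run`, and the reward probabilities are hypothetical choices made for illustration.

```python
import random


class BayesianTwoArmedLearner:
    """Two-armed Bernoulli bandit learner with one Beta posterior per arm."""

    def __init__(self, rng):
        self.rng = rng
        # Assumed prior: Beta(1, 1), i.e. uniform, stored as [alpha, beta] per arm.
        self.params = [[1.0, 1.0], [1.0, 1.0]]

    def choose(self):
        # Draw one sample from each arm's posterior and pull the arm
        # whose sampled reward probability is larger.
        samples = [self.rng.betavariate(a, b) for a, b in self.params]
        return 0 if samples[0] >= samples[1] else 1

    def update(self, arm, reward):
        # Conjugate update: a reward increments alpha, a penalty increments beta.
        if reward:
            self.params[arm][0] += 1.0
        else:
            self.params[arm][1] += 1.0


def run(reward_probs, steps, seed=0):
    """Simulate one learner on a stationary two-armed Bernoulli bandit."""
    rng = random.Random(seed)
    learner = BayesianTwoArmedLearner(rng)
    pulls = [0, 0]
    for _ in range(steps):
        arm = learner.choose()
        reward = rng.random() < reward_probs[arm]
        learner.update(arm, reward)
        pulls[arm] += 1
    return learner, pulls


# Illustrative reward probabilities (assumed, not taken from the paper).
learner, pulls = run(reward_probs=(0.3, 0.7), steps=2000, seed=0)
print("pulls per arm:", pulls)
print("posterior hyperparameters:", learner.params)
```

Each step costs only two Beta draws and one counter increment, which is why this style of Bayesian learning avoids the intractability of exact Bayesian planning. Early on, the wide posteriors make both arms likely to be sampled; as evidence accumulates, the posteriors narrow and the learner shifts from exploration to exploitation.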

Extensive experiments, involving QoS control in simulated wireless sensor networks, demonstrate that the accelerated learning allows us to combine the benefit of conservative learning, namely high accuracy, with the benefit of hurried learning, namely fast convergence. In this manner, our scheme outperforms recently proposed Goore Game solution schemes, in which one has to trade off accuracy against speed. As an additional benefit, performance also becomes more stable. We thus believe that our methodology opens avenues for improved performance in a number of applications of bandit-based decentralized decision making.

Keywords

Bandit problems; Goore Game; Bayesian learning; Decentralized decision making; Quality of service control; Wireless sensor networks

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Department of Information and Communication Technology, University of Agder, Kristiansand, Norway
