Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game

Applied Intelligence

Abstract

The two-armed bandit problem is a classical optimization problem in which a decision maker sequentially pulls one of two arms attached to a gambling machine, with each pull yielding a random reward. The reward distributions are unknown, so one must balance exploiting existing knowledge about the arms against obtaining new information. Bandit problems are particularly fascinating because a large class of real-world problems, including routing, Quality of Service (QoS) control, game playing, and resource allocation, can be solved in a decentralized manner when modeled as a system of interacting gambling machines.

Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel scheme for decentralized decision making based on the Goore Game, in which each decision maker is inherently Bayesian in nature yet avoids computational intractability by relying simply on updating the hyperparameters of sibling conjugate priors and on random sampling from the resulting posteriors. We further report theoretical results on the variance of the random rewards experienced by each individual decision maker. Based on these results, each decision maker is able to accelerate its own learning by taking advantage of the increasingly reliable feedback that is obtained as exploration gradually turns into exploitation in bandit-based learning.
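As an illustration of the basic mechanism described above (updating conjugate-prior hyperparameters and sampling from the posteriors), the following is a minimal sketch for a two-armed Bernoulli bandit. It assumes Beta priors with a uniform Beta(1, 1) starting point; the class and function names, the reward probabilities, and the seed are hypothetical, and the sketch does not include the variance-based acceleration that the paper reports.

```python
import random


class BetaArm:
    """Beta(alpha, beta) posterior over one arm's unknown reward probability."""

    def __init__(self):
        # Uniform Beta(1, 1) prior: an assumption made for this sketch.
        self.alpha = 1.0
        self.beta = 1.0

    def sample(self, rng):
        # Draw one plausible reward probability from the current posterior.
        return rng.betavariate(self.alpha, self.beta)

    def update(self, rewarded):
        # Conjugate update: only the two hyperparameters change.
        if rewarded:
            self.alpha += 1.0
        else:
            self.beta += 1.0


def run_two_armed_bandit(reward_probs, pulls, seed=0):
    """Pull the arm whose posterior sample is largest, then update that arm."""
    rng = random.Random(seed)
    arms = [BetaArm(), BetaArm()]
    counts = [0, 0]
    for _ in range(pulls):
        samples = [arm.sample(rng) for arm in arms]
        chosen = 0 if samples[0] >= samples[1] else 1
        rewarded = rng.random() < reward_probs[chosen]
        arms[chosen].update(rewarded)
        counts[chosen] += 1
    return counts, arms


# Hypothetical reward probabilities, chosen only for illustration.
counts, arms = run_two_armed_bandit((0.8, 0.3), pulls=2000)
print("pulls per arm:", counts)
```

Because each decision only requires two posterior draws and one hyperparameter increment, the cost per pull stays constant, which is the sense in which sampling sidesteps the intractability of exact Bayesian planning.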

Extensive experiments, involving QoS control in simulated wireless sensor networks, demonstrate that the accelerated learning allows us to combine the benefit of conservative learning, which is high accuracy, with the benefit of hurried learning, which is fast convergence. In this manner, our scheme outperforms recently proposed Goore Game solution schemes, where one has to trade off accuracy against speed. As an additional benefit, performance also becomes more stable. We thus believe that our methodology opens avenues for improved performance in a number of applications of bandit-based decentralized decision making.
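For readers unfamiliar with the Goore Game, the sketch below shows the standard setup: each player votes yes or no, a referee computes the fraction of yes votes, and every player is then rewarded independently with a probability given by a unimodal function of that fraction. The players here use plain Beta posterior sampling, not the accelerated scheme of the paper; the performance function, its peak at 0.3, the number of players, and all names are assumptions made for illustration only.

```python
import math
import random


def referee_reward_prob(yes_fraction, optimum=0.3, width=0.1):
    """Unimodal performance function peaked at `optimum` (hypothetical shape)."""
    return 0.9 * math.exp(-(((yes_fraction - optimum) / width) ** 2))


def play_goore_game(n_players=10, rounds=3000, seed=1):
    """Simulate the game and return the yes-vote fraction of every round."""
    rng = random.Random(seed)
    # posteriors[i][v] = [alpha, beta] for player i and vote v (0 = no, 1 = yes).
    posteriors = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(n_players)]
    history = []
    for _ in range(rounds):
        votes = []
        for player in posteriors:
            # Each player samples both of its posteriors and votes for the larger.
            draws = [rng.betavariate(a, b) for a, b in player]
            votes.append(1 if draws[1] > draws[0] else 0)
        yes_fraction = sum(votes) / n_players
        reward_prob = referee_reward_prob(yes_fraction)
        for player, vote in zip(posteriors, votes):
            # The referee rewards every player independently; players never
            # observe each other's votes or the yes fraction itself.
            if rng.random() < reward_prob:
                player[vote][0] += 1.0
            else:
                player[vote][1] += 1.0
        history.append(yes_fraction)
    return history


history = play_goore_game()
print("mean yes fraction over last 500 rounds:", sum(history[-500:]) / 500)
```

The only coupling between players is the shared reward probability, which is what makes the scheme decentralized: each player runs its own two-armed bandit against a reward signal shaped by everyone's votes.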

Notes

  1. By this we mean that P is not a fixed function. Rather, it denotes the probability function for a random variable, given as an argument to P.

Author information

Corresponding author

Correspondence to Ole-Christoffer Granmo.

Additional information

A preliminary version of this paper was presented at IEA/AIE’11, the 2011 International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Syracuse, NY, USA, in June 2011.

About this article

Cite this article

Granmo, OC., Glimsdal, S. Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game. Appl Intell 38, 479–488 (2013). https://doi.org/10.1007/s10489-012-0346-z
