
Tug-of-War Model for Multi-armed Bandit Problem

  • Song-Ju Kim
  • Masashi Aono
  • Masahiko Hara
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6079)

Abstract

We propose a model – the “tug-of-war (TOW) model” – for conducting unique parallel searches using many nonlocally correlated search agents. The model is based on a property of the single-celled amoeboid organism, the true slime mold Physarum, which maintains a constant intracellular resource volume while collecting environmental information by concurrently expanding and shrinking its branches. This conservation law entails a “nonlocal correlation” among the branches: a volume increment in one branch is immediately compensated by volume decrement(s) in the other branch(es). Such nonlocal correlation has been shown to be useful for decision making in the face of a dilemma. The multi-armed bandit problem is to determine the optimal strategy for maximizing the total reward under the incompatible demands of exploration and exploitation. Our model can efficiently manage this “exploration–exploitation dilemma” and exhibits good performance: its average accuracy rate is higher than those of well-known algorithms such as the modified ε-greedy algorithm and the modified softmax algorithm.
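The abstract does not spell out the update rule, so the following is only a minimal sketch of the conservation idea it describes: each arm has a displacement, the displacements always sum to a constant (the conserved “resource volume”), and any gain for the chosen arm is compensated by losses spread over the other arms. The class name `TugOfWarAgent` and the parameters `gain`, `loss`, and `noise` are illustrative assumptions, not the authors’ formulation.

```python
import random


class TugOfWarAgent:
    """Illustrative tug-of-war (TOW) bandit agent.

    Minimal sketch of the conservation principle described in the abstract,
    not the authors' exact update rule: the total displacement is held
    constant, so an increment of one arm is compensated by decrements of
    the others (the "nonlocal correlation").
    """

    def __init__(self, n_arms, gain=1.0, loss=1.0, noise=0.1):
        self.n_arms = n_arms
        self.gain = gain      # displacement added to a rewarded arm (assumed)
        self.loss = loss      # displacement removed from an unrewarded arm (assumed)
        self.noise = noise    # fluctuation amplitude used for exploration (assumed)
        self.x = [0.0] * n_arms  # displacements; their sum stays at 0

    def select(self):
        # Play the arm whose (noisy) displacement is currently largest.
        scores = [xi + random.uniform(-self.noise, self.noise) for xi in self.x]
        return max(range(self.n_arms), key=lambda k: scores[k])

    def update(self, arm, rewarded):
        # Move the chosen arm's displacement up (reward) or down (no reward)...
        delta = self.gain if rewarded else -self.loss
        self.x[arm] += delta
        # ...and compensate the other arms equally so the total is conserved.
        for k in range(self.n_arms):
            if k != arm:
                self.x[k] -= delta / (self.n_arms - 1)


if __name__ == "__main__":
    # Usage: a two-armed Bernoulli bandit with hidden reward probabilities.
    probs = [0.3, 0.6]
    agent = TugOfWarAgent(len(probs))
    total = 0
    for _ in range(500):
        arm = agent.select()
        reward = random.random() < probs[arm]
        agent.update(arm, reward)
        total += reward
    print("total reward:", total)
```

Because the compensation step couples every arm to the one just played, information gathered by one “branch” immediately shifts the standing of all the others, which is the parallel-search behaviour the model attributes to the amoeba.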

Keywords

Multi-armed bandit problem · reinforcement learning · bio-inspired computation · amoeba-based computing



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Song-Ju Kim (1)
  • Masashi Aono (1)
  • Masahiko Hara (1)

  1. Flucto-Order Functions Research Team, RIKEN-HYU Collaboration Research Center, Advanced Science Institute, RIKEN; Fusion Technology Center 5F, Hanyang University, 17 Haengdang-dong, Seongdong-gu, Seoul 133-791, Korea; 2-1 Hirosawa, Wako-shi, Saitama 351-0198, Japan
