
Proposal and Evaluation of an Indirect Reward Assignment Method for Reinforcement Learning by Profit Sharing Method

  • Kazuteru Miyazaki
  • Naoki Kodama
  • Hiroaki Kobayashi
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 868)

Abstract

The Profit Sharing method is known to guarantee rationality when a reward is acquired in reinforcement learning. This paper proposes a method that generates indirect rewards by means of Profit Sharing in order to preserve this rationality guarantee. The proposed method is applied to Deep Q-Network, and the resulting method is named DQNbyPS. It is shown that DQNbyPS requires fewer trial-and-error searches than the original Deep Q-Network in Pong, one of the Atari 2600 games.
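
As a rough illustration of the idea described above, the following is a minimal Python sketch of Profit-Sharing-style indirect reward assignment combined with a one-step DQN target. The geometric decay function, the function names, and the way the indirect reward replaces the raw environment reward are assumptions made for illustration only; this is not the authors' DQNbyPS implementation.

```python
# Minimal sketch (not the authors' implementation) of Profit-Sharing-style
# indirect reward assignment: when a reward is obtained at the end of an
# episode, credit is propagated backwards along the episode with a
# geometrically decreasing reinforcement function. The decay constant and
# how the indirect reward enters the DQN target are illustrative assumptions.

from typing import List, Tuple


def profit_sharing_credits(episode_length: int,
                           terminal_reward: float,
                           decay: float = 0.5) -> List[float]:
    """Return an indirect reward for every step of the episode.

    The step that obtained the reward receives it in full; earlier steps
    receive geometrically smaller shares (reward * decay ** distance).
    For Profit Sharing's rationality guarantee, the decay must be small
    enough relative to the number of selectable actions.
    """
    return [terminal_reward * decay ** (episode_length - 1 - t)
            for t in range(episode_length)]


def dqn_targets_with_indirect_rewards(
        transitions: List[Tuple[float, float]],  # (indirect_reward, max_next_q)
        gamma: float = 0.99) -> List[float]:
    """Standard one-step DQN target, except that the indirect reward produced
    by Profit Sharing is used in place of the raw environment reward."""
    return [r_ind + gamma * max_next_q for r_ind, max_next_q in transitions]


if __name__ == "__main__":
    # Example: a 5-step episode that ends with reward +1 (e.g. scoring a point in Pong).
    credits = profit_sharing_credits(episode_length=5, terminal_reward=1.0, decay=0.5)
    print(credits)  # [0.0625, 0.125, 0.25, 0.5, 1.0]
```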

Keywords

Reinforcement learning · Profit sharing · Deep learning · Deep Q-network


Acknowledgment

This work was supported by JSPS KAKENHI Grant Number 17K00327.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Kazuteru Miyazaki
    • 1
  • Naoki Kodama
    • 2
  • Hiroaki Kobayashi
    • 3
  1. National Institution for Academic Degrees and Quality Enhancement of Higher Education, Kodaira, Japan
  2. Tokyo University of Science, Noda, Japan
  3. Meiji University, Kawasaki, Japan
