Behavior Learning Based on a Policy Gradient Method: Separation of Environmental Dynamics and State Values in Policies

  • Seiji Ishihara
  • Harukazu Igarashi
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5351)


Policy gradient methods are a useful approach to reinforcement learning. In our policy gradient approach to agent behavior learning, an agent's decision at each time step is defined as the problem of minimizing an objective function. In this paper, we propose an objective function built from two types of parameters: one representing the environmental dynamics and the other representing state-value functions. We derive a separate learning rule for each type so that the two parameter sets can be learned independently. This separation makes it possible to reuse learned state-value functions for agents placed in environments with different dynamics, even when those dynamics are stochastic. Simulation experiments in which hunter agents learn policies for pursuit problems show the effectiveness of our method.
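The abstract does not state the objective function explicitly. Purely as an illustration, the Python sketch below assumes a Boltzmann policy π(a|s) ∝ exp(−E(a;s)/T) whose objective E(a;s) = −Σ_{s'} p(s'|s,a) v(s') combines hypothetical transition parameters p (the environmental dynamics) and state values v. Under that assumption, the characteristic eligibility ∂ ln π(a|s)/∂w = −(1/T)[∂E(a;s)/∂w − Σ_b π(b|s) ∂E(b;s)/∂w] splits into one expression for v and another for p, so each set can be updated on its own with a REINFORCE-style rule Δw = ε r Σ_t e_w(a_t; s_t). All function and parameter names are hypothetical, and p is treated as unconstrained weights that are not renormalized after an update; the paper may handle this differently.

```python
import math

# Assumed (hypothetical) model: E(a; s) = -sum_{s'} p[s][a][s'] * v[s'],
# pi(a|s) proportional to exp(-E(a; s) / T).

def energy(p, v, s, a):
    """Objective E(a; s) combining dynamics parameters p and state values v."""
    return -sum(p_sas * v[s2] for s2, p_sas in enumerate(p[s][a]))

def policy(p, v, s, T):
    """Boltzmann policy over actions in state s at temperature T."""
    logits = [-energy(p, v, s, a) / T for a in range(len(p[s]))]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    return [w / z for w in weights]

def eligibility_v(p, v, s, a, T):
    """d ln pi(a|s) / d v[s'] = (p[s][a][s'] - sum_b pi(b|s) p[s][b][s']) / T."""
    pi = policy(p, v, s, T)
    return [
        (p[s][a][s2] - sum(pi[b] * p[s][b][s2] for b in range(len(pi)))) / T
        for s2 in range(len(v))
    ]

def eligibility_p(p, v, s, a, T):
    """d ln pi(a|s) / d p[s][b][s'] = (delta_ab - pi(b|s)) * v[s'] / T.
    Only entries whose first index is s are nonzero; returned indexed [b][s']."""
    pi = policy(p, v, s, T)
    return [
        [((1.0 if b == a else 0.0) - pi[b]) * v[s2] / T for s2 in range(len(v))]
        for b in range(len(pi))
    ]

def reinforce_update(p, v, episode, reward, T, lr, learn_v=True, learn_p=False):
    """Episodic update w += lr * reward * sum_t e_w(a_t; s_t), done in place.
    All eligibilities are accumulated with the pre-update parameters first."""
    grads_v = [0.0] * len(v)
    grads_p = {}
    for s, a in episode:
        if learn_v:
            for s2, g in enumerate(eligibility_v(p, v, s, a, T)):
                grads_v[s2] += g
        if learn_p:
            for b, row in enumerate(eligibility_p(p, v, s, a, T)):
                for s2, g in enumerate(row):
                    grads_p[(s, b, s2)] = grads_p.get((s, b, s2), 0.0) + g
    if learn_v:
        for s2 in range(len(v)):
            v[s2] += lr * reward * grads_v[s2]
    for (s, b, s2), g in grads_p.items():
        p[s][b][s2] += lr * reward * g
```

Calling `reinforce_update` with `learn_p=False` keeps the dynamics fixed and adjusts only the state values; under this assumed model, the learned `v` could then be paired with a different `p` describing another environment.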





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Seiji Ishihara (1)
  • Harukazu Igarashi (2)
  1. Kinki University, Hiroshima, Japan
  2. Shibaura Institute of Technology, Tokyo, Japan
