Basis Expansion in Natural Actor Critic Methods

  • Sertan Girgin
  • Philippe Preux
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5323)

Abstract

In reinforcement learning, the aim of the agent is to find a policy that maximizes its expected return. Policy gradient methods pursue this goal by approximating the policy directly with a parametric function approximator: the expected return of the current policy is estimated, and the policy parameters are updated by steepest ascent in the direction of the gradient of the expected return with respect to those parameters. In general, the policy is defined in terms of a set of basis functions that capture important features of the problem. Since the quality of the resulting policies depends directly on this set of basis functions, and defining them by hand becomes harder as the complexity of the problem increases, it is important to be able to find them automatically. In this paper, we propose a new approach that uses the cascade-correlation learning architecture to automatically construct a set of basis functions within the context of Natural Actor-Critic (NAC) algorithms. Such basis functions allow more complex policies to be represented, and consequently improve the performance of the resulting policies. We also demonstrate the effectiveness of the method empirically.
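
As a brief sketch of the standard formulation behind this setting (the notation below is assumed for illustration and is not quoted from the paper: θ denotes the policy parameters, φ(s,a) the basis functions, J(θ) the expected return, and F(θ) the Fisher information matrix of the policy), a linearly parameterized Gibbs policy and its vanilla and natural gradient updates can be written as

\[
\pi_\theta(a \mid s) \;\propto\; \exp\!\big(\theta^\top \phi(s,a)\big),
\qquad
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \,\right],
\]
\[
\text{vanilla ascent:}\quad \theta \;\leftarrow\; \theta + \alpha\,\nabla_\theta J(\theta),
\qquad
\text{natural ascent (NAC):}\quad \theta \;\leftarrow\; \theta + \alpha\, F(\theta)^{-1}\,\nabla_\theta J(\theta).
\]

Enlarging the set of basis functions φ, which the paper proposes to do automatically via cascade-correlation, enlarges the class of policies this parameterization can represent.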

Keywords

Basis Function · Reinforcement Learning · Basis Expansion · Natural Gradient · Steepest Ascent


Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Sertan Girgin¹
  • Philippe Preux¹,²
  1. Team-Project SequeL, INRIA Lille Nord-Europe, France
  2. LIFL (UMR CNRS), Université de Lille, France
