Solving Deep Memory POMDPs with Recurrent Policy Gradients

  • Daan Wierstra
  • Alexander Foerster
  • Jan Peters
  • Jürgen Schmidhuber
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4668)


This paper presents Recurrent Policy Gradients, a model-free reinforcement learning (RL) method creating limited-memory stochastic policies for partially observable Markov decision problems (POMDPs) that require long-term memories of past observations. The approach involves approximating a policy gradient for a Recurrent Neural Network (RNN) by backpropagating return-weighted characteristic eligibilities through time. Using a “Long Short-Term Memory” architecture, we are able to outperform other RL methods on two important benchmark tasks. Furthermore, we show promising results on a complex car driving simulation task.
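
The idea of backpropagating return-weighted eligibilities through time can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single tanh recurrent unit instead of an LSTM, a binary action drawn from a sigmoid output, one scalar return per episode, and a constant baseline. All names (`forward`, `eligibility`, `policy_gradient_estimate`) and the parameter layout are hypothetical, chosen only for this example.

```python
import math

def forward(params, observations, actions):
    """Run a one-unit recurrent policy over an episode.

    params = [w_h, w_x, b, v, c] (hypothetical layout):
      h_t = tanh(w_h * h_{t-1} + w_x * x_t + b)   # memory of past observations
      p_t = sigmoid(v * h_t + c)                  # probability of action 1
    Returns the log-likelihood of the taken actions plus cached h_t and p_t.
    """
    w_h, w_x, b, v, c = params
    h_prev, log_lik = 0.0, 0.0
    hs, ps = [], []
    for x, a in zip(observations, actions):
        h = math.tanh(w_h * h_prev + w_x * x + b)
        p = 1.0 / (1.0 + math.exp(-(v * h + c)))
        log_lik += math.log(p) if a == 1 else math.log(1.0 - p)
        hs.append(h)
        ps.append(p)
        h_prev = h
    return log_lik, hs, ps

def eligibility(params, observations, actions):
    """Gradient of the episode log-likelihood w.r.t. params, computed by
    backpropagation through time (the 'characteristic eligibility')."""
    w_h, w_x, b, v, c = params
    _, hs, ps = forward(params, observations, actions)
    grad = [0.0] * 5
    dh_next = 0.0  # gradient reaching h_t from later time steps
    for t in reversed(range(len(observations))):
        dz = actions[t] - ps[t]          # d log pi / d(output pre-activation)
        grad[3] += dz * hs[t]            # v
        grad[4] += dz                    # c
        dh = dz * v + dh_next            # direct path plus path through h_{t+1}
        dpre = dh * (1.0 - hs[t] ** 2)   # back through tanh
        h_prev = hs[t - 1] if t > 0 else 0.0
        grad[0] += dpre * h_prev         # w_h
        grad[1] += dpre * observations[t]  # w_x
        grad[2] += dpre                  # b
        dh_next = dpre * w_h
    return grad

def policy_gradient_estimate(params, episodes, baseline=None):
    """Average return-weighted eligibility over sampled episodes.

    episodes is a list of (observations, actions, episode_return) tuples.
    If no baseline is given, the mean return of the batch is used.
    """
    if baseline is None:
        baseline = sum(ret for _, _, ret in episodes) / len(episodes)
    grad = [0.0] * 5
    for obs, acts, ret in episodes:
        e = eligibility(params, obs, acts)
        for i in range(5):
            grad[i] += (ret - baseline) * e[i] / len(episodes)
    return grad
```

In such a sketch, a learner would sample episodes with the current parameters, call `policy_gradient_estimate`, and take a small step along the returned vector. The recurrent state is what lets the action probability depend on observations from earlier time steps.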





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Daan Wierstra (1)
  • Alexander Foerster (1)
  • Jan Peters (2)
  • Jürgen Schmidhuber (1)

  1. IDSIA, Galleria 2, 6928 Manno-Lugano, Switzerland
  2. University of Southern California, Los Angeles, CA, USA
