Solving Deep Memory POMDPs with Recurrent Policy Gradients

  • Conference paper
Artificial Neural Networks – ICANN 2007 (ICANN 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4668)

Abstract

This paper presents Recurrent Policy Gradients, a model-free reinforcement learning (RL) method that creates limited-memory stochastic policies for partially observable Markov decision problems (POMDPs) that require long-term memories of past observations. The approach approximates a policy gradient for a Recurrent Neural Network (RNN) by backpropagating return-weighted characteristic eligibilities through time. Using a “Long Short-Term Memory” architecture, we are able to outperform other RL methods on two important benchmark tasks. Furthermore, we show promising results on a complex car driving simulation task.
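
The gradient estimator described in the abstract combines REINFORCE-style characteristic eligibilities with backpropagation through time (BPTT) in a recurrent policy. The following Python/PyTorch sketch illustrates that general idea only and is not the authors' implementation: it unrolls a small LSTM policy over one episode, weights each step's log-action-probability by the discounted return, and lets automatic differentiation backpropagate the resulting gradients through time. The environment interface (env.reset, env.step), the network sizes, and the mean-return baseline are illustrative assumptions.

import torch
import torch.nn as nn
from torch.distributions import Categorical


class RecurrentPolicy(nn.Module):
    """LSTM policy: the recurrent state serves as memory over past observations."""

    def __init__(self, obs_dim, n_actions, hidden=32):
        super().__init__()
        self.cell = nn.LSTMCell(obs_dim, hidden)   # memory cell
        self.head = nn.Linear(hidden, n_actions)   # action logits

    def forward(self, obs, state):
        h, c = self.cell(obs, state)               # state=None starts from zeros
        return self.head(h), (h, c)


def run_episode(env, policy, gamma=0.99):
    """Roll out one episode; collect per-step log-probabilities and returns-to-go."""
    obs = torch.as_tensor(env.reset(), dtype=torch.float32).unsqueeze(0)
    state, log_probs, rewards, done = None, [], [], False
    while not done:
        logits, state = policy(obs, state)
        dist = Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        next_obs, reward, done = env.step(action.item())   # assumed env API
        rewards.append(reward)
        obs = torch.as_tensor(next_obs, dtype=torch.float32).unsqueeze(0)
    returns, G = [], 0.0
    for r in reversed(rewards):                    # discounted return-to-go G_t
        G = r + gamma * G
        returns.insert(0, G)
    return torch.cat(log_probs), torch.tensor(returns)


def update(policy, optimizer, log_probs, returns):
    """Return-weighted eligibilities; loss.backward() is BPTT through the unrolled LSTM."""
    baseline = returns.mean()                      # simple variance-reducing baseline
    loss = -((returns - baseline) * log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# Typical usage with a hypothetical discrete-action environment `env`:
#   policy = RecurrentPolicy(obs_dim=4, n_actions=2)
#   optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
#   for _ in range(1000):
#       log_probs, returns = run_episode(env, policy)
#       update(policy, optimizer, log_probs, returns)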


Editor information

Joaquim Marques de Sá, Luís A. Alexandre, Włodzisław Duch, Danilo Mandic

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wierstra, D., Foerster, A., Peters, J., Schmidhuber, J. (2007). Solving Deep Memory POMDPs with Recurrent Policy Gradients. In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic, D. (eds) Artificial Neural Networks – ICANN 2007. ICANN 2007. Lecture Notes in Computer Science, vol 4668. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74690-4_71

  • DOI: https://doi.org/10.1007/978-3-540-74690-4_71

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74689-8

  • Online ISBN: 978-3-540-74690-4

  • eBook Packages: Computer Science, Computer Science (R0)
