Abstract
We present Policy Gradient Actor-Critic (PGAC), a new model-free Reinforcement Learning (RL) method for creating limited-memory stochastic policies for Partially Observable Markov Decision Processes (POMDPs) that require long-term memories of past observations and actions. The approach involves estimating a policy gradient for an Actor through a Policy Gradient Critic which evaluates probability distributions on actions. Gradient-based updates of history-conditional action probability distributions enable the algorithm to learn a mapping from memory states (or event histories) to probability distributions on actions, solving POMDPs through a combination of memory and stochasticity. This goes beyond previous approaches to learning purely reactive POMDP policies, without giving up their advantages. Preliminary results on important benchmark tasks show that our approach can in principle be used as a general purpose POMDP algorithm that solves RL problems in both continuous and discrete action domains.
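The mechanism summarized above, a recurrent Actor producing history-conditional action distributions and a Policy Gradient Critic that evaluates those distributions, can be sketched roughly as follows. This is an illustrative sketch only, not the authors' formulation: the PyTorch classes, layer sizes, and the single gradient step shown are all assumptions made for illustration.

```python
# Illustrative sketch (not the paper's exact algorithm): a recurrent actor that maps
# observation histories to action probability distributions, and a critic that scores
# (memory state, action distribution) pairs as an estimate of expected return.
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    def __init__(self, obs_dim, hidden_dim, n_actions):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq):
        # obs_seq: (batch, time, obs_dim) -> action distribution at every step
        h, _ = self.lstm(obs_seq)
        return torch.softmax(self.head(h), dim=-1)

class DistributionCritic(nn.Module):
    """Evaluates a probability distribution on actions, given the observation history,
    and returns an estimate of expected return (the 'policy gradient critic' idea,
    in sketch form)."""
    def __init__(self, obs_dim, hidden_dim, n_actions):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.value = nn.Sequential(
            nn.Linear(hidden_dim + n_actions, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1))

    def forward(self, obs_seq, action_probs):
        h, _ = self.lstm(obs_seq)
        return self.value(torch.cat([h, action_probs], dim=-1)).squeeze(-1)

# Actor update: ascend the critic's return estimate, differentiating through the
# history-conditional action probabilities (assumed hyperparameters and dummy data).
obs_dim, hidden_dim, n_actions = 4, 16, 3
actor = RecurrentActor(obs_dim, hidden_dim, n_actions)
critic = DistributionCritic(obs_dim, hidden_dim, n_actions)
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

obs_seq = torch.randn(8, 10, obs_dim)        # dummy batch of observation histories
probs = actor(obs_seq)                       # history-conditional action distributions
actor_loss = -critic(obs_seq, probs).mean()  # follow the critic's gradient
opt.zero_grad(); actor_loss.backward(); opt.step()
```

The point the sketch tries to capture is that the actor's update follows the critic's gradient with respect to the action probability distribution, rather than a sampled-return estimate as in purely reactive policy-gradient methods.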
Keywords
- Reinforcement Learning
- Recurrent Neural Network
- Partially Observable Markov Decision Process
- Reinforcement Learning Algorithm
- Reinforcement Learning Method
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wierstra, D., Schmidhuber, J. (2007). Policy Gradient Critics. In: Kok, J.N., Koronacki, J., Mantaras, R.L.d., Matwin, S., Mladenič, D., Skowron, A. (eds) Machine Learning: ECML 2007. Lecture Notes in Computer Science, vol 4701. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74958-5_43
DOI: https://doi.org/10.1007/978-3-540-74958-5_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74957-8
Online ISBN: 978-3-540-74958-5