Policy Gradient Critics

  • Daan Wierstra
  • Jürgen Schmidhuber
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4701)


We present Policy Gradient Actor-Critic (PGAC), a new model-free Reinforcement Learning (RL) method for creating limited-memory stochastic policies for Partially Observable Markov Decision Processes (POMDPs) that require long-term memories of past observations and actions. The approach involves estimating a policy gradient for an Actor through a Policy Gradient Critic which evaluates probability distributions on actions. Gradient-based updates of history-conditional action probability distributions enable the algorithm to learn a mapping from memory states (or event histories) to probability distributions on actions, solving POMDPs through a combination of memory and stochasticity. This goes beyond previous approaches to learning purely reactive POMDP policies, without giving up their advantages. Preliminary results on important benchmark tasks show that our approach can in principle be used as a general purpose POMDP algorithm that solves RL problems in both continuous and discrete action domains.
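The actor-critic idea described above — a stochastic policy conditioned on a memory of past observations, updated along a score-function gradient with a learned critic as baseline — can be sketched on a tiny toy task. The code below is an illustrative sketch only, not the authors' PGAC implementation (which uses recurrent networks and evaluates distributions over actions); the task, variable names, and learning rates are all invented for the example.

```python
import math
import random

random.seed(0)

# Toy two-step POMDP (hypothetical, for illustration only): at t=0 the
# agent observes a cue c in {0, 1}; at t=1 the cue is hidden and the agent
# picks an action a in {0, 1}. Reward is 1 if a == c, else 0. A purely
# reactive (memoryless) policy averages 0.5; a stochastic policy
# conditioned on a *remembered* cue can reach 1.0 -- the kind of
# limited-memory policy the abstract describes.

theta = [0.0, 0.0]   # actor: logit of P(a=1 | remembered cue)
value = [0.0, 0.0]   # critic: value estimate per memory state
ALPHA_ACTOR, ALPHA_CRITIC = 0.5, 0.2

def p_action1(memory):
    """Stochastic policy: probability of action 1 given the memory state."""
    return 1.0 / (1.0 + math.exp(-theta[memory]))

for episode in range(2000):
    cue = random.randrange(2)        # observed at t=0, hidden afterwards
    memory = cue                     # the agent's stored history feature
    p1 = p_action1(memory)
    a = 1 if random.random() < p1 else 0
    r = 1.0 if a == cue else 0.0

    # Critic-baselined score-function (REINFORCE-style) update:
    # the critic's value estimate serves as a baseline for the actor.
    advantage = r - value[memory]
    grad_log_pi = a - p1             # d/d(theta) of log pi(a | memory)
    theta[memory] += ALPHA_ACTOR * advantage * grad_log_pi
    value[memory] += ALPHA_CRITIC * advantage
```

After training, the policy's action probabilities should approach a deterministic match with the remembered cue, while the critic's values approach the expected reward — illustrating how memory plus a gradient-trained stochastic policy solves a task no reactive policy can.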


Keywords: Reinforcement Learning · Recurrent Neural Network · Partially Observable Markov Decision Process · Reinforcement Learning Algorithm · Reinforcement Learning Method



Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Daan Wierstra (1)
  • Jürgen Schmidhuber (1, 2)

  1. Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA), CH-6928 Manno-Lugano, Switzerland
  2. Department of Embedded Systems and Robotics, Technical University Munich, D-85748 Garching, Germany
