Episodic Reinforcement Learning by Logistic Reward-Weighted Regression

  • Daan Wierstra
  • Tom Schaul
  • Jan Peters
  • Juergen Schmidhuber
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5163)


It has been a long-standing goal in the adaptive control community to reduce the generically difficult, general reinforcement learning (RL) problem to simpler problems solvable by supervised learning. While this approach is today’s standard for value function-based methods, fewer approaches are known that apply similar reductions to policy search methods. Recently, it has been shown that immediate RL problems can be solved by reward-weighted regression, and that the resulting algorithm is an expectation maximization (EM) algorithm with strong guarantees. In this paper, we extend this algorithm to the episodic case and show that it can be used in the context of LSTM recurrent neural networks (RNNs). The resulting RNN training algorithm is equivalent to a weighted self-modeling supervised learning technique. We focus on partially observable Markov decision problems (POMDPs) where it is essential that the policy is nonstationary in order to be optimal. We show that this new reward-weighted logistic regression used in conjunction with an RNN architecture can solve standard benchmark POMDPs with ease.
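
The reward-weighted regression idea mentioned above can be illustrated, in the immediate-reward setting only, with a minimal Python sketch. This is not the episodic LSTM algorithm of the paper: the toy task, the 0/1 reward, the function names (`policy_prob`, `rwr_iteration`, `train`), and all hyperparameters are assumptions chosen for illustration. The sketch samples actions from a logistic policy, then refits that policy to its own actions by maximum likelihood, weighting each sample by the reward it received.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def policy_prob(w, x):
    """P(action = 1 | x) under a logistic policy with weights w."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def reward(x, a):
    # Hypothetical toy task (an assumption, not from the paper):
    # action 1 is rewarded when x[0] > 0, action 0 otherwise.
    return 1.0 if a == (1 if x[0] > 0 else 0) else 0.0


def rwr_iteration(w, rng, n_samples=200, fit_steps=50, lr=0.5):
    """One sample-then-refit round of reward-weighted logistic regression."""
    # Sampling step: act with the current policy and record rewards.
    data = []
    for _ in range(n_samples):
        x = (rng.choice((-1.0, 1.0)), 1.0)  # (feature, constant bias input)
        a = 1 if rng.random() < policy_prob(w, x) else 0
        data.append((x, a, reward(x, a)))

    total = sum(r for _, _, r in data)
    if total == 0.0:
        return list(w)  # no rewarded samples, nothing to fit

    # Fitting step: gradient ascent on the reward-weighted log-likelihood
    # sum_i r_i * log pi(a_i | x_i), normalised by the total reward.
    w = list(w)
    for _ in range(fit_steps):
        grad = [0.0] * len(w)
        for x, a, r in data:
            err = a - policy_prob(w, x)
            for j, xj in enumerate(x):
                grad[j] += r * err * xj
        w = [wj + lr * gj / total for wj, gj in zip(w, grad)]
    return w


def train(n_iterations=10, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]  # start from a uniform policy
    for _ in range(n_iterations):
        w = rwr_iteration(w, rng)
    return w
```

Calling `train()` returns a weight vector, and `policy_prob(w, (1.0, 1.0))` then gives the learned probability of choosing action 1 in the positive context. Because the rewards act only as sample weights, each fitting step is an ordinary weighted supervised learning problem on the policy's own behaviour.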


Keywords: Expectation Maximization; Reinforcement Learning; Recurrent Neural Network; Expectation Maximization Algorithm; Reinforcement Learning Problem
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Daan Wierstra (1)
  • Tom Schaul (1)
  • Jan Peters (2)
  • Juergen Schmidhuber (1, 3)

  1. IDSIA, Manno-Lugano, Switzerland
  2. MPI for Biological Cybernetics, Tübingen, Germany
  3. Technical University Munich, Garching, Germany
