Skip to main content

Episodic Reinforcement Learning by Logistic Reward-Weighted Regression

  • Conference paper
Artificial Neural Networks - ICANN 2008 (ICANN 2008)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 5163))

Included in the following conference series:


It has been a long-standing goal in the adaptive control community to reduce the generically difficult, general reinforcement learning (RL) problem to simpler problems solvable by supervised learning. While this approach is today’s standard for value function-based methods, fewer approaches are known that apply similar reductions to policy search methods. Recently, it has been shown that immediate RL problems can be solved by reward-weighted regression, and that the resulting algorithm is an expectation maximization (EM) algorithm with strong guarantees. In this paper, we extend this algorithm to the episodic case and show that it can be used in the context of LSTM recurrent neural networks (RNNs). The resulting RNN training algorithm is equivalent to a weighted self-modeling supervised learning technique. We focus on partially observable Markov decision problems (POMDPs) where it is essential that the policy is nonstationary in order to be optimal. We show that this new reward-weighted logistic regression used in conjunction with an RNN architecture can solve standard benchmark POMDPs with ease.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
EUR 32.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
USD 149.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Similar content being viewed by others


  1. Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artificial Intelligence 101 (1998)

    Google Scholar 

  2. Aoki, M.: Optimization of Stochastic Systems. Academic Press, New York (1967)

    MATH  Google Scholar 

  3. Baxter, J., Bartlett, P.: Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research 15, 319–350 (2001)

    Article  MATH  MathSciNet  Google Scholar 

  4. Wierstra, D., Foerster, A., Peters, J., Schmidhuber, J.: Solving deep memory pomdps with recurrent policy gradients. In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic, D.P. (eds.) ICANN 2007. LNCS, vol. 4668, pp. 697–706. Springer, Heidelberg (2007)

    Chapter  Google Scholar 

  5. Peters, J., Schaal, S.: Reinforcement learning by reward-weighted regression for operational space control. In: Proceedings of the International Conference on Machine Learning (ICML) (2007)

    Google Scholar 

  6. Dayan, P., Hinton, G.E.: Using expectation-maximization for reinforcement learning. Neural Computation 9(2), 271–278 (1997)

    Article  MATH  Google Scholar 

  7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)

    Article  Google Scholar 

  8. Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In: Kremer, S.C., Kolen, J.F. (eds.) A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, Los Alamitos (2001)

    Google Scholar 

  9. Schmidhuber, J.: RNN overview (2004),

  10. Werbos, P.: Back propagation through time: What it does and how to do it. Proceedings of the IEEE 78, 1550–1560 (1990)

    Article  Google Scholar 

  11. Chernoff, H., Moses, L.E.: Elementary Decision Theory. Dover Publications (1987)

    Google Scholar 

  12. Kleinbaum, D.G., Klein, M., Pryor, E.R.: Logistic Regression, 2nd edn. Springer, Heidelberg (2002)

    MATH  Google Scholar 

  13. James, M.R., Singh, S., Littman, M.L.: Planning with predictive state representations. In: Proceedings 2004 International Conference on Machine Learning and Applications, pp. 304–311 (2004)

    Google Scholar 

  14. Bowling, M., McCracken, P., James, M., Neufeld, J., Wilkinson, D.: Learning predictive state representations using non-blind policies. In: ICML 2006: Proceedings of the 23rd international conference on Machine learning, pp. 129–136. ACM, New York (2006)

    Chapter  Google Scholar 

  15. Bakker, B.: Reinforcement learning with long short-term memory. In: Advances in Neural Information Processing Syst., vol. 14 (2002)

    Google Scholar 

Download references

Author information

Authors and Affiliations


Editor information

Véra Kůrková Roman Neruda Jan Koutník

Rights and permissions

Reprints and permissions

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wierstra, D., Schaul, T., Peters, J., Schmidhuber, J. (2008). Episodic Reinforcement Learning by Logistic Reward-Weighted Regression. In: Kůrková, V., Neruda, R., Koutník, J. (eds) Artificial Neural Networks - ICANN 2008. ICANN 2008. Lecture Notes in Computer Science, vol 5163. Springer, Berlin, Heidelberg.

Download citation

  • DOI:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-87535-2

  • Online ISBN: 978-3-540-87536-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics