Machine Learning, Volume 49, Issue 2–3, pp. 267–290

Risk-Sensitive Reinforcement Learning

  • Oliver Mihatsch
  • Ralph Neuneier

Abstract

Most reinforcement learning algorithms optimize the expected return of a Markov Decision Problem. Practice has taught us that this criterion is not always the most suitable, because many applications require robust control strategies that also take the variance of the return into account. The classical control literature provides several techniques for risk-sensitive optimization goals, such as the so-called worst-case optimality criterion, which focuses exclusively on risk-avoiding policies, and classical risk-sensitive control, which transforms the returns by exponential utility functions. While the first approach is typically too restrictive, the latter suffers from the absence of an obvious way to design a corresponding model-free reinforcement learning algorithm.
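
For concreteness, the criteria mentioned above can be written as follows for the discounted return; the notation (discount factor γ, risk parameter β) is chosen here for illustration and is not taken from the paper.

```latex
% Risk-neutral criterion: maximize the expected discounted return
\max_{\pi}\; \mathbb{E}^{\pi}\!\left[R\right],
\qquad R \;=\; \sum_{t=0}^{\infty} \gamma^{t} r_{t}

% Worst-case criterion: maximize the return obtained under the worst admissible trajectory
\max_{\pi}\; \min_{\omega}\; R(\omega)

% Classical risk-sensitive control: exponential utility with risk parameter \beta < 0 (risk-averse)
\max_{\pi}\; \frac{1}{\beta}\,\log \mathbb{E}^{\pi}\!\left[e^{\beta R}\right]
\;\approx\; \mathbb{E}^{\pi}\!\left[R\right] \;+\; \frac{\beta}{2}\,\operatorname{Var}^{\pi}\!\left[R\right]
\quad\text{for small } |\beta|
```

The second-order expansion of the exponential utility criterion shows explicitly how it penalizes the variance of the return when β < 0, whereas the worst-case criterion ignores probabilities altogether.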

Our risk-sensitive reinforcement learning algorithm is based on a very different philosophy. Instead of transforming the return of the process, we transform the temporal differences during learning. While our approach reflects important properties of the classical exponential utility framework, we avoid its serious drawbacks for learning. Based on an extended set of optimality equations we are able to formulate risk-sensitive versions of various well-known reinforcement learning algorithms which converge with probability one under the usual conditions.
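
To illustrate the idea of transforming temporal differences rather than returns, the sketch below modifies tabular Q-learning by passing each temporal difference through an asymmetric weighting before the update. This is a minimal sketch under assumptions: the particular transformation, the parameter name kappa, and the Gymnasium-style environment interface are illustrative choices, not necessarily the paper's exact formulation.

```python
import numpy as np

def transform_td(delta, kappa):
    """Asymmetric transformation of the temporal difference (illustrative).

    kappa in (-1, 1): kappa > 0 down-weights positive surprises and
    up-weights negative ones; kappa = 0 leaves the TD error unchanged.
    """
    return (1.0 - kappa * np.sign(delta)) * delta

def risk_sensitive_q_learning(env, n_states, n_actions,
                              kappa=0.5, gamma=0.95, alpha=0.1,
                              epsilon=0.1, episodes=1000):
    """Tabular Q-learning with transformed temporal differences (sketch).

    Assumes integer-indexed states/actions and a Gymnasium-style API:
    env.reset() -> (state, info) and
    env.step(a) -> (next_state, reward, terminated, truncated, info).
    """
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # ordinary temporal difference toward the greedy bootstrap target
            target = reward + (0.0 if terminated else gamma * np.max(Q[next_state]))
            delta = target - Q[state, action]
            # risk-sensitive step: transform the TD error, not the return
            Q[state, action] += alpha * transform_td(delta, kappa)
            state = next_state
    return Q
```

Setting kappa = 0 recovers ordinary risk-neutral Q-learning, while kappa > 0 weights negative surprises more heavily than positive ones, biasing the learned values and policy toward risk-averse behaviour.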

Keywords: reinforcement learning, risk-sensitive control, temporal differences, dynamic programming, Bellman's equation

Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Oliver Mihatsch (1)
  • Ralph Neuneier (1)

  1. Corporate Technology, Information and Communications 4, Siemens AG, Munich, Germany
