Abstract
Most reinforcement learning algorithms optimize the expected return of a Markov decision process. Practice has taught us, however, that this criterion is not always the most suitable: many applications require robust control strategies that also take the variance of the return into account. The classical control literature offers several techniques for risk-sensitive optimization goals, such as the worst-case optimality criterion, which focuses exclusively on risk-avoiding policies, and classical risk-sensitive control, which transforms the returns by exponential utility functions. While the first approach is typically too restrictive, the latter suffers from the absence of an obvious way to design a corresponding model-free reinforcement learning algorithm.
Our risk-sensitive reinforcement learning algorithm is based on a very different philosophy: instead of transforming the return of the process, we transform the temporal differences during learning. While our approach reflects important properties of the classical exponential utility framework, it avoids its serious drawbacks for learning. Based on an extended set of optimality equations, we are able to formulate risk-sensitive versions of various well-known reinforcement learning algorithms that converge with probability one under the usual conditions.
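The idea of transforming temporal differences rather than returns can be sketched in a few lines. The following is a minimal illustration, not the authors' exact algorithm: it assumes an asymmetric weighting of the TD error of the form (1 − κ)·δ for positive δ and (1 + κ)·δ for negative δ, with κ ∈ (−1, 1). With κ = 0 the update reduces to ordinary Q-learning; κ > 0 overweights negative surprises, biasing the learned values toward risk-averse behavior.

```python
def chi(delta, kappa):
    """Asymmetric transformation of the temporal difference delta.

    kappa in (-1, 1): kappa > 0 amplifies negative TD errors (risk-averse),
    kappa < 0 amplifies positive ones (risk-seeking), kappa = 0 is neutral.
    """
    return (1.0 - kappa) * delta if delta > 0 else (1.0 + kappa) * delta


def risk_sensitive_q_update(Q, s, a, r, s_next, actions,
                            alpha=0.1, gamma=0.95, kappa=0.5):
    """One risk-sensitive Q-learning step on a tabular Q (dict keyed by (s, a)).

    The TD error is computed as in ordinary Q-learning, but the update
    applies the transformed error chi(delta, kappa) instead of delta.
    """
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    delta = r + gamma * best_next - Q.get((s, a), 0.0)  # ordinary TD error
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * chi(delta, kappa)
    return Q
```

All function and parameter names here are illustrative; the sketch only conveys how a risk-sensitivity parameter can enter the learning rule through the temporal difference itself while leaving the rest of the algorithm untouched.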
Mihatsch, O., Neuneier, R. Risk-Sensitive Reinforcement Learning. Machine Learning 49, 267–290 (2002). https://doi.org/10.1023/A:1017940631555