Abstract
Most reinforcement learning algorithms optimize the expected return of a Markov decision process. Practice has taught us, however, that this criterion is not always the most suitable: many applications require robust control strategies that also take the variance of the return into account. The classical control literature offers several techniques for risk-sensitive optimization goals, such as the worst-case optimality criterion, which focuses exclusively on risk-avoiding policies, and classical risk-sensitive control, which transforms the returns by exponential utility functions. While the first approach is typically too restrictive, the latter suffers from the absence of an obvious way to design a corresponding model-free reinforcement learning algorithm.
Our risk-sensitive reinforcement learning algorithm is based on a very different philosophy: instead of transforming the return of the process, we transform the temporal differences during learning. While our approach reflects important properties of the classical exponential utility framework, it avoids its serious drawbacks for learning. Based on an extended set of optimality equations, we are able to formulate risk-sensitive versions of various well-known reinforcement learning algorithms that converge with probability one under the usual conditions.
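The idea of transforming temporal differences rather than returns can be sketched in a few lines. The following is a minimal illustration, not the authors' exact algorithm: it assumes an asymmetric weighting of the TD error of the form (1 − κ)·δ for positive δ and (1 + κ)·δ for negative δ, with κ ∈ (−1, 1). With κ = 0 the update reduces to ordinary Q-learning; κ > 0 overweights negative surprises, biasing the learned values toward risk-averse behavior.

```python
def chi(delta, kappa):
    """Asymmetric transformation of the temporal difference delta.

    kappa in (-1, 1): kappa > 0 amplifies negative TD errors (risk-averse),
    kappa < 0 amplifies positive ones (risk-seeking), kappa = 0 is neutral.
    """
    return (1.0 - kappa) * delta if delta > 0 else (1.0 + kappa) * delta


def risk_sensitive_q_update(Q, s, a, r, s_next, actions,
                            alpha=0.1, gamma=0.95, kappa=0.5):
    """One risk-sensitive Q-learning step on a tabular Q (dict keyed by (s, a)).

    The TD error is computed as in ordinary Q-learning, but the update
    applies the transformed error chi(delta, kappa) instead of delta.
    """
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    delta = r + gamma * best_next - Q.get((s, a), 0.0)  # ordinary TD error
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * chi(delta, kappa)
    return Q
```

All function and parameter names here are illustrative; the sketch only conveys how a risk-sensitivity parameter can enter the learning rule through the temporal difference itself while leaving the rest of the algorithm untouched.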
Mihatsch, O., Neuneier, R. Risk-Sensitive Reinforcement Learning. Machine Learning 49, 267–290 (2002). https://doi.org/10.1023/A:1017940631555