Risk-Sensitivity in Simulation Based Online Planning

  • Kyrill Schmid
  • Lenz Belzner
  • Marie Kiermeier
  • Alexander Neitz
  • Thomy Phan
  • Thomas Gabor
  • Claudia Linnhoff
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11117)


Making decisions under risk is a competence that human beings naturally display when confronted with new and potentially dangerous learning tasks. In an effort to replicate this ability, many approaches have been proposed in different fields of artificial learning and planning. To plan in domains with inherent risk when a simulation model is available, we propose Risk-Sensitive Online Planning (RISEON), which extends traditional online planning with a risk-aware optimization objective. The objective we use is Conditional Value at Risk (CVaR), where risk-sensitivity can be controlled by setting the quantile size to fit a given risk level. By using CVaR, the planner shifts its focus from risk-neutral sample means towards the tail of the loss distribution and thus considers an adjustable share of high costs. We evaluate RISEON in a smart grid planning scenario and in a continuous control task, where the planner has to steer a vehicle towards risky checkboxes, and empirically show that the proposed algorithm can be used to plan with respect to risk-sensitivity.
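
To make the role of the quantile size concrete, the following is a minimal sketch of an empirical CVaR estimate over sampled losses, assuming the simple "mean of the worst share of samples" estimator. It is not the paper's implementation: the names `empirical_cvar`, `pick_plan`, `alpha`, and the two loss lists are hypothetical and chosen only for illustration.

```python
import math


def empirical_cvar(losses, alpha):
    """Mean of the worst `alpha` share of sampled losses, alpha in (0, 1].

    alpha = 1.0 reduces to the plain sample mean (risk-neutral);
    smaller alpha looks only at the high-loss tail.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if not losses:
        raise ValueError("need at least one sampled loss")
    # Number of tail samples to average, at least one.
    k = max(1, math.ceil(alpha * len(losses)))
    tail = sorted(losses, reverse=True)[:k]
    return sum(tail) / k


def pick_plan(plans, alpha):
    """Return the name of the plan with the lowest empirical CVaR."""
    return min(plans, key=lambda name: empirical_cvar(plans[name], alpha))


# Hypothetical losses, standing in for returns sampled from a simulation model.
plans = {
    "safe": [4.0, 5.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0, 5.0],
    "risky": [1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 20.0],
}

risk_neutral_choice = pick_plan(plans, alpha=1.0)  # compares sample means
risk_averse_choice = pick_plan(plans, alpha=0.2)   # compares the worst 20%
```

With these made-up samples, the risky plan has the lower mean loss but the heavier tail, so the preferred plan changes once `alpha` is reduced from 1.0 to 0.2.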


Online planning, Risk-sensitivity, Local planning



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Kyrill Schmid (1)
  • Lenz Belzner (1)
  • Marie Kiermeier (1)
  • Alexander Neitz (2)
  • Thomy Phan (1)
  • Thomas Gabor (1)
  • Claudia Linnhoff (1)

  1. Mobile and Distributed Systems Group, LMU Munich, Munich, Germany
  2. Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany
