Solving Markov decision processes with downside risk adjustment

  • Research Article
  • Published:
International Journal of Automation and Computing

Abstract

Markov decision processes (MDPs) and their variants are widely studied in the theory of controls for stochastic discrete-event systems driven by Markov chains. Much of the literature focuses on the risk-neutral criterion, in which the expected rewards, either average or discounted, are maximized. Some literature on MDPs does take risk into account; much of it addresses the exponential utility (EU) function and mechanisms that penalize different forms of variance of the rewards. EU functions have some numerical deficiencies, while variance measures variability both above and below the mean reward; variability above the mean is usually beneficial and should not be penalized or avoided. As such, risk metrics that account for pre-specified targets (thresholds) for rewards have been considered in the literature, where the goal is to penalize the risk of revenues falling below those targets. Existing work on MDPs that takes targets into account seeks to minimize risks of this nature. Minimizing such risks, however, can lead to poor solutions in which the risk is zero or near zero but the average rewards are also rather low. Hence, in this paper we study a risk-averse criterion, in particular the so-called downside risk, which equals the probability of the revenues falling below a given target; in contrast to minimizing this risk, we only reduce it at the cost of slightly lowered average rewards. A solution in which the risk is low and the average reward is high, although not at its maximum attainable value, is very attractive in practice. More specifically, in our formulation the objective function is the expected value of the rewards minus a scalar times the downside risk. In this setting, we analyze the infinite horizon MDP, the finite horizon MDP, and the infinite horizon semi-MDP (SMDP). We develop dynamic programming and reinforcement learning algorithms for the finite and infinite horizon cases. The algorithms are tested in numerical studies and show encouraging performance.
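To fix ideas, a minimal formalization of the objective described above follows; the notation is chosen here for illustration and is not necessarily the paper's own. For a policy \(\pi\) with random cumulative reward \(R_{\pi}\), a pre-specified revenue target \(\tau\), and a risk-aversion scalar \(\theta > 0\), the downside-risk-adjusted criterion is

\[
\max_{\pi} \; \mathbb{E}\left[ R_{\pi} \right] \;-\; \theta \, \Pr\left( R_{\pi} < \tau \right),
\]

where \(\Pr(R_{\pi} < \tau)\) is the downside risk, i.e., the probability that the revenues fall below the target \(\tau\). Setting \(\theta = 0\) recovers the risk-neutral criterion, while larger values of \(\theta\) trade a small amount of expected reward for a lower probability of missing the target.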


Author information

Corresponding author

Correspondence to Abhijit Gosavi.

Additional information

Recommended by Guest Editor Dong-Ling Xu

Abhijit Gosavi received the B.Eng. degree in mechanical engineering from Jadavpur University, India in 1992, and the M.Eng. degree in mechanical engineering from the Indian Institute of Technology, Madras, India in 1995. He received the Ph.D. degree in industrial engineering from the University of South Florida in 1999. Currently, he is an associate professor in the Department of Engineering Management and Systems Engineering at Missouri University of Science and Technology. He has published more than 60 refereed journal and conference papers. He has received research funding awards from the National Science Foundation of the United States of America and numerous other agencies. He is a member of IIE, ASEM, and INFORMS.

His research interests include simulation-based optimization, Markov decision processes, productive maintenance, and revenue management.

ORCID ID: 0000-0002-9703-4076

Anish Parulekar received the B.Eng. degree in mechanical engineering from the University of Mumbai, India in 2004, and the M.Sc. degree in industrial engineering from the Department of Industrial and Systems Engineering at the University at Buffalo, The State University of New York, USA in 2006. He currently serves as a deputy vice president and the head of Marketing Analytics at Axis Bank, Mumbai, India.

His research interests include risk, computing, and Markov control.

About this article

Cite this article

Gosavi, A., Parulekar, A. Solving Markov decision processes with downside risk adjustment. Int. J. Autom. Comput. 13, 235–245 (2016). https://doi.org/10.1007/s11633-016-1005-3
