Solving Markov decision processes with downside risk adjustment

  • Research Article
  • Published:
International Journal of Automation and Computing

Abstract

Markov decision processes (MDPs) and their variants are widely studied in the theory of controls for stochastic discrete-event systems driven by Markov chains. Much of the literature focuses on the risk-neutral criterion, in which the expected rewards, either average or discounted, are maximized. Some literature on MDPs does take risk into account; much of it addresses the exponential utility (EU) function and mechanisms that penalize different forms of variance of the rewards. EU functions have some numerical deficiencies, while variance measures variability both above and below the mean reward; variability above the mean is usually beneficial and should not be penalized or avoided. As such, risk metrics that account for pre-specified targets (thresholds) for rewards have been considered in the literature, where the goal is to penalize the risk of revenues falling below those targets. Existing work on MDPs that takes targets into account seeks to minimize risks of this nature. Minimizing such risks, however, can lead to poor solutions in which the risk is zero or near zero but the average rewards are also rather low. Hence, in this paper we study a risk-averse criterion, in particular the so-called downside risk, which equals the probability of the revenues falling below a given target; in contrast to minimizing this risk, we only reduce it at the cost of slightly lowered average rewards. A solution in which the risk is low and the average reward is high, although not at its maximum attainable value, is very attractive in practice. More specifically, in our formulation the objective function is the expected value of the rewards minus a scalar times the downside risk. In this setting, we analyze the infinite horizon MDP, the finite horizon MDP, and the infinite horizon semi-MDP (SMDP). We develop dynamic programming and reinforcement learning algorithms for the finite and infinite horizon cases. The algorithms are tested in numerical studies and show encouraging performance.
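To fix ideas, a minimal formalization of the objective described above follows; the notation is chosen here for illustration and is not necessarily the paper's own. For a policy \(\pi\) with random cumulative reward \(R_{\pi}\), a pre-specified revenue target \(\tau\), and a risk-aversion scalar \(\theta > 0\), the downside-risk-adjusted criterion is

\[
\max_{\pi} \; \mathbb{E}\left[ R_{\pi} \right] \;-\; \theta \, \Pr\left( R_{\pi} < \tau \right),
\]

where \(\Pr(R_{\pi} < \tau)\) is the downside risk, i.e., the probability that the revenues fall below the target \(\tau\). Setting \(\theta = 0\) recovers the risk-neutral criterion, while larger values of \(\theta\) trade a small amount of expected reward for a lower probability of missing the target.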


Author information

Corresponding author

Correspondence to Abhijit Gosavi.

Additional information

Recommended by Guest Editor Dong-Ling Xu

Abhijit Gosavi received the B.Eng. degree in mechanical engineering from Jadavpur University, India in 1992, and the M.Eng. degree in mechanical engineering from the Indian Institute of Technology, Madras, India in 1995. He received the Ph.D. degree in industrial engineering from the University of South Florida in 1999. Currently, he is an associate professor in the Department of Engineering Management and Systems Engineering at Missouri University of Science and Technology. He has published more than 60 refereed journal and conference papers. He has received research funding awards from the National Science Foundation of the United States of America and numerous other agencies. He is a member of IIE, ASEM, and INFORMS.

His research interests include simulation-based optimization, Markov decision processes, productive maintenance, and revenue management.

ORCID ID: 0000-0002-9703-4076

Anish Parulekar received the B.Eng. degree in mechanical engineering from the University of Mumbai, India in 2004, and the M.Sc. degree in industrial engineering from the Department of Industrial and Systems Engineering at the University at Buffalo, The State University of New York, USA in 2006. He currently serves as a deputy vice president and the head of Marketing Analytics at Axis Bank, Mumbai, India.

His research interests include risk, computing, and Markov control.

About this article

Cite this article

Gosavi, A., Parulekar, A. Solving Markov decision processes with downside risk adjustment. Int. J. Autom. Comput. 13, 235–245 (2016). https://doi.org/10.1007/s11633-016-1005-3
