Abstract
We present a Reinforcement Learning (RL) algorithm based on policy iteration for solving average reward Markov and semi-Markov decision problems. In the literature on discounted reward RL, algorithms based on policy iteration and actor-critic algorithms have appeared. Our algorithm is an asynchronous, model-free algorithm (which can be used on large-scale problems) that hinges on the idea of computing the value function of a given policy and searching over policy space. In the applied operations research community, RL has been used to derive good solutions to problems previously considered intractable. Hence, in this paper, we test the proposed algorithm on a commercially significant case study drawn from a real-world problem in the airline industry: yield management, which has been hailed as the key to generating profits in that industry. In the experiments conducted, we combine our algorithm with a nearest-neighbor approach to tackle a large state space. We also present a convergence analysis of the algorithm via an ordinary differential equation method.
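The paper's algorithm itself is not reproduced on this page, but the general idea it builds on, alternating simulation-based evaluation of a fixed policy with greedy policy improvement under an average-reward criterion, can be illustrated with a minimal sketch. The toy semi-Markov decision problem below (its states, actions, rewards, and sojourn times are invented purely for illustration, and the update rules are a generic average-reward TD-style scheme, not the paper's specific algorithm) shows the structure of such a loop; the nearest-neighbor state aggregation used in the paper's experiments is omitted for brevity.

```python
import random
from collections import defaultdict

# Hypothetical toy SMDP with 2 states and 2 actions per state.
# Transition probabilities, rewards, and sojourn times are made up.
P = {  # P[s][a] = list of (next_state, probability)
    0: {0: [(0, 0.7), (1, 0.3)], 1: [(0, 0.4), (1, 0.6)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(0, 0.8), (1, 0.2)]},
}
R = {0: {0: 6.0, 1: 4.0}, 1: {0: -3.0, 1: 5.0}}   # immediate rewards
T = {0: {0: 1.0, 1: 2.0}, 1: {0: 1.0, 1: 1.5}}    # expected sojourn times


def step(s, a):
    """Simulate one transition of the toy SMDP."""
    r, cum = random.random(), 0.0
    for s_next, p in P[s][a]:
        cum += p
        if r <= cum:
            return s_next, R[s][a], T[s][a]
    return P[s][a][-1][0], R[s][a], T[s][a]


def evaluate(policy, steps=20000, alpha=0.01, beta=0.01):
    """Simulation-based evaluation of a fixed policy: learn Q-values of
    all actions while mostly following `policy`, together with an
    estimate rho of its average reward per unit time."""
    Q = defaultdict(float)
    rho, s = 0.0, 0
    for _ in range(steps):
        # occasionally probe off-policy actions so their Q-values get updated
        a = policy[s] if random.random() > 0.1 else random.randrange(2)
        s_next, r, t = step(s, a)
        target = r - rho * t + Q[(s_next, policy[s_next])]
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        if a == policy[s]:            # update rho only on on-policy moves
            rho += beta * ((r - rho * t) / t)
        s = s_next
    return Q, rho


def improve(Q):
    """Greedy policy improvement with respect to the learned Q-values."""
    return {s: max(range(2), key=lambda a: Q[(s, a)]) for s in (0, 1)}


policy = {0: 0, 1: 0}
for it in range(5):                   # alternate evaluation and improvement
    Q, rho = evaluate(policy)
    policy = improve(Q)
    print(f"iteration {it}: policy={policy}, avg reward ~ {rho:.2f}")
```

In a large-scale application such as the yield-management case study, the tabular Q-values above would be replaced by some form of function approximation or state aggregation (the paper uses a nearest-neighbor approach), since enumerating the full state space is impractical.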
Cite this article
Gosavi, A. A Reinforcement Learning Algorithm Based on Policy Iteration for Average Reward: Empirical Results with Yield Management and Convergence Analysis. Machine Learning 55, 5–29 (2004). https://doi.org/10.1023/B:MACH.0000019802.64038.6c