Abstract
We present a Reinforcement Learning (RL) algorithm based on policy iteration for solving average reward Markov and semi-Markov decision problems. In the literature on discounted reward RL, algorithms based on policy iteration and actor-critic algorithms have appeared. Our algorithm is an asynchronous, model-free algorithm (which can be used on large-scale problems) that hinges on the idea of computing the value function of a given policy and searching over policy space. In the applied operations research community, RL has been used to derive good solutions to problems previously considered intractable. Hence, in this paper, we test the proposed algorithm on a commercially significant case study drawn from a real-world problem in the airline industry: yield management, which has been hailed as the key to generating profits in that industry. In the experiments conducted, we combine our algorithm with a nearest-neighbor approach to tackle a large state space. We also present a convergence analysis of the algorithm via an ordinary differential equation method.
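The paper's algorithm itself is not reproduced on this page, but the general idea it builds on, alternating simulation-based evaluation of a fixed policy with greedy policy improvement under an average-reward criterion, can be illustrated with a minimal sketch. The toy semi-Markov decision problem below (its states, actions, rewards, and sojourn times are invented purely for illustration, and the update rules are a generic average-reward TD-style scheme, not the paper's specific algorithm) shows the structure of such a loop; the nearest-neighbor state aggregation used in the paper's experiments is omitted for brevity.

```python
import random
from collections import defaultdict

# Hypothetical toy SMDP with 2 states and 2 actions per state.
# Transition probabilities, rewards, and sojourn times are made up.
P = {  # P[s][a] = list of (next_state, probability)
    0: {0: [(0, 0.7), (1, 0.3)], 1: [(0, 0.4), (1, 0.6)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(0, 0.8), (1, 0.2)]},
}
R = {0: {0: 6.0, 1: 4.0}, 1: {0: -3.0, 1: 5.0}}   # immediate rewards
T = {0: {0: 1.0, 1: 2.0}, 1: {0: 1.0, 1: 1.5}}    # expected sojourn times


def step(s, a):
    """Simulate one transition of the toy SMDP."""
    r, cum = random.random(), 0.0
    for s_next, p in P[s][a]:
        cum += p
        if r <= cum:
            return s_next, R[s][a], T[s][a]
    return P[s][a][-1][0], R[s][a], T[s][a]


def evaluate(policy, steps=20000, alpha=0.01, beta=0.01):
    """Simulation-based evaluation of a fixed policy: learn Q-values of
    all actions while mostly following `policy`, together with an
    estimate rho of its average reward per unit time."""
    Q = defaultdict(float)
    rho, s = 0.0, 0
    for _ in range(steps):
        # occasionally probe off-policy actions so their Q-values get updated
        a = policy[s] if random.random() > 0.1 else random.randrange(2)
        s_next, r, t = step(s, a)
        target = r - rho * t + Q[(s_next, policy[s_next])]
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        if a == policy[s]:            # update rho only on on-policy moves
            rho += beta * ((r - rho * t) / t)
        s = s_next
    return Q, rho


def improve(Q):
    """Greedy policy improvement with respect to the learned Q-values."""
    return {s: max(range(2), key=lambda a: Q[(s, a)]) for s in (0, 1)}


policy = {0: 0, 1: 0}
for it in range(5):                   # alternate evaluation and improvement
    Q, rho = evaluate(policy)
    policy = improve(Q)
    print(f"iteration {it}: policy={policy}, avg reward ~ {rho:.2f}")
```

In a large-scale application such as the yield-management case study, the tabular Q-values above would be replaced by some form of function approximation or state aggregation (the paper uses a nearest-neighbor approach), since enumerating the full state space is impractical.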
Cite this article
Gosavi, A. A Reinforcement Learning Algorithm Based on Policy Iteration for Average Reward: Empirical Results with Yield Management and Convergence Analysis. Machine Learning 55, 5–29 (2004). https://doi.org/10.1023/B:MACH.0000019802.64038.6c