
Annals of Operations Research, Volume 241, Issue 1–2, pp 319–356

Perspectives of approximate dynamic programming

  • Warren B. Powell

Abstract

Approximate dynamic programming has evolved, initially independently, within operations research, computer science and the engineering controls community, all searching for practical tools for solving sequential stochastic optimization problems. More so than other communities, operations research continued to develop the theory behind the basic model introduced by Bellman with discrete states and actions, even while authors as early as Bellman himself recognized its limits due to the “curse of dimensionality” inherent in discrete state spaces. In response to these limitations, subcommunities in computer science, control theory and operations research have developed a variety of methods for solving different classes of stochastic, dynamic optimization problems, creating the appearance of a jungle of competing approaches. In this article, we show that there is actually a common theme to these strategies, and that underpinning the entire field remain the fundamental algorithmic strategies of value and policy iteration that were first introduced in the 1950s and 1960s.
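To make the connection concrete, the sketch below (illustrative only, not code from the article) runs classical value iteration on a tiny, hypothetical two-state, two-action Markov decision process. The Bellman backup inside the loop is the recursion the abstract refers to; approximate dynamic programming arises when the state space is too large to enumerate, so the exact table V and the loop over all states are replaced by sampled states and a statistical approximation of the value function. All numbers and names in the example are assumptions made for illustration.

```python
# Minimal sketch (illustrative only): classical value iteration on a
# hypothetical two-state, two-action MDP. The max/expectation backup inside
# the loop is the Bellman recursion discussed in the abstract.
import numpy as np

gamma = 0.9                       # discount factor

# P[a, s, s'] = probability of moving from state s to s' under action a
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.6, 0.4]]])
# R[a, s] = expected one-period reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

V = np.zeros(2)                   # value function, one entry per state
for _ in range(1000):
    Q = R + gamma * (P @ V)       # Q[a, s] = R(s, a) + gamma * E[V(s') | s, a]
    V_new = Q.max(axis=0)         # Bellman backup: optimize over actions
    if np.abs(V_new - V).max() < 1e-8:
        V = V_new
        break                     # (near) fixed point of the Bellman operator
    V = V_new

policy = (R + gamma * (P @ V)).argmax(axis=0)   # greedy policy induced by V
print("V* ~", V, "greedy actions:", policy)
```

Policy iteration alternates the same backup with a policy-evaluation step; the approximate methods surveyed in the article differ mainly in how they approximate the value function and how they sample states rather than enumerating them.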

Keywords

Reinforcement Learning · Markov Decision Process · Policy Iteration · Approximate Dynamic Programming · Heuristic Dynamic Programming

References

  1. Barto, A. G., Sutton, R. S., & Brouwer, P. (1981). Associative search network: A reinforcement learning associative memory. Biological Cybernetics, 40(3), 201–211.
  2. Barto, A., Sutton, R. S., & Anderson, C. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13(5), 834–846.
  3. Bellman, R. E. (1957). Dynamic programming. Princeton: Princeton University Press.
  4. Bellman, R. E. (1971). Introduction to the mathematical theory of control processes (Vol. II). New York: Academic Press.
  5. Bellman, R. E., & Dreyfus, S. (1959). Functional approximations and dynamic programming. Mathematical Tables and Other Aids to Computation, 13, 247–251.
  6. Bertsekas, D. P. (2011a). Approximate dynamic programming. In Dynamic programming and optimal control (Vol. II, 3rd ed., Chap. 6). Belmont: Athena Scientific.
  7. Bertsekas, D. P. (2011b). Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3), 310–335.
  8. Bertsekas, D. P., & Castanon, D. A. (1999). Rollout algorithms for stochastic scheduling problems. Journal of Heuristics, 5, 89–108.
  9. Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont: Athena Scientific.
  10. Birge, J. R., & Louveaux, F. (1997). Introduction to stochastic programming. New York: Springer.
  11. Boesel, J., Nelson, B., & Kim, S. (2003). Using ranking and selection to “clean up” after simulation optimization. Operations Research, 51(5), 814–825.
  12. Bradtke, S. J., & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1), 33–57.
  13. Burnetas, A., & Katehakis, M. N. (1997). Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1), 222–225.
  14. Cheung, R. K.-M., & Powell, W. B. (1996). An algorithm for multistage dynamic networks with random arc capacities, with an application to dynamic fleet management. Operations Research, 44, 951–963.
  15. Chick, S. E., & Gans, N. (2009). Economic analysis of simulation selection problems. Management Science, 55(3), 421–437.
  16. Dantzig, G. (1955). Linear programming under uncertainty. Management Science, 1, 197–206.
  17. Dantzig, G., & Ferguson, A. (1956). The allocation of aircraft to routes: An example of linear programming under uncertain demand. Management Science, 3, 45–73.
  18. Denardo, E. V. (1982). Dynamic programming. Englewood Cliffs: Prentice-Hall.
  19. Derman, C. (1962). On sequential decisions and Markov chains. Management Science, 9(1), 16–24.
  20. Derman, C. (1966). Denumerable state Markovian decision processes—average cost criterion. Annals of Mathematical Statistics, 37(6), 1545–1553.
  21. Derman, C. (1970). Finite state Markovian decision processes. New York: Academic Press.
  22. Dreyfus, S., & Law, A. M. (1977). The art and theory of dynamic programming. New York: Academic Press.
  23. Dupaçová, J., Consigli, G., & Wallace, S. W. (2000). Scenarios for multistage stochastic programs. Annals of Operations Research, 100, 25–53.
  24. Dupacova, J. (1995). Multistage stochastic programs—the state of the art and selected bibliography. Kybernetica, 31, 151–174.
  25. Dynkin, E. B., & Yushkevich, A. A. (1979). Controlled Markov processes. Grundlehren der mathematischen Wissenschaften (A series of comprehensive studies in mathematics), Vol. 235. New York: Springer.
  26. Enders, J., Powell, W. B., & Egan, D. M. (2010). Robust policies for the transformer acquisition and allocation problem. Energy Systems, 1(3), 245–272.
  27. Frazier, P. I., Powell, W. B., & Dayanik, S. (2008). A knowledge gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5), 2410–2439.
  28. Frazier, P. I., Powell, W. B., & Dayanik, S. (2009). The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4), 599–613.
  29. George, A., Powell, W. B., & Kulkarni, S. (2008). Value function approximation using multiple aggregation for multiattribute resource management. Journal of Machine Learning Research, 9, 2079–2111.
  30. Gittins, J., Glazebrook, K., & Weber, R. R. (2011). Multi-armed bandit allocation indices. New York: Wiley.
  31. Growe-Kuska, N., Heitsch, H., & Romisch, W. (2003). Scenario reduction and scenario tree construction for power management problems. In A. Borghetti, C. A. Nucci, & M. Paolone (Eds.), IEEE Bologna power tech proceedings.
  32. Gupta, S., & Miescke, K. (1996). Bayesian look ahead one-stage sampling allocations for selection of the best population. Journal of Statistical Planning and Inference, 54(2), 229–244.
  33. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference and prediction. New York: Springer.
  34. Haykin, S. (1999). Neural networks: A comprehensive foundation. New York: Prentice Hall.
  35. Heyman, D. P., & Sobel, M. (1984). Stochastic models in operations research. Stochastic optimization (Vol. II). New York: McGraw-Hill.
  36. Higle, J., & Sen, S. (1996). Stochastic decomposition: A statistical method for large scale stochastic linear programming. Dordrecht: Kluwer Academic.
  37. Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge: MIT Press.
  38. Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), 1185–1201.
  39. Judd, K. L. (1998). Numerical methods in economics. Cambridge: MIT Press.
  40. Kall, P., & Wallace, S. (1994). Stochastic programming. New York: Wiley.
  41. Katehakis, M. N., & Derman, C. (1986). Computing optimal sequential allocation rules in clinical trials. In Lecture notes monograph series (Vol. 8, pp. 29–39).
  42. Katehakis, M. N., & Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92, 8584–8585.
  43. Katehakis, M. N., & Veinott, A. F. (1987). The multi-armed bandit problem: Decomposition and computation. Mathematics of Operations Research, 12(2), 262–268.
  44. Kaut, M., & Wallace, S. W. (2003). Evaluation of scenario-generation methods for stochastic programming. Stochastic Programming E-Print Series.
  45. Kushner, H. J., & Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications. Berlin: Springer.
  46. Law, A., & Kelton, W. (1991). Simulation modeling and analysis (2nd ed.). New York: McGraw-Hill.
  47. Lewis, F., Jagannathan, S., & Yesildirek, A. (1999). Neural network control of robot manipulators and nonlinear systems. New York: CRC Press.
  48. Lewis, F. L., & Syrmos, V. L. (1995). Optimal control. Hoboken: Wiley-Interscience.
  49. Lewis, F. L., & Vrabie, D. (2009). Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine, 9(3), 32–50.
  50. Maei, H. R., Szepesvari, C., Bhatnagar, S., & Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In ICML-2010.
  51. Negoescu, D. M., Frazier, P. I., & Powell, W. B. (2011). The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 23(3), 346–363.
  52. Nemhauser, G. L. (1966). Introduction to dynamic programming. New York: Wiley.
  53. Powell, W., & Ryzhov, I. (2012). Optimal learning. Hoboken: Wiley.
  54. Powell, W. B. (1987). An operational planning model for the dynamic vehicle allocation problem with uncertain demands. Transportation Research, 21B, 217–232.
  55. Powell, W. B. (2007). Approximate dynamic programming: Solving the curses of dimensionality. Hoboken: Wiley.
  56. Powell, W. B. (2010). Merging AI and OR to solve high-dimensional stochastic optimization problems using approximate dynamic programming. INFORMS Journal on Computing, 22(1), 2–17.
  57. Powell, W. B. (2011). Approximate dynamic programming: Solving the curses of dimensionality (2nd ed.). Hoboken: Wiley.
  58. Powell, W. B., & Frantzeskakis, L. F. (1990). A successive linear approximation procedure for stochastic dynamic vehicle allocation problems. Transportation Science, 24, 40–57.
  59. Powell, W. B., & Godfrey, G. (2002). An adaptive dynamic programming algorithm for dynamic fleet management, I: Single period travel times. Transportation Science, 36(1), 21–39.
  60. Powell, W. B., & Ma, J. (2011). A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. Journal of Control Theory and Applications, 9(3), 336–352.
  61. Powell, W. B., & Simão, H. (2009). Approximate dynamic programming for management of high-value spare parts. Journal of Manufacturing Technology and Management, 20(2), 147–160.
  62. Powell, W. B., & Topaloglu, H. (2005). Fleet management. In S. Wallace & W. Ziemba (Eds.), SIAM series in optimization. Applications of stochastic programming (pp. 185–216). Philadelphia: Math Programming Society.
  63. Powell, W. B., & Van Roy, B. (2004). Approximate dynamic programming for high dimensional resource allocation problems. In J. Si, A. G. Barto, W. B. Powell, & D. Wunsch II (Eds.), Handbook of learning and approximate dynamic programming. New York: IEEE Press.
  64. Powell, W. B., George, A., Lamont, A., & Stewart, J. (2011). SMART: A stochastic multiscale model for the analysis of energy resources, technology and policy. INFORMS Journal on Computing. http://dx.doi.org/10.1287/ijoc.1110.0470.
  65. Puterman, M. L. (1994). Markov decision processes (1st ed.). Hoboken: Wiley.
  66. Puterman, M. L. (2005). Markov decision processes (2nd ed.). Hoboken: Wiley.
  67. Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22(3), 400–407.
  68. Romisch, W., & Heitsch, H. (2009). Scenario tree modeling for multistage stochastic programs. Mathematical Programming, 118, 371–406.
  69. Ross, S. (1983). Introduction to stochastic dynamic programming. New York: Academic Press.
  70. Ryzhov, I., & Powell, W. B. (2011). Bayesian active learning with basis functions. In 2011 IEEE symposium series on computational intelligence, No. 3. Paris: IEEE Press.
  71. Ryzhov, I., Frazier, P. I., & Powell, W. B. (2012). Stepsize selection for approximate value iteration and a new optimal stepsize rule (Technical report). Department of Operations Research and Financial Engineering, Princeton University.
  72. Ryzhov, I. O., Powell, W. B., & Frazier, P. I. (n.d.). The knowledge gradient algorithm for a general class of online learning problems.
  73. Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 211–229.
  74. Sen, S., & Higle, J. (1999). An introductory tutorial on stochastic linear programming models. Interfaces, 29(2), 33–61.
  75. Si, J., & Wang, Y. T. (2001). Online learning control by association and reinforcement. IEEE Transactions on Neural Networks, 12(2), 264–276.
  76. Si, J., Barto, A. G., Powell, W. B., & Wunsch, D. (2004). Handbook of learning and approximate dynamic programming. New York: Wiley-IEEE Press.
  77. Silver, D. (2009). Reinforcement learning and simulation-based search in computer Go. PhD thesis, University of Alberta.
  78. Simao, H. P., Day, J., George, A. P., Gifford, T., Powell, W. B., & Nienow, J. (2009). An approximate dynamic programming algorithm for large-scale fleet management: A case application. Transportation Science, 43(2), 178–197.
  79. Simao, H. P., George, A., Powell, W. B., Gifford, T., Nienow, J., & Day, J. (2010). Approximate dynamic programming captures fleet operations for Schneider National. Interfaces, 40(5), 1–11.
  80. Spall, J. C. (2003). Introduction to stochastic search and optimization: Estimation, simulation and control. Hoboken: Wiley.
  81. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44.
  82. Sutton, R. S., & Barto, A. G. (1981). Toward a modern theory of adaptive networks. Psychological Review, 88(2), 135–170.
  83. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press.
  84. Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C., & Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th annual international conference on machine learning—ICML ’09 (pp. 1–8). New York: ACM Press.
  85. Sutton, R. S., Szepesvari, C., & Maei, H. (2009b). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in neural information processing systems (Vol. 21, pp. 1609–1616).
  86. Topaloglu, H., & Powell, W. B. (2006). Dynamic programming approximations for stochastic, time-staged integer multicommodity flow problems. INFORMS Journal on Computing, 18, 31–42.
  87. Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16, 185–202.
  88. Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42, 674–690.
  89. Van Roy, B., Bertsekas, D. P., Lee, Y., & Tsitsiklis, J. N. (1997). A neuro-dynamic programming approach to retailer inventory management. In Proceedings of the IEEE conference on decision and control (Vol. 4, pp. 4052–4057).
  90. Venayagamoorthy, G., & Harley, R. (2002). Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator. IEEE Transactions on Neural Networks, 13(3), 764–773.
  91. Wang, F.-Y., Zhang, H., & Liu, D. (2009). Adaptive dynamic programming: An introduction. IEEE Computational Intelligence Magazine, May, 39–47.
  92. Watkins, C. (1989). Learning from delayed rewards. PhD thesis, King's College, Cambridge, England.
  93. Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.
  94. Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.
  95. Werbos, P. J. (1989). Backpropagation and neurocontrol: A review and prospectus. Neural Networks, 209–216.
  96. Werbos, P. J. (1990). Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks, 3, 179–189.
  97. Werbos, P. J. (1992a). Approximate dynamic programming for real-time control and neural modelling. In D. A. White & D. A. Sofge (Eds.), Handbook of intelligent control: Neural, fuzzy, and adaptive approaches.
  98. Werbos, P. J. (1992b). Neurocontrol and supervised learning: An overview and evaluation. In D. A. White & D. A. Sofge (Eds.), Handbook of intelligent control: Neural, fuzzy, and adaptive approaches.
  99. Werbos, P. J., Miller, W. T., & Sutton, R. S. (Eds.) (1990). Neural networks for control. Cambridge: MIT Press.
  100. White, D. J. (1969). Dynamic programming. San Francisco: Holden-Day.
  101. Wu, T., Powell, W. B., & Whisman, A. (2009). The optimizing-simulator: An illustration using the military airlift problem. ACM Transactions on Modeling and Simulation, 19(3), 1–31.

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Department of Operations Research and Financial Engineering, Princeton University, Princeton, USA
