
Discrete Event Dynamic Systems, Volume 13, Issue 1–2, pp. 41–77, 2003

Recent Advances in Hierarchical Reinforcement Learning

  • Andrew G. Barto
  • Sridhar Mahadevan

Abstract

Reinforcement learning is bedeviled by the curse of dimensionality: the number of parameters to be learned grows exponentially with the size of any compact encoding of a state. Recent attempts to combat the curse of dimensionality have turned to principled ways of exploiting temporal abstraction, where decisions are not required at each step, but rather invoke the execution of temporally-extended activities which follow their own policies until termination. This leads naturally to hierarchical control architectures and associated learning algorithms. We review several approaches to temporal abstraction and hierarchical organization that machine learning researchers have recently developed. Common to these approaches is a reliance on the theory of semi-Markov decision processes, which we emphasize in our review. We then discuss extensions of these ideas to concurrent activities, multiagent coordination, and hierarchical memory for addressing partial observability. Concluding remarks address open challenges facing the further development of reinforcement learning in a hierarchical setting.
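The key technical idea behind the approaches surveyed is that executing a temporally-extended activity (an "option") to termination turns the underlying Markov decision process into a semi-Markov decision process, so value-based learning carries over provided the discount is applied across the activity's random duration. As a minimal illustration of that machinery, the sketch below shows SMDP Q-learning over options in the spirit of the options framework of Sutton, Precup, and Singh (1999). It is not code from the paper: the environment interface and the option objects (with their can_start, policy, and terminates methods) are hypothetical stand-ins.

```python
import random
from collections import defaultdict

# Illustrative sketch (not from the paper): SMDP Q-learning over options.
# Each hypothetical option exposes an initiation test can_start(s), an
# internal policy(s), and a termination test terminates(s). We assume at
# least one option can start in every state.

def smdp_q_learning(env, options, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    """Learn option values Q(s, o) on the semi-Markov decision process
    induced by executing each chosen option until it terminates."""
    Q = defaultdict(float)  # maps (state, option index) -> estimated value

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy choice among the options admissible in s.
            admissible = [i for i, o in enumerate(options) if o.can_start(s)]
            if random.random() < eps:
                i = random.choice(admissible)
            else:
                i = max(admissible, key=lambda j: Q[(s, j)])

            # Run the option's internal policy to termination, accumulating
            # the discounted reward and the duration k.
            r_total, k = 0.0, 0
            s_next = s
            while True:
                a = options[i].policy(s_next)
                s_next, r, done = env.step(a)
                r_total += (gamma ** k) * r
                k += 1
                if done or options[i].terminates(s_next):
                    break

            # SMDP Q-learning update: the bootstrap term is discounted by
            # gamma**k because the option lasted a random number of steps k.
            if done:
                target = r_total
            else:
                target = r_total + (gamma ** k) * max(
                    Q[(s_next, j)]
                    for j, o in enumerate(options)
                    if o.can_start(s_next)
                )
            Q[(s, i)] += alpha * (target - Q[(s, i)])
            s = s_next

    return Q
```

When every option lasts exactly one step (k = 1), the update reduces to ordinary one-step Q-learning, which is why the SMDP view is a strict generalization of the flat case.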

Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Andrew G. Barto
  • Sridhar Mahadevan
  1. Autonomous Learning Laboratory, Department of Computer Science, University of Massachusetts, Amherst, USA
