Sequential Decision Making Based on Direct Search

  • Jürgen Schmidhuber
Chapter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1828)

Abstract

The most challenging open issues in sequential decision making include partial observability of the decision maker’s environment, hierarchical and other types of abstract credit assignment, the learning of credit assignment algorithms, and exploration without a priori world models. I will summarize why direct search (DS) in policy space provides a more natural framework for addressing these issues than reinforcement learning (RL) based on value functions and dynamic programming. Then I will point out fundamental drawbacks of traditional DS methods in the case of stochastic environments, stochastic policies, and unknown temporal delays between actions and observable effects. I will discuss a remedy called the success-story algorithm, show how it can outperform traditional DS, and mention a relationship to market models combining certain aspects of DS and traditional RL.
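To make the direct-search idea concrete, here is a minimal sketch, assuming a toy corridor environment and a tabular policy (neither is from the chapter): the policy is mutated at random and kept only when its sampled return improves, so evaluation rests entirely on rollouts rather than on a learned value function. The noisy comparison in the acceptance step illustrates the drawback for stochastic environments mentioned above.

```python
# Minimal, hypothetical sketch of direct search (DS) in policy space:
# stochastic hill-climbing over a tabular policy, judged only by sampled
# episode returns (no value function, no dynamic programming).  The toy
# corridor environment and all constants are illustrative assumptions.
import random

N_STATES, N_ACTIONS = 5, 2   # states 0..4, actions: 0 = left, 1 = right
GOAL = N_STATES - 1

def rollout(policy, max_steps=20):
    """Run one episode with a deterministic tabular policy; return its return."""
    state, ret = 0, 0.0
    for _ in range(max_steps):
        move = 1 if policy[state] == 1 else -1
        if random.random() > 0.8:        # stochastic dynamics: 20% of moves fail
            move = 0
        state = min(max(state + move, 0), N_STATES - 1)
        ret -= 1.0                        # per-step cost
        if state == GOAL:
            return ret + 10.0             # terminal reward
    return ret

def evaluate(policy, n_episodes=30):
    """Noisy policy evaluation by averaging sampled returns."""
    return sum(rollout(policy) for _ in range(n_episodes)) / n_episodes

def direct_search(n_iters=200):
    """Keep a candidate policy only if its *estimated* return improves."""
    best = [random.randrange(N_ACTIONS) for _ in range(N_STATES)]
    best_score = evaluate(best)
    for _ in range(n_iters):
        cand = list(best)
        cand[random.randrange(N_STATES)] = random.randrange(N_ACTIONS)  # mutate
        score = evaluate(cand)
        # Weak point in stochastic settings: a lucky sample can promote a
        # worse policy -- the kind of drawback the abstract points out.
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

if __name__ == "__main__":
    policy, score = direct_search()
    print("best policy:", policy, "estimated return:", round(score, 2))
```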

Keywords

Direct Search, Neural Information Processing System, Sequential Decision, Policy Space, Direct Search Method


Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Jürgen Schmidhuber
    1. IDSIA, Manno-Lugano, Switzerland
