Machine Learning, Volume 28, Issue 1, pp 105–130

Shifting Inductive Bias with Success-Story Algorithm, Adaptive Levin Search, and Incremental Self-Improvement

  • Jürgen Schmidhuber
  • Jieyu Zhao
  • Marco Wiering

Abstract

We study task sequences that allow for speeding up the learner's average reward intake through appropriate shifts of inductive bias (changes of the learner's policy). To evaluate the long-term effects of bias shifts that set the stage for later bias shifts, we use the "success-story algorithm" (SSA). SSA is occasionally called at times that may depend on the policy itself. It uses backtracking to undo those bias shifts that have not been empirically observed to trigger long-term reward accelerations (measured up until the current SSA call). Bias shifts that survive SSA represent a lifelong success history. Until the next SSA call, they are considered useful and form the basis for additional bias shifts. SSA allows for plugging in a wide variety of learning algorithms. We plug in (1) a novel, adaptive extension of Levin search and (2) a method for embedding the learner's policy modification strategy within the policy itself (incremental self-improvement). Our inductive transfer case studies involve complex, partially observable environments where traditional reinforcement learning fails.
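
The abstract only outlines SSA's backtracking mechanism, so the following Python sketch illustrates one way such a check could be organized. It assumes that "long-term reward acceleration" is tested by requiring the average reward per time measured since each surviving bias shift to strictly exceed the average measured since the previous surviving shift, with the oldest shift compared against the lifelong average; the names Checkpoint, ssc_holds, and ssa_call are illustrative and not taken from the paper.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Checkpoint:
        """One surviving bias shift: when it occurred, the cumulative reward
        collected up to that moment, and an action that undoes the shift."""
        time: float                  # lifetime at which the shift was made
        reward: float                # cumulative reward at that moment
        undo: Callable[[], None]     # restores the policy to its prior state

    def ssc_holds(history: List[Checkpoint], now: float, total_reward: float) -> bool:
        """Success-story criterion (assumed form): reward per time since each
        surviving shift must strictly exceed reward per time since the previous
        surviving shift; the oldest shift is compared with the lifelong average.
        Assumes now > 0 and now > cp.time for every checkpoint."""
        prev_rate = total_reward / now                      # lifelong average
        for cp in history:
            rate = (total_reward - cp.reward) / (now - cp.time)
            if rate <= prev_rate:
                return False
            prev_rate = rate
        return True

    def ssa_call(history: List[Checkpoint], now: float, total_reward: float) -> None:
        """Backtrack: keep undoing the most recent bias shift until the
        remaining shifts form a success story (or none are left)."""
        while history and not ssc_holds(history, now, total_reward):
            history.pop().undo()

A caller would push a Checkpoint (including an undo action) whenever the learner modifies its policy, and invoke ssa_call at the policy-dependent checkpoint times; whatever remains on the stack afterwards is the current success history.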

Keywords: inductive bias, reinforcement learning, reward acceleration, Levin search, success-story algorithm, incremental self-improvement

Copyright information

© Kluwer Academic Publishers 1997

Authors and Affiliations

  • Jürgen Schmidhuber (1)
  • Jieyu Zhao (1)
  • Marco Wiering (1)

  1. IDSIA, Lugano, Switzerland
