We study systems of multiple reinforcement learners. Each leads a single life lasting from birth to unknown death. In between, it tries to accelerate its reward intake. Its actions and learning processes consume part of its life: computational resources are limited. The expected reward for a given behavior may change over time, partly because of the other learners' actions and learning processes. For such reasons, previous approaches to multi-agent reinforcement learning are either limited or heuristic by nature. Using a simple backtracking method called the “success-story algorithm”, however, each of our learners can establish a success history of behavior modifications at certain times called evaluation points: it simply undoes all those previous modifications that were not empirically observed to trigger lifelong reward accelerations (computation time for learning and testing is taken into account). It then continues to act and learn until the next evaluation point. Success histories can be enforced despite interference from other learners. The principle allows for plugging in a wide variety of learning algorithms. An experiment illustrates its feasibility.
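The backtracking step at an evaluation point can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stack layout, the `undo` callbacks, and the strict-increase check over reward rates are assumptions chosen to mirror the criterion described in the abstract (each surviving modification must have been followed by a lifelong acceleration of reward intake).

```python
def rates(stack, t, R):
    """Reward-per-time since each still-valid policy modification,
    oldest first, preceded by the lifelong average R(t)/t.
    Each stack entry is (t_i, R_i, undo): time and cumulative reward
    when the modification was made, plus a callback reverting it."""
    rs = [R / t]
    for (t_i, R_i, _undo) in stack:
        rs.append((R - R_i) / (t - t_i))
    return rs

def ssa_evaluate(stack, t, R):
    """At evaluation time t with cumulative reward R(t), pop (and undo)
    the newest modifications until the success-story criterion holds:
    the sequence of reward rates is strictly increasing, i.e. each
    surviving modification empirically accelerated reward intake."""
    while stack:
        rs = rates(stack, t, R)
        if all(a < b for a, b in zip(rs, rs[1:])):
            break  # whole success history is consistent; keep acting
        _, _, undo = stack.pop()
        undo()  # revert the modification that failed the criterion
```

Because learning and testing time are themselves charged against the learner's life, `t` here denotes consumed lifetime, so a modification only survives if it pays for its own computational cost in extra reward.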







Copyright information

© Springer-Verlag Berlin Heidelberg 1997

Authors and Affiliations

  • Jürgen Schmidhuber (1)
  • Jieyu Zhao (1)
  1. IDSIA, Lugano, Switzerland
