Market-Based Reinforcement Learning in Partially Observable Worlds

  • Ivo Kwee
  • Marcus Hutter
  • Jürgen Schmidhuber
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2130)


Unlike traditional reinforcement learning (RL), market-based RL is in principle applicable to worlds described by partially observable Markov Decision Processes (POMDPs), where an agent needs to learn short-term memories of relevant previous events in order to execute optimal actions. Most previous work, however, has focused on reactive settings (MDPs) instead of POMDPs. Here we reimplement a recent approach to market-based RL and for the first time evaluate it in a toy POMDP setting.
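The claim that optimal behaviour in a POMDP can require short-term memory can be illustrated with a minimal sketch. It is not the paper's environment and not the market-based algorithm. It is a hypothetical cue-and-junction task in which the agent sees a cue ("L" or "R") once, walks through a corridor of identical observations, and must then turn the cued way. All names here (`run_episode`, `memory_policy`, `CORRIDOR_LENGTH`, the observation symbols) are assumptions made for illustration.

```python
import itertools

# Hypothetical toy task, not taken from the paper.
CUES = ("L", "R")                    # cue shown only at the first step
OBSERVATIONS = ("L", "R", "C", "J")  # cue L, cue R, corridor, junction
ACTIONS = ("L", "R", "F")            # turn left, turn right, go forward
CORRIDOR_LENGTH = 3                  # assumed number of aliased corridor steps


def run_episode(cue, policy):
    """Return 1 if the action taken at the junction matches the cue, else 0."""
    memory = 0  # a single one-bit memory register
    action = None
    for obs in [cue] + ["C"] * CORRIDOR_LENGTH + ["J"]:
        action, memory = policy(obs, memory)
    return 1 if action == cue else 0


def make_reactive_policy(table):
    """Reactive policy: the action depends on the current observation only."""
    def policy(obs, memory):
        return table[obs], memory  # the register is never written or read
    return policy


def memory_policy(obs, memory):
    """Hand-written policy that stores the cue in the one-bit register."""
    if obs == "L":
        return "F", 0
    if obs == "R":
        return "F", 1
    if obs == "J":
        return ("L" if memory == 0 else "R"), memory
    return "F", memory


def success_rate(policy):
    """Average success over both cues, each taken as equally likely."""
    return sum(run_episode(cue, policy) for cue in CUES) / len(CUES)


# Enumerate every deterministic reactive policy and keep the best score.
best_reactive = max(
    success_rate(make_reactive_policy(dict(zip(OBSERVATIONS, acts))))
    for acts in itertools.product(ACTIONS, repeat=len(OBSERVATIONS))
)
best_with_memory = success_rate(memory_policy)
```

In this sketch every reactive policy must commit to one fixed action at the junction, so none can be right for both cues, whereas the policy with a single memory bit always turns the cued way. The memory policy here is written by hand; learning such memory use is the problem the paper addresses.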


Keywords (added by machine, not by the authors): Reinforcement Learning; Memory Register; Observable Markov Decision Process; Observable Environment; Bucket Brigade





Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Ivo Kwee (1)
  • Marcus Hutter (1)
  • Jürgen Schmidhuber (1)
  1. IDSIA, Manno, Switzerland
