Model-Based Online Learning of POMDPs

  • Guy Shani
  • Ronen I. Brafman
  • Solomon E. Shimony
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3720)


Learning to act in an unknown partially observable domain is a difficult variant of the reinforcement learning paradigm. Research in the area has focused on model-free methods — methods that learn a policy without learning a model of the world. When sensor noise increases, model-free methods provide less accurate policies. The model-based approach — learning a POMDP model of the world, and computing an optimal policy for the learned model — may generate superior results in the presence of sensor noise, but learning and solving a model of the environment is a difficult problem. We have previously shown how such a model can be obtained from the learned policy of model-free methods, but this approach implies a distinction between a learning phase and an acting phase that is undesirable. In this paper we present a novel method for learning a POMDP model online, based on McCallum's Utile Suffix Memory (USM), in conjunction with an approximate policy obtained using an incremental POMDP solver. We show that the incrementally improving policy provides results superior to those of the original USM algorithm, especially in the presence of increasing sensor and action noise.
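The idea underlying USM, identifying latent state with suffixes of the recent action-observation history and accumulating model statistics at those suffixes, can be illustrated with a minimal Python sketch. This is a hypothetical simplification, not the authors' implementation: it uses a fixed suffix length `k` where USM grows a suffix tree adaptively via utility tests, and all class and method names are illustrative. The empirical counts it gathers are the kind of statistics from which a POMDP model's transition and reward tables could be estimated.

```python
from collections import defaultdict


class SuffixMemory:
    """Illustrative USM-style memory (fixed suffix depth, not a real suffix tree).

    The agent's surrogate "state" is the tuple of its last k
    (action, observation) pairs; empirical counts over suffix
    transitions and rewards can seed a learned POMDP model.
    """

    def __init__(self, k=1):
        self.k = k
        self.history = []                     # list of (action, observation, reward)
        self.trans = defaultdict(int)         # (suffix, action, next_suffix) -> count
        self.reward_sum = defaultdict(float)  # (suffix, action) -> total reward
        self.visits = defaultdict(int)        # (suffix, action) -> count

    def suffix(self):
        # Current surrogate state: the last k (action, observation) pairs.
        return tuple((a, o) for a, o, _ in self.history[-self.k:])

    def record(self, action, observation, reward):
        # Log one step of experience and update the suffix-level statistics.
        prev = self.suffix()
        self.history.append((action, observation, reward))
        cur = self.suffix()
        self.trans[(prev, action, cur)] += 1
        self.reward_sum[(prev, action)] += reward
        self.visits[(prev, action)] += 1

    def transition_prob(self, s, a, s2):
        # Maximum-likelihood estimate of P(s2 | s, a) from the counts.
        n = self.visits[(s, a)]
        return self.trans[(s, a, s2)] / n if n else 0.0

    def expected_reward(self, s, a):
        # Empirical mean reward for taking a at suffix s.
        n = self.visits[(s, a)]
        return self.reward_sum[(s, a)] / n if n else 0.0
```

For example, after recording the experience steps `('fwd', 'wall', 0.0)` and `('fwd', 'door', 1.0)` with `k=1`, the memory estimates probability 1.0 for the transition from suffix `(('fwd', 'wall'),)` under action `'fwd'` to suffix `(('fwd', 'door'),)`.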




  1. Bellman, R.E.: Dynamic Programming. Princeton University Press, Princeton (1962)
  2. Bilmes, J.: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-021 (1997)
  3. Cassandra, A.R., Kaelbling, L.P., Littman, M.L.: Acting optimally in partially observable stochastic domains. In: AAAI 1994, pp. 1023–1028 (1994)
  4. Chrisman, L.: Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In: AAAI 1992, pp. 183–188 (1992)
  5. Howard, R.A.: Dynamic Programming and Markov Processes. MIT Press, Cambridge (1960)
  6. Littman, M.L., Cassandra, A.R., Kaelbling, L.P.: Learning policies for partially observable environments: Scaling up. In: ICML 1995 (1995)
  7. McCallum, A.K.: Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester (1996)
  8. Meuleau, N., Peshkin, L., Kim, K., Kaelbling, L.P.: Learning finite-state controllers for partially observable environments. In: UAI 1999, pp. 427–436 (1999)
  9. Nikovski, D.: State-Aggregation Algorithms for Learning Probabilistic Models for Robot Control. PhD thesis, Carnegie Mellon University (2002)
  10. Shani, G., Brafman, R.I.: Resolving perceptual aliasing in the presence of noisy sensors. In: NIPS 17 (2004)
  11. Shani, G., Brafman, R.I., Shimony, S.E.: Partial observability under noisy sensors — from model-free to model-based. In: ICML RRfRL Workshop (2005)
  12. Spaan, M.T.J., Vlassis, N.: Perseus: Randomized point-based value iteration for POMDPs. Technical Report IAS-UVA-04-02, University of Amsterdam (2004)
  13. Wierstra, D., Wiering, M.: Utile distinction hidden Markov models. In: ICML 2004 (July 2004)

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Guy Shani (1)
  • Ronen I. Brafman (1)
  • Solomon E. Shimony (1)
  1. Ben-Gurion University, Beer-Sheva, Israel
