Using Rewards for Belief State Updates in Partially Observable Markov Decision Processes

  • Masoumeh T. Izadi
  • Doina Precup
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3720)

Abstract

Partially Observable Markov Decision Processes (POMDPs) provide a standard framework for sequential decision making in stochastic environments. In this setting, an agent takes actions and receives observations and rewards from the environment. Many POMDP solution methods are based on computing a belief state, a probability distribution over the states the agent could be in. The agent's action choice is then based on the belief state. The belief state is computed from a model of the environment and the history of actions and observations seen by the agent. However, reward information is not taken into account in updating the belief state. In this paper, we argue that rewards can carry useful information that helps disambiguate the hidden state. We present a method for updating the belief state that takes rewards into account. We report experiments with exact and approximate planning methods on several standard POMDP domains, using this belief update method, and show that it can provide advantages both in speed and in the quality of the solution obtained.
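To make the idea concrete, the first function below is a minimal sketch of the standard Bayesian belief update from an action and observation; the second shows one plausible way to additionally condition on the observed reward, in the spirit of the abstract. The reward-likelihood model `R_lik` and the exact conditioning scheme are illustrative assumptions for this sketch, not the paper's formulation.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Standard POMDP belief update after taking action a and seeing observation z.

    b : (S,)      prior belief over states
    T : (A, S, S) transition model, T[a, s, s2] = Pr(s2 | s, a)
    O : (A, S, Z) observation model, O[a, s2, z] = Pr(z | s2, a)
    """
    predicted = b @ T[a]             # Pr(s2 | b, a): push belief through the dynamics
    unnorm = O[a, :, z] * predicted  # weight by observation likelihood
    return unnorm / unnorm.sum()     # renormalize to a probability distribution

def belief_update_with_reward(b, a, z, r, T, O, R_lik):
    """Sketch of a reward-aware update: successor states are also weighted by
    the likelihood of the observed reward r.

    R_lik(a, s2, r) -> Pr(r | s2, a) is a hypothetical reward-likelihood model;
    conditioning on the successor state is an assumption made here for illustration.
    """
    predicted = b @ T[a]
    reward_lik = np.array([R_lik(a, s2, r) for s2 in range(len(b))])
    unnorm = O[a, :, z] * reward_lik * predicted
    return unnorm / unnorm.sum()
```

When the reward depends on the hidden state (and not only on the action), this extra likelihood term can sharpen the belief beyond what the observation alone provides, which is the intuition the abstract appeals to.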


Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Masoumeh T. Izadi (1)
  • Doina Precup (1)
  1. School of Computer Science, McGill University, Montreal, Canada