A survey of point-based POMDP solvers

Abstract

The past decade has seen a significant breakthrough in research on solving partially observable Markov decision processes (POMDPs). Where past solvers could not scale beyond perhaps a dozen states, modern solvers can handle complex domains with many thousands of states. This breakthrough was mainly due to the idea of restricting value function computations to a finite subset of the belief space, permitting only local value updates over this subset. This approach, known as point-based value iteration, avoids the exponential growth of the value function and is thus applicable to domains with longer horizons, even with relatively large state spaces. Many extensions of this basic idea have been suggested, focusing on various aspects of the algorithm, mainly the selection of the belief space subset and the order of value function updates. In this survey, we walk the reader through the fundamentals of point-based value iteration, explaining the main concepts and ideas. We then survey the major extensions to the basic algorithm and discuss their merits. Finally, we include an extensive empirical analysis on well-known benchmarks to shed light on the strengths and limitations of the various approaches.
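
To make the core idea concrete, below is a minimal sketch of a single point-based backup over a flat, tabular POMDP. The array names (T, O, R), the dense NumPy representation, the discount factor gamma, and the helper functions are illustrative assumptions for this sketch, not the implementation used by any of the solvers surveyed here.

```python
import numpy as np

def point_based_backup(b, V, T, O, R, gamma):
    """One point-based Bellman backup at a single belief point.

    b     : belief vector over states, shape (|S|,)
    V     : current value function as a list of alpha-vectors, each shape (|S|,)
    T[a]  : transition matrix, T[a][s, s2] = P(s2 | s, a)
    O[a]  : observation matrix, O[a][s2, z] = P(z | a, s2)
    R[a]  : immediate reward vector, shape (|S|,)
    gamma : discount factor in (0, 1)
    Returns the alpha-vector maximizing the backed-up value at b.
    """
    best_alpha, best_value = None, -np.inf
    num_obs = O[0].shape[1]
    for a in range(len(T)):
        alpha_a = R[a].astype(float).copy()
        for z in range(num_obs):
            # Project every alpha-vector back through action a and observation z,
            # then keep the projection that scores best at this particular belief.
            projections = [T[a] @ (O[a][:, z] * alpha) for alpha in V]
            alpha_a = alpha_a + gamma * max(projections, key=lambda g: float(b @ g))
        value_a = float(b @ alpha_a)
        if value_a > best_value:
            best_alpha, best_value = alpha_a, value_a
    return best_alpha

def pbvi_sweep(B, V, T, O, R, gamma):
    """One sweep of point-based value iteration: back up every belief in B."""
    return [point_based_backup(b, V, T, O, R, gamma) for b in B]
```

Concrete point-based solvers differ mainly in how the belief set B is collected and in the order in which its points are backed up, which is exactly the design space this survey explores.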

Keywords

Partially observable Markov decision processes · Decision-theoretic planning · Reinforcement learning

Copyright information

© The Author(s) 2012

Authors and Affiliations

  1. Information Systems Engineering, Ben Gurion University, Beersheba, Israel
  2. School of Computer Science, McGill University, Montreal, Canada
