Abstract
For reinforcement learning in environments in which an agent has access to a reliable state signal, methods based on the Markov decision process (MDP) have had many successes. In many problem domains, however, an agent suffers from limited sensing capabilities that preclude it from recovering a Markovian state signal from its perceptions. Extending the MDP framework, partially observable Markov decision processes (POMDPs) allow for principled decision making under conditions of uncertain sensing. In this chapter we present the POMDP model by focusing on the differences from fully observable MDPs, and we show how optimal policies for POMDPs can be represented. Next, we review model-based techniques for policy computation, followed by an overview of the available model-free methods for POMDPs. We conclude by highlighting recent trends in POMDP reinforcement learning.
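The principled decision making under uncertain sensing mentioned above rests on maintaining a belief state: a probability distribution over states, updated by Bayes' rule after each action and observation. As a minimal sketch (the function name and the toy two-state model below are illustrative, not from the chapter), a discrete belief update can be written as:

```python
# Hedged sketch of the discrete Bayesian belief update in a POMDP:
#   b'(s') ∝ O(o | a, s') * sum_s T(s' | s, a) * b(s)
# T[a][s][s2] = P(s2 | s, a) and O[a][s2][o] = P(o | a, s2) are assumed
# model layouts chosen for this example only.

def belief_update(b, a, o, T, O):
    """Return the posterior belief after taking action a and observing o."""
    n = len(b)
    # Predict the next-state distribution, then weight by observation likelihood.
    new_b = [O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n))
             for s2 in range(n)]
    norm = sum(new_b)
    if norm == 0.0:
        raise ValueError("observation has zero probability under current belief")
    return [p / norm for p in new_b]

# Tiny two-state, one-action, two-observation example (toy numbers):
T = {0: [[0.9, 0.1], [0.2, 0.8]]}
O = {0: [[0.7, 0.3], [0.1, 0.9]]}
print(belief_update([0.5, 0.5], a=0, o=1, T=T, O=O))
```

Because the belief is a sufficient statistic for the history of actions and observations, planning on beliefs recovers the Markov property that the raw observations lack, which is what the value-function and policy-search methods surveyed in the chapter exploit.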
© 2012 Springer-Verlag Berlin Heidelberg
Spaan, M.T.J. (2012). Partially Observable Markov Decision Processes. In: Wiering, M., van Otterlo, M. (eds) Reinforcement Learning. Adaptation, Learning, and Optimization, vol 12. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27645-3_12
Print ISBN: 978-3-642-27644-6
Online ISBN: 978-3-642-27645-3
eBook Packages: Engineering (R0)