Analyzing and Escaping Local Optima in Planning as Inference for Partially Observable Domains

  • Pascal Poupart
  • Tobias Lang
  • Marc Toussaint
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6912)


Planning as inference has recently emerged as a versatile approach to decision-theoretic planning and reinforcement learning for single- and multi-agent systems in fully and partially observable domains with discrete and continuous variables. Because planning as inference tackles a non-convex optimization problem when states are partially observable, techniques are needed that can robustly escape local optima. We investigate the local optima of finite state controllers in single-agent partially observable Markov decision processes (POMDPs) that are optimized by expectation maximization (EM). We show that EM converges to controllers that are optimal with respect to a one-step lookahead. To escape local optima, we propose two algorithms: the first adds nodes to the controller to ensure optimality with respect to a multi-step lookahead, while the second splits nodes greedily to improve reward likelihood. The approaches are demonstrated empirically on benchmark problems.
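For readers unfamiliar with finite state controllers, the sketch below shows one common way to compute the value of a fixed stochastic controller on a discrete POMDP by iterating the policy-evaluation recursion. It is not the EM procedure or either of the escape algorithms from the paper. The function name `evaluate_fsc`, the array layouts, and the choice to let node transitions depend only on the observation are illustrative assumptions.

```python
def evaluate_fsc(T, Z, R, pi, eta, gamma, tol=1e-10, max_iters=100000):
    """Iteratively evaluate a stochastic finite state controller on a POMDP.

    Hypothetical layout (an assumption for this sketch):
      T[s][a][s2]   : probability of reaching state s2 from s under action a
      Z[a][s2][o]   : probability of observing o after action a lands in s2
      R[s][a]       : immediate reward
      pi[n][a]      : probability that controller node n selects action a
      eta[n][o][n2] : probability of moving from node n to n2 on observation o
    Returns V, where V[n][s] is the expected discounted return when the
    controller is in node n and the hidden state is s.
    """
    n_nodes, n_states = len(pi), len(T)
    n_actions, n_obs = len(R[0]), len(Z[0][0])
    V = [[0.0] * n_states for _ in range(n_nodes)]
    for _ in range(max_iters):
        V_new = [[0.0] * n_states for _ in range(n_nodes)]
        for n in range(n_nodes):
            for s in range(n_states):
                total = 0.0
                for a in range(n_actions):
                    future = 0.0
                    for s2 in range(n_states):
                        for o in range(n_obs):
                            # Expected continuation value over successor nodes.
                            cont = sum(eta[n][o][n2] * V[n2][s2]
                                       for n2 in range(n_nodes))
                            future += T[s][a][s2] * Z[a][s2][o] * cont
                    total += pi[n][a] * (R[s][a] + gamma * future)
                V_new[n][s] = total
        # Largest change in any entry; stop once the recursion has settled.
        delta = max(abs(V_new[n][s] - V[n][s])
                    for n in range(n_nodes) for s in range(n_states))
        V = V_new
        if delta < tol:
            break
    return V


# Toy model (made up for illustration): one state, one action, one
# observation, reward 1 per step, and a single-node controller.
T = [[[1.0]]]
Z = [[[1.0]]]
R = [[1.0]]
pi = [[1.0]]
eta = [[[1.0]]]
V = evaluate_fsc(T, Z, R, pi, eta, gamma=0.9)
```

Given an initial belief over states and an initial node, the controller's value is the belief-weighted sum of the corresponding row of `V`. This is the quantity that any controller optimization method, EM included, aims to increase.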


Keywords (machine-generated, not supplied by the authors): Local Optimum; Initial Node; Policy Iteration; Partially Observable Markov Decision Process; Successor Node



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Pascal Poupart (1)
  • Tobias Lang (2)
  • Marc Toussaint (2)
  1. David R. Cheriton School of Computer Science, University of Waterloo, Canada
  2. Machine Learning and Robotics Lab, FU Berlin, Berlin, Germany
