
Autonomous Agents and Multi-Agent Systems, Volume 21, Issue 3, pp 293–320

Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs

  • Christopher Amato
  • Daniel S. Bernstein
  • Shlomo Zilberstein

Abstract

POMDPs and their decentralized multiagent counterparts, DEC-POMDPs, offer a rich framework for sequential decision making under uncertainty. Their high computational complexity, however, presents an important research challenge. One way to address the intractable memory requirements of current algorithms is to represent agent policies as finite-state controllers. Using this representation, we propose a new approach that formulates the problem as a nonlinear program (NLP), which defines an optimal policy of a desired size for each agent. This new formulation allows a wide range of powerful nonlinear programming algorithms to be used to solve POMDPs and DEC-POMDPs. Although solving the NLP optimally is often intractable, the results we obtain using an off-the-shelf optimization method are competitive with state-of-the-art POMDP algorithms and outperform state-of-the-art DEC-POMDP algorithms. Our approach is easy to implement and opens up promising research directions for solving POMDPs and DEC-POMDPs using nonlinear programming methods.
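To make the formulation concrete, the nonlinear program for a single-agent POMDP controller can be sketched as follows. This is a simplified rendering under standard POMDP notation, not necessarily the paper's exact parameterization: the decision variables are the action-selection probabilities \(\psi(a \mid q)\) and node-transition probabilities \(\eta(q' \mid q, a, o)\) of a fixed-size stochastic controller with nodes \(q \in Q\) and initial node \(q_0\), together with value variables \(V(q, s)\); \(b_0\), \(R\), \(P\), \(O\), and \(\gamma\) denote the initial belief, reward, transition, and observation models, and the discount factor.

\[
\max_{\psi,\, \eta,\, V} \;\; \sum_{s} b_0(s)\, V(q_0, s)
\]

subject to, for every controller node \(q\) and state \(s\),

\[
V(q, s) = \sum_{a} \psi(a \mid q) \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{o} O(o \mid s', a) \sum_{q'} \eta(q' \mid q, a, o)\, V(q', s') \Big],
\]

\[
\sum_{a} \psi(a \mid q) = 1, \qquad \sum_{q'} \eta(q' \mid q, a, o) = 1, \qquad \psi \ge 0, \quad \eta \ge 0.
\]

The value constraints multiply controller parameters by value variables, which is what makes the program nonlinear and nonconvex, while fixing the controller size \(|Q|\) bounds the number of variables and constraints. In the decentralized case, each agent has its own \(\psi\) and \(\eta\), and the value constraints couple the agents through the joint transition, observation, and reward model.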

Keywords

Decision theory · Multiagent systems · Planning under uncertainty · POMDPs · DEC-POMDPs


Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Christopher Amato (1)
  • Daniel S. Bernstein (1)
  • Shlomo Zilberstein (1)

  1. Department of Computer Science, University of Massachusetts, Amherst, USA
