
Potential-based reward shaping for finite horizon online POMDP planning

Abstract

In this paper, we address the problem of suboptimal behavior during online partially observable Markov decision process (POMDP) planning caused by time constraints on planning. Taking inspiration from the related field of reinforcement learning (RL), our solution is to shape the agent’s reward function so that the agent is led toward large future rewards without having to spend as much time explicitly estimating cumulative future rewards, freeing time to improve the breadth of planning and build higher-quality plans. Specifically, we extend potential-based reward shaping (PBRS) from RL to online POMDP planning. In our extension, information about belief states is added to the function optimized by the agent during planning. This information provides hints of where the agent might find high future rewards beyond its planning horizon, and thus achieve greater cumulative rewards. We develop novel potential functions measuring information useful for agent metareasoning in POMDPs (reflecting on agent knowledge and/or histories of experience with the environment), theoretically prove several important properties and benefits of using PBRS for online POMDP planning, and empirically demonstrate these results on a range of classic benchmark POMDP planning problems.
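The shaping mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the potential function used here (probability mass on the most likely state) is a hypothetical certainty measure chosen for simplicity.

```python
def shaped_reward(reward, potential, belief, next_belief, gamma=0.95):
    # Potential-based shaping (Ng et al. [14]) lifted to belief states:
    #   R'(b, a, b') = R(b, a) + gamma * Phi(b') - Phi(b)
    return reward + gamma * potential(next_belief) - potential(belief)

# Hypothetical potential: probability mass on the most likely state,
# a simple measure of how certain the agent's belief is.
def most_likely_mass(belief):
    return max(belief)

# Transitioning from an uncertain belief to a more certain one earns a bonus.
r = shaped_reward(1.0, most_likely_mass, [0.5, 0.5], [0.9, 0.1], gamma=1.0)
# With gamma = 1, the bonus is Phi(b') - Phi(b) = 0.9 - 0.5 = 0.4, so r = 1.4.
```

The shaping term rewards moves toward more certain beliefs during planning, hinting at value beyond the planning horizon without requiring deeper lookahead.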



Notes

  1. Sorg et al. [22] also propose applying their optimal reward framework to MDPs, which is slightly different from PBRS in that it allows path-dependent reward modifications (as opposed to shaping only the values at leaf and initial situations in PBRS, cf. Sect. 3.2). However, they note that in full-breadth planning (as considered in this paper), optimal rewards are equivalent to leaf heuristics, and thus also to PBRS. Therefore, for the remainder of the paper we refer only to leaf evaluation heuristics, but the same discussions apply to optimal rewards as well.

  2. We consider the negative of the entropy because entropy measures uncertainty, the complement of certainty.
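As a concrete sketch (ours, not code from the paper), the negative-entropy potential and the ordering it induces over beliefs:

```python
import math

def neg_entropy(belief):
    # Phi(b) = -H(b) = sum_s b(s) * log b(s); closer to 0 for more certain beliefs.
    return sum(p * math.log(p) for p in belief if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain belief
certain = [1.0, 0.0, 0.0, 0.0]      # fully certain belief

# Negative entropy ranks the certain belief above the uncertain one,
# so shaping with it rewards information-gathering actions.
assert neg_entropy(certain) > neg_entropy(uniform)
```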

  3. This example is based on the RockSample benchmark problem described in more detail in Sect. 4.1.2 and used in our experimental study evaluating the empirical performance of PBRS for online POMDP planning.

  4. On the other hand, if we used potential function values to determine how to expand plans, they would simply be heuristic functions and the result would be a standard heuristic search algorithm. Because our potential functions are instead used to evaluate action values, they are orthogonal to heuristic functions.

  5. To increase the complexity of the RockSample benchmark and make it more uncertain, and thus more similar to the other benchmark problems considered in this research, we increased the uncertainty of the observations returned when checking rocks by decreasing the half-efficiency distance of sensing from 20 to 1. This is similar to changes made in other experimental studies, including the similar FieldVisionRockSample problem considered in [18, 25].

  6. We use a different range of allotted times \(\tau \) for different problems due to the different sizes of the POMDPs, resulting in different exponential growth of the planning trees calculated by the agents.

  7. A mixed observability MDP (MOMDP) is a special POMDP representation that factors the state space into fully observable variables \(\mathcal{X}\) and partially observable variables \(\mathcal{Y}\), such that \(S=\mathcal{X}\times \mathcal{Y}\), and exploits this factorization to simplify the transition and observation probability calculations and thereby speed up computation. The resulting model is equivalent to the canonical, unfactored POMDP representation of the same problem, but faster to compute with [15].
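The savings from the MOMDP factorization can be illustrated with a toy size comparison (the numbers below are hypothetical, not from the paper):

```python
# A MOMDP agent tracks its fully observable variables x exactly and keeps a
# belief distribution only over the partially observable variables y.
num_x = 50  # e.g., robot positions (fully observable)
num_y = 8   # e.g., hidden rock states (partially observable)

flat_belief_dim = num_x * num_y  # canonical POMDP: distribution over S = X x Y
momdp_belief_dim = num_y         # MOMDP: observed x, plus a distribution over Y

# The factored belief is far smaller (8 vs. 400 dimensions here),
# which speeds up belief updates and planning.
```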

  8. Without time constraints, explicit calculations would always be superior because the agent could simply continue planning deeper throughout the entire planning tree. But with time constraints, the agent must of course sacrifice some breadth for depth, causing under- or over-estimations of agent rewards for some belief states, as discussed in Sect. 2.2.

  9. Available online at http://bigbird.comp.nus.edu.sg/pmwiki/farm/appl/index.php?n=Main.DownloadDespot.

References

  1. Araya-Lopez, M., Buffet, O., Thomas, V., & Charpillet, F. (2010). A POMDP extension with belief-dependent rewards. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems (NIPS’10) (pp. 64–72). Vancouver, B.C., Canada, December 6–9, 2010.

  2. Asmuth, J., Littman, M. L., & Zinkov, R. (2008). Potential-based shaping in model-based reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI’08) (pp. 604–609). Chicago, IL, July 13–17, 2008.

  3. Bertsekas, D. P., & Castanon, D. A. (1999). Rollout algorithms for stochastic scheduling problems. Journal of Heuristics, 5, 89–108.


  4. Boutilier, C. (2002). A POMDP formulation of preference elicitation problems. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI’02) (pp. 239–246). Edmonton, Alberta, Canada, July 28–August 1, 2002.

  5. Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.


  6. Devlin, S., & Kudenko, D. (2011). Theoretical considerations of potential-based reward shaping for multi-agent systems. In K. Tumer, P. Yolum, L. Sonenberg, & P. Stone (Eds.), Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS’11) (pp. 225–232). Taipei, Taiwan, May 2–6, 2011.

  7. Devlin, S., & Kudenko, D. (2012). Dynamic potential-based reward shaping. In V. Conitzer, M. Winikoff, L. Padgham, & W. van der Hoek (Eds.), Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS’12). Valencia, Spain, June 6–8, 2012.

  8. Doshi, F., & Roy, N. (2008). The permutable POMDP: Fast solutions to POMDPs for preference elicitation. In L. Padgham, D. C. Parkes, J. Muller & S. Parsons (Eds.), Proceedings of the 7th International Conference on Autonomous Agents and Multiagent Systems (AAMAS’08) (pp. 493–500). Estoril, Portugal, May 12–16, 2008.

  9. Eck, A., Soh, L.-K., Devlin, S., & Kudenko, D. (2013). Potential-based reward shaping for POMDPs (Extended Abstract). In T. Ito, C. Jonker, M. Gini, & O. Shehory (Eds.), Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS’13). Saint Paul, Minnesota, May 8–10, 2013.

  10. Hauskrecht, M. (2000). Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13, 33–94.


  11. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99–134.


  12. Kurniawati, H., Hsu, D., & Lee, W. S. (2008). SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proceedings of the 2008 Robotics: Science and Systems Conference (RSS ’08).

  13. Mihaylova, L. et al. (2002). Active sensing for robotics—A survey. In Proceedings of the 5th International Conference on Numerical Methods and Applications (NM&A’02). Borovets, Bulgaria, August 20–24, 2002.

  14. Ng, A. Y., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping, In Proceedings of the 16th International Conference on Machine Learning (ICML’99) (pp. 278–287). Bled, Slovenia, June 27–30, 1999.

  15. Ong, S. C. W., Png, S. W., Hsu, D., & Lee, W. S. (2010). Planning under uncertainty for robotic tasks with mixed observability. International Journal of Robotics Research, 29(8), 1053–1068.


  16. Pineau, J., Gordon, G., & Thrun, S. (2003). Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI’03) (pp. 1025–1032). Acapulco, Mexico, August 9–15, 2003.

  17. Ross, S., & Chaib-draa, B. (2007). AEMS: An anytime online search algorithm for approximate policy refinement in large POMDPs. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07) (pp. 2592–2598). Hyderabad, India, January 6–12, 2007.

  18. Ross, S., Pineau, J., Paquet, S., & Chaib-draa, B. (2008). Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32, 663–704.


  19. Silver, D., & Veness, J. (2010). Monte-Carlo planning in large POMDPs. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems (NIPS’10) (pp. 2164–2172). Vancouver, B.C., Canada, December 6–9, 2010.

  20. Smith, T., & Simmons, R. (2004). Heuristic search value iteration for POMDPs. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI’04) (pp. 520–527). Banff, Alberta, Canada, July 7–11, 2004.

  21. Somani, A., Ye, N., Hsu, D., & Lee, W. S. (2013). DESPOT: Online POMDP planning with regularization. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS’13). Lake Tahoe, NV, December 5–10, 2013.

  22. Sorg, J., Singh, S., & Lewis, R. L. (2011). Optimal rewards versus leaf-evaluation heuristics in planning agents. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI’11) (pp. 465–470). San Francisco, CA, August 7–11, 2011.

  23. Spaan, M. T. J., Veiga, T. S., & Lima, P. U. (2010). Active cooperative perception in networked robotic systems using POMDPs. In Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’10) (pp. 4800–4805). Taipei, Taiwan, October 18–22.

  24. Williams, J. D., & Young, S. (2007). Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21, 393–422.


  25. Zhang, Z. & Chen, X. (2012). FHHOP: A factored heuristic online planning algorithm for POMDPs. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI’12) (pp. 934–943). Catalina Island, USA, August 15–17, 2012.


Acknowledgments

This research was partially supported by a National Science Foundation Graduate Research Fellowship (DGE-054850) and a Grant from the National Science Foundation (SES-1132015).

Author information


Corresponding author

Correspondence to Adam Eck.

About this article


Cite this article

Eck, A., Soh, LK., Devlin, S. et al. Potential-based reward shaping for finite horizon online POMDP planning. Auton Agent Multi-Agent Syst 30, 403–445 (2016). https://doi.org/10.1007/s10458-015-9292-6


Keywords

  • POMDP
  • Potential-based reward shaping
  • Online planning