OnPlan: A Framework for Simulation-Based Online Planning

  • Lenz Belzner
  • Rolf Hennicker
  • Martin Wirsing
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9539)


This paper proposes the OnPlan framework for modeling autonomous systems operating in domains with large probabilistic state spaces and high branching factors. The framework defines components for acting and deliberation, and specifies their interactions. It comprises a mathematical specification of requirements for autonomous systems. We discuss the role of such a specification in the context of simulation-based online planning. We also consider two instantiations of the framework: Monte Carlo Tree Search for discrete domains, and Cross Entropy Open Loop Planning for continuous state and action spaces. The framework’s ability to provide system autonomy is illustrated empirically on a robotic rescue example.


Reward Function System Goal Planning Depth Cross Entropy Domain Dynamic 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



The authors thank Andrea Vandin for his help with the MultiVeStA statistical model checker [17].


  1. 1.
    Kolobov, A., Dai, P., Mausam, M., Weld, D.S.: Reverse iterative deepening for finite-horizon MDPS with large branching factors. In: Proceedings of the 22nd International Conference on Automated Planning and Scheduling, ICAPS (2012)Google Scholar
  2. 2.
    Keller, T., Helmert, M.: Trial-based Heuristic Tree Search for Finite Horizon MDPs. In: Proceedings of the 23rd International Conference on Automated Planning and Scheduling (ICAPS 2013), pp. 135–143. AAAI Press, June 2013Google Scholar
  3. 3.
    Weinstein, A.: Local Planning for Continuous Markov Decision Processes. Ph.D. thesis, Rutgers, The State University of New Jersey (2014)Google Scholar
  4. 4.
    Kephart, J.: An architectural blueprint for autonomic computing. IBM (2003)Google Scholar
  5. 5.
    Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97–109 (1970)zbMATHMathSciNetCrossRefGoogle Scholar
  6. 6.
    Rubinstein, R.Y., Kroese, D.P.: Simulation and the Monte Carlo Method, vol. 707. Wiley, New York (2011)Google Scholar
  7. 7.
    Rubinstein, R.Y., Kroese, D.P.: The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning. Springer Science & Business Media, New York (2013)CrossRefGoogle Scholar
  8. 8.
    Audibert, J.Y., Munos, R., Szepesvári, C.: Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theor. Comput. Sci. 410(19), 1876–1902 (2009)zbMATHCrossRefGoogle Scholar
  9. 9.
    Browne, C.B., Powley, E., Whitehouse, D., Lucas, S.M., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., Colton, S.: A survey of monte carlo tree search methods. IEEE Trans. Comput. Intell. AI Game 4(1), 1–43 (2012)CrossRefGoogle Scholar
  10. 10.
    Kocsis, L., Szepesvári, C.: Bandit based monte-carlo planning. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 282–293. Springer, Heidelberg (2006)CrossRefGoogle Scholar
  11. 11.
    Weinstein, A., Littman, M.L.: Open-loop planning in large-scale stochastic domains. In: Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (2013)Google Scholar
  12. 12.
    Gelly, S., Kocsis, L., Schoenauer, M., Sebag, M., Silver, D., Szepesvári, C., Teytaud, O.: The grand challenge of computer go: Monte carlo tree search and extensions. Commun. ACM 55(3), 106–113 (2012)CrossRefGoogle Scholar
  13. 13.
    Silver, D., Sutton, R.S., Müller, M.: Temporal-difference search in computer go. In: Borrajo, D., Kambhampati, S., Oddi, A., Fratini, S. (eds.) Proceedings of the Twenty-Third International Conference on Automated Planning and Scheduling, ICAPS 2013, Rome, Italy, June 10–14, 2013. AAAI (2013)Google Scholar
  14. 14.
    Gelly, S., Silver, D.: Monte-carlo tree search and rapid action value estimation in computer go. Artif. Intell. 175(11), 1856–1875 (2011)MathSciNetCrossRefGoogle Scholar
  15. 15.
    Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Found. Trends Mach. Learn. 5(1), 1–122 (2012)zbMATHCrossRefGoogle Scholar
  16. 16.
    Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2–3), 235–256 (2002)zbMATHCrossRefGoogle Scholar
  17. 17.
    Sebastio, S., Vandin, A.: Multivesta: Statistical model checking for discrete event simulators. In: Proceedings of the 7th International Conference on Performance Evaluation Methodologies and Tools, ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), pp. 310–315 (2013)Google Scholar
  18. 18.
    de Boer, P., Kroese, D.P., Mannor, S., Rubinstein, R.Y.: A tutorial on the cross-entropy method. Annals OR 134(1), 19–67 (2005)zbMATHCrossRefGoogle Scholar
  19. 19.
    Margolin, L.: On the convergence of the cross-entropy method. Ann. Oper. Res. 134(1), 201–214 (2005)zbMATHMathSciNetCrossRefGoogle Scholar
  20. 20.
    Kobilarov, M.: Cross-entropy motion planning. I. J. Robotic Res. 31(7), 855–871 (2012)CrossRefGoogle Scholar
  21. 21.
    Livingston, S.C., Wolff, E.M., Murray, R.M.: Cross-entropy temporal logic motion planning. In: Proceedings of the 18th International Conference on Hybrid Systems: Computation and Control, HSCC 2015, pp. 269–278 (2015)Google Scholar
  22. 22.
    Box, G.E., Muller, M.E.: A note on the generation of random normal deviates. Ann. Math. Stat. 29, 610–611 (1958)zbMATHCrossRefGoogle Scholar
  23. 23.
    Hester, T., Stone, P.: Texplore: real-time sample-efficient reinforcement learning for robots. Mach. Learn. 90(3), 385–429 (2013)MathSciNetCrossRefGoogle Scholar
  24. 24.
    Bonet, B., Geffner, H.: Labeled RTDP: Improving the convergence of real-time dynamic programming. In: ICAPS, vol. 3, pp. 12–21 (2003)Google Scholar
  25. 25.
    Karnin, Z., Koren, T., Somekh, O.: Almost optimal exploration in multi-armed bandits. In: Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1238–1246 (2013)Google Scholar
  26. 26.
    Cazenave, T., Pepels, T., Winands, M.H.M., Lanctot, M.: Minimizing simple and cumulative regret in monte-carlo tree search. In: Cazenave, T., Winands, M.H.M., Björnsson, Y. (eds.) CGW 2014. CCIS, vol. 504, pp. 1–15. Springer, Heidelberg (2014)CrossRefGoogle Scholar
  27. 27.
    Mansley, C.R., Weinstein, A., Littman, M.L.: Sample-based planning for continuous action markov decision processes. In: Proceedings of the 21st International Conference on Automated Planning and Scheduling, ICAPS (2011)Google Scholar
  28. 28.
    Weinstein, A., Littman, M.L.: Bandit-based planning and learning in continuous-action markov decision processes. In: Proceedings of the 22nd International Conference on Automated Planning and Scheduling, ICAPS (2012)Google Scholar
  29. 29.
    Baier, C., Katoen, J.P., et al.: Principles of Model Checking, vol. 26202649. MIT Press, Cambridge (2008)zbMATHGoogle Scholar
  30. 30.
    Wirsing, M., Hölzl, M., Koch, N., Mayer, P. (eds.): Software Engineering for Collective Autonomic Systems: Results of the ASCENS Project. LNCS, vol. 8998. Springer, Heidelberg (2015)Google Scholar
  31. 31.
    Hölzl, M.M., Gabor, T.: Continuous collaboration: A case study on the development of an adaptive cyber-physical system. In: 1st IEEE/ACM International Workshop on Software Engineering for Smart Cyber-Physical Systems, SEsCPS 2015, pp. 19–25 (2015)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. 1.Institut für InformatikLudwig-Maximilians-Universität MünchenMunichGermany

Personalised recommendations