Teaching Stratego to Play Ball: Optimal Synthesis for Continuous Space MDPs

  • Manfred Jaeger
  • Peter Gjøl Jensen
  • Kim Guldstrand Larsen
  • Axel Legay
  • Sean Sedwards
  • Jakob Haahr Taankvist
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11781)


Formal models of cyber-physical systems, such as priced timed Markov decision processes, require a state space with both continuous and discrete components. The controller synthesis problem for such systems can then be cast as finding optimal strategies for Markov decision processes over a Euclidean state space. We develop two reinforcement learning strategies that tackle continuous state spaces via online partition refinement techniques, and we provide theoretical insights into the convergence of partition refinement schemes. Our techniques are implemented in Uppaal Stratego. Experimental results show the advantages of our new techniques over the previous optimization algorithms of Uppaal Stratego.
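The general idea sketched in the abstract, learning a strategy over a finite partition of a continuous state space and refining that partition online, can be illustrated as follows. This is a hedged, minimal sketch only: the class name, parameters, and the splitting rule (split a cell once it is frequently visited but its temporal-difference error stays high) are illustrative assumptions, not the paper's actual algorithm.

```python
import bisect
import random


class PartitionQLearner:
    """Q-learning over an adaptively refined partition of the interval [0, 1).

    Illustrative sketch: each partition cell carries one Q-value row; a cell
    is split at its midpoint when it has been visited often yet its running
    mean absolute TD error remains high, and both children inherit its row.
    """

    def __init__(self, actions, alpha=0.1, gamma=0.95,
                 split_visits=50, split_tderr=0.3):
        self.actions = list(actions)
        self.alpha, self.gamma = alpha, gamma
        self.split_visits, self.split_tderr = split_visits, split_tderr
        self.bounds = [0.0, 1.0]                    # cell boundaries
        self.q = [{a: 0.0 for a in self.actions}]   # one Q-row per cell
        self.visits = [0]
        self.tderr = [0.0]                          # running mean |TD error|

    def cell(self, s):
        i = bisect.bisect_right(self.bounds, s) - 1
        return max(0, min(i, len(self.q) - 1))

    def act(self, s, eps=0.1):
        if random.random() < eps:
            return random.choice(self.actions)
        qs = self.q[self.cell(s)]
        return max(qs, key=qs.get)

    def update(self, s, a, r, s_next, done):
        i = self.cell(s)
        target = r if done else r + self.gamma * max(self.q[self.cell(s_next)].values())
        td = target - self.q[i][a]
        self.q[i][a] += self.alpha * td
        self.visits[i] += 1
        self.tderr[i] += (abs(td) - self.tderr[i]) / self.visits[i]
        # Online refinement: split a well-visited but poorly predicted cell.
        if self.visits[i] >= self.split_visits and self.tderr[i] > self.split_tderr:
            mid = 0.5 * (self.bounds[i] + self.bounds[i + 1])
            self.bounds.insert(i + 1, mid)
            self.q.insert(i + 1, dict(self.q[i]))   # children inherit the Q-row
            self.visits[i] = 0
            self.visits.insert(i + 1, 0)
            self.tderr[i] = 0.0
            self.tderr.insert(i + 1, 0.0)
```

On a toy one-step task whose optimal action flips at s = 0.5, the learner first discovers that the single initial cell predicts poorly, splits it, and then converges to the correct action in each pure cell.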



This work is partly supported by the Innovation Fund Denmark center DiCyPS, the ERC Advanced Grant LASSO, and the JST ERATO project: HASUO Metamathematics for Systems Design (JPMJER1603).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Manfred Jaeger (1)
  • Peter Gjøl Jensen (1, corresponding author)
  • Kim Guldstrand Larsen (1)
  • Axel Legay (1, 2)
  • Sean Sedwards (3)
  • Jakob Haahr Taankvist (1)

  1. Department of Computer Science, Aalborg University, Aalborg, Denmark
  2. Université catholique de Louvain, Ottignies-Louvain-la-Neuve, Belgium
  3. University of Waterloo, Waterloo, Canada
