Safety-Constrained Reinforcement Learning for MDPs

  • Sebastian Junges
  • Nils Jansen
  • Christian Dehnert
  • Ufuk Topcu
  • Joost-Pieter Katoen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9636)


We consider controller synthesis for stochastic and partially unknown environments in which safety is essential. Specifically, we abstract the problem as a Markov decision process in which the expected performance is measured using a cost function that is unknown prior to run-time exploration of the state space. Standard learning approaches synthesize cost-optimal strategies without guaranteeing safety properties. To remedy this, we first compute safe, permissive strategies. Exploration is then constrained to these strategies and thereby meets the imposed safety requirements. Via an iterative learning procedure, we obtain a strategy that is both safe and cost-optimal. We show correctness and completeness of the method and discuss several heuristics that increase its scalability. Finally, we demonstrate its applicability by means of a prototype implementation.
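The loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the state space, the `safe_actions` map (standing in for a precomputed safe permissive strategy), and the `step` cost model are all hypothetical example data, and plain Q-learning stands in for the learning procedure. The key point is only that action selection never leaves the permissive strategy.

```python
import random
from collections import defaultdict

# Hypothetical permissive strategy: for each state, the actions that a
# prior safety analysis left available (unsafe actions already pruned).
safe_actions = {
    "s0": ["a", "b"],
    "s1": ["a"],
    "s2": ["a", "b"],
}

# Hypothetical cost model; in the paper's setting, costs are only
# observed during run-time exploration.
def step(state, action):
    transitions = {
        ("s0", "a"): ("s1", 1.0),
        ("s0", "b"): ("s2", 2.0),
        ("s1", "a"): ("s0", 0.5),
        ("s2", "a"): ("s0", 0.5),
        ("s2", "b"): ("s1", 3.0),
    }
    return transitions[(state, action)]  # (next state, observed cost)

def q_learn(episodes=500, alpha=0.1, gamma=0.95, eps=0.2):
    """Cost-minimizing Q-learning restricted to the permissive strategy."""
    Q = defaultdict(float)
    state = "s0"
    for _ in range(episodes):
        # Exploration is constrained to the safe actions of the state.
        actions = safe_actions[state]
        if random.random() < eps:
            action = random.choice(actions)
        else:
            action = min(actions, key=lambda a: Q[(state, a)])
        nxt, cost = step(state, action)
        best_next = min(Q[(nxt, a)] for a in safe_actions[nxt])
        Q[(state, action)] += alpha * (cost + gamma * best_next
                                       - Q[(state, action)])
        state = nxt
    return Q

Q = q_learn()
# Every state-action pair ever explored respects the safe strategy.
assert all(a in safe_actions[s] for (s, a) in Q)
```

Because the agent can only ever select actions from `safe_actions`, the learned strategy is cost-optimal among the strategies the safety analysis permits.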



We thank Benjamin Lucien Kaminski for the valuable discussion on the worst-case size of conflicting sets.



Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Sebastian Junges (1)
  • Nils Jansen (1, 2)
  • Christian Dehnert (1)
  • Ufuk Topcu (2)
  • Joost-Pieter Katoen (1)
  1. RWTH Aachen University, Aachen, Germany
  2. University of Texas at Austin, Austin, USA
