Assurance in Reinforcement Learning Using Quantitative Verification

  • George Mason
  • Radu Calinescu
  • Daniel Kudenko
  • Alec Banks
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 85)


Reinforcement learning (RL) agents converge to optimal solutions for sequential decision making problems. Although increasingly successful, RL cannot be used in applications where unpredictable agent behaviour may have significant unintended negative consequences. We address this limitation by introducing an assured reinforcement learning (ARL) method which uses quantitative verification (QV) to restrict the agent behaviour to areas that satisfy safety, reliability and performance constraints specified in probabilistic temporal logic. To this end, ARL builds an abstract Markov decision process (AMDP) that models the problem to solve at a high level, and uses QV to identify a set of Pareto-optimal AMDP policies that satisfy the constraints. These formally verified abstract policies define areas of the agent behaviour space where RL can occur without constraint violations. We show the effectiveness of our ARL method through two case studies: a benchmark flag-collection navigation task and an assisted-living planning system.
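The core idea described above can be illustrated with a minimal sketch. Everything in this example is a hypothetical toy setting, not taken from the paper: a six-cell corridor where the agent must reach a goal cell while a hazard cell sits in its path. The "verified abstract policy" is represented simply as a set of actions permitted in each abstract state, standing in for an AMDP policy that QV is assumed to have checked against a PCTL constraint such as P>=0.9 [ F goal ]. Q-learning then explores only within the allowed actions, so learning cannot violate the (assumed) verified constraints.

```python
import random

# Hypothetical toy setting (not from the paper): a 1-D corridor of 6 cells.
# The agent starts at cell 0 and must reach cell 5; cell 3 is hazardous.
N, GOAL, HAZARD = 6, 5, 3
ACTIONS = (-1, +1)  # move left / move right

# Abstract states group concrete cells. The abstract policy, assumed to have
# been verified by QV against a PCTL constraint, forbids all actions except
# moving right while near the hazard, so the agent can only pass through it.
def abstract(s):
    return "near_hazard" if s in (2, 3, 4) else "safe"

ALLOWED = {"safe": {-1, +1}, "near_hazard": {+1}}

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    r = 10.0 if s2 == GOAL else (-5.0 if s2 == HAZARD else -1.0)
    return s2, r, s2 == GOAL

def q_learn(episodes=500, alpha=0.5, gamma=0.95, eps=0.2, seed=0):
    """Standard tabular Q-learning, restricted to actions the abstract
    policy allows in the current abstract state."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            legal = [a for a in ACTIONS if a in ALLOWED[abstract(s)]]
            if rng.random() < eps:
                a = rng.choice(legal)          # explore within allowed actions
            else:
                a = max(legal, key=lambda a: Q[(s, a)])  # exploit
            s2, r, done = step(s, a)
            best_next = max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learn()
# Greedy policy, restricted to the allowed actions in each state.
policy = {s: max([a for a in ACTIONS if a in ALLOWED[abstract(s)]],
                 key=lambda a: Q[(s, a)]) for s in range(N - 1)}
```

The design choice mirrors the paper's two-level structure: verification happens once, offline, over the small abstract model, while learning stays unconstrained inside the pre-approved region of behaviour. In the actual ARL method the allowed regions come from Pareto-optimal AMDP policies checked by a probabilistic model checker, not from a hand-written table as here.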


Keywords: Reinforcement learning · Safety · Quantitative verification · Abstract Markov decision processes



This paper presents research sponsored by the UK MOD. The information contained in it should not be interpreted as representing the views of the UK MOD, nor should it be assumed it reflects any current or future UK MOD policy.



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • George Mason (1), email author
  • Radu Calinescu (1)
  • Daniel Kudenko (1, 2)
  • Alec Banks (3)
  1. Department of Computer Science, University of York, York, UK
  2. Saint Petersburg National Research Academic University of the Russian Academy of Sciences, St. Petersburg, Russia
  3. Defence Science and Technology Laboratory, Salisbury, UK
