Applied Intelligence, Volume 48, Issue 4, pp 886–908

Verification and repair of control policies for safe reinforcement learning

  • Shashank Pathak
  • Luca Pulina
  • Armando Tacchella


Abstract

Reinforcement learning is a well-known AI paradigm whereby control policies of autonomous agents can be synthesized incrementally, with little or no prior knowledge of the environment. We are concerned with the safety of agents whose policies are learned by reinforcement, i.e., we wish to bound the risk that, once learning is over, an agent damages either the environment or itself. We propose a general-purpose automated methodology to verify policies, i.e., establish risk bounds, and to repair them, i.e., fix policies so that they comply with stated risk bounds. Our approach is based on probabilistic model checking algorithms and tools, which provide theoretical and practical means to verify risk bounds and repair policies. Using a taxonomy of potential repair approaches tested on an artificially generated parametric domain, we show that our methodology is also more effective than comparable ones.
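The verification step described above can be illustrated on a toy example: a fixed policy induces a discrete-time Markov chain (DTMC), and the risk bound amounts to the probability of eventually reaching an "unsafe" absorbing state. The sketch below, a minimal illustration not taken from the paper (the chain, state names, and bound are all invented for this example), computes that reachability probability by fixed-point iteration and compares it against a stated risk bound, as a dedicated probabilistic model checker such as PRISM would do at scale.

```python
def reach_probability(transitions, unsafe, start, iters=10_000, tol=1e-12):
    """Probability of eventually reaching `unsafe` from `start` in a DTMC.

    transitions: dict mapping each state to a list of (successor, probability)
    pairs. Uses fixed-point (value) iteration on p(s) = sum_t P(s,t) * p(t),
    with p(unsafe) = 1 held constant.
    """
    states = set(transitions) | {t for succ in transitions.values()
                                 for t, _ in succ}
    p = {s: 1.0 if s == unsafe else 0.0 for s in states}
    for _ in range(iters):
        delta = 0.0
        for s, succ in transitions.items():
            if s == unsafe:
                continue  # absorbing unsafe state keeps probability 1
            new = sum(prob * p[t] for t, prob in succ)
            delta = max(delta, abs(new - p[s]))
            p[s] = new
        if delta < tol:
            break
    return p[start]

# Toy 4-state chain induced by a fixed policy: "goal" and "unsafe" absorb.
dtmc = {
    "s0": [("s1", 0.9), ("unsafe", 0.1)],
    "s1": [("goal", 0.8), ("s0", 0.15), ("unsafe", 0.05)],
    "goal": [("goal", 1.0)],
    "unsafe": [("unsafe", 1.0)],
}

risk = reach_probability(dtmc, "unsafe", "s0")
print(f"risk = {risk:.4f}")  # prints "risk = 0.1676"
# Verification against a (hypothetical) stated bound of 0.1:
print("policy needs repair" if risk > 0.1 else "policy is safe")
```

Here the computed risk (≈ 0.168) exceeds the bound, so the policy would be flagged for repair, i.e., its transition probabilities would be perturbed until the bound is met.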


Keywords: Robust AI · Reinforcement learning · Probabilistic model checking



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Department of Aerospace Engineering, Technion – Israel Institute of Technology, Technion City, Israel
  2. POLCOMING, Università degli Studi di Sassari, Sassari, Italy
  3. DIBRIS, Università degli Studi di Genova, Genova, Italy
