
Reinforcement Learning and Formal Requirements

  • Fabio Somenzi
  • Ashutosh Trivedi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11652)

Abstract

Reinforcement learning is an approach to controller synthesis in which agents rely on reward signals to choose actions that satisfy the requirements implicit in those signals. Often, non-experts must formulate the requirements and translate them into rewards under significant time pressure, even though manual translation is time consuming and error prone. Safety-critical applications of reinforcement learning therefore need a rigorous design methodology and, in particular, a principled approach to requirement specification and to the translation of objectives into the form required by reinforcement learning algorithms.

Formal logic provides a foundation for rigorous and unambiguous specification of learning objectives. However, reinforcement learning algorithms require those requirements to be expressed as scalar reward signals. We discuss a recent technique, called limit-reachability, that bridges this gap by faithfully translating logic-based requirements into the scalar rewards needed by model-free reinforcement learning. The technique enables the synthesis of controllers that maximize the probability of satisfying given logical requirements using off-the-shelf, model-free reinforcement learning algorithms.
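To make the idea concrete, the following is a minimal illustrative sketch of a limit-reachability-style reward construction, not the authors' implementation. It assumes the simple requirement "visit the goal state infinitely often" on a toy five-state chain MDP; the Büchi automaton for this requirement has a single state with an accepting self-loop on the goal, so the product MDP coincides with the MDP itself. The constant ZETA, the dynamics, and the learning parameters are arbitrary choices made only for this example: an accepting step pays reward 1 and ends the episode with small probability 1 - ZETA, and ordinary tabular Q-learning then maximizes the probability of collecting that reward.

import random

# Illustrative sketch only: limit-reachability-style reward for the
# requirement "visit the goal state infinitely often" on a toy chain MDP.
# ZETA, the dynamics, and the learning constants are assumptions made for
# this example, not values taken from the paper.
N_STATES = 5            # chain of states 0..4, state 4 is the goal
GOAL = 4
ACTIONS = [0, 1]        # 0 = step left, 1 = step right
ZETA = 0.99             # accepting step terminates with reward 1 w.p. 1 - ZETA
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

def step(s, a):
    # Toy dynamics: the intended move succeeds with probability 0.9,
    # otherwise the agent stays where it is.
    if random.random() < 0.9:
        s = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s

Q = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]

for episode in range(2000):
    s = 0
    for _ in range(200):  # bounded episode length
        # epsilon-greedy action selection
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[s][x])
        s2 = step(s, a)
        if s2 == GOAL and random.random() > ZETA:
            # Accepting step: with probability 1 - ZETA the run jumps to an
            # absorbing target and collects the only nonzero reward.
            Q[s][a] += ALPHA * (1.0 - Q[s][a])
            break
        # All other transitions pay reward 0.
        Q[s][a] += ALPHA * (GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

print("greedy action per state:", [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)])

Pushing ZETA toward 1 makes the terminating jump rarer, so a policy can only collect the reward reliably by returning to the goal again and again; this is the intuition behind reducing the logical objective to a reachability objective that standard model-free algorithms can optimize.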


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Electrical, Computer, and Energy Engineering, University of Colorado Boulder, Boulder, USA
  2. Department of Computer Science, University of Colorado Boulder, Boulder, USA
