Deep Reinforcement Learning with Temporal Logics

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12288)


The combination of data-driven learning methods with formal reasoning has seen a surge of interest, as each area has the potential to bolster the other. For instance, formal methods promise to expand the use of state-of-the-art learning approaches in the direction of certification and sample efficiency. In this work, we propose a deep Reinforcement Learning (RL) method for policy synthesis in continuous-state/action unknown environments, under requirements expressed in Linear Temporal Logic (LTL). We show that this combination lifts the applicability of deep RL to complex temporal and memory-dependent policy synthesis goals. We express an LTL specification as a Limit-Deterministic Büchi Automaton (LDBA) and synchronise it on-the-fly with the agent/environment. In practice, the LDBA monitors the environment, acting as a modular reward machine for the agent: accordingly, a modular Deep Deterministic Policy Gradient (DDPG) architecture is proposed to generate a low-level control policy that maximises the probability of satisfying the given LTL formula. We evaluate our framework on a cart-pole example and on a Mars rover experiment, where we achieve near-perfect success rates, while baselines based on standard RL are shown to fail in practice.
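
To make the on-the-fly synchronisation concrete, the following is a minimal sketch, not the authors' implementation: it assumes a classic Gym-style environment (whose `step` returns `(obs, reward, done, info)`), a hypothetical labelling function mapping observations to sets of atomic propositions, and a hand-coded LDBA transition table. All names are illustrative assumptions.

```python
# Sketch of the product agent/environment/automaton loop described in the
# abstract: an LDBA tracks the LTL specification while the agent acts, and
# the automaton serves as a reward machine by rewarding accepting transitions.

from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Set, Tuple

Label = FrozenSet[str]  # set of atomic propositions true in the current state


@dataclass
class LDBA:
    """Limit-Deterministic Büchi Automaton over atomic-proposition labels."""
    initial: int
    delta: Dict[Tuple[int, Label], int]  # (automaton state, label) -> state
    accepting: Set[int]                  # Büchi accepting states

    def __post_init__(self) -> None:
        self.state = self.initial

    def reset(self) -> int:
        self.state = self.initial
        return self.state

    def step(self, label: Label) -> Tuple[int, bool]:
        # A missing transition is treated as a rejecting sink (-1): the run
        # has violated the specification and can never be accepting again.
        self.state = self.delta.get((self.state, label), -1)
        return self.state, self.state in self.accepting


def synchronised_step(env, ldba: LDBA, action,
                      labeller: Callable[[object], Label]):
    """One step of the synchronised product: environment, then automaton."""
    next_obs, _, done, info = env.step(action)     # ignore the native reward
    q, accepting = ldba.step(labeller(next_obs))   # advance the automaton
    reward = 1.0 if accepting else 0.0             # reward machine behaviour
    done = done or q == -1                         # stop on violation sink
    return (next_obs, q), reward, done, info
```

In the modular DDPG architecture the automaton component `q` of the product state `(next_obs, q)` would plausibly index a dedicated actor-critic module, so the low-level control policy can switch behaviour as the task progresses through the phases of the LTL specification.
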


Keywords: Model-free reinforcement learning · Deep learning · Linear temporal logic · Continuous-state and continuous-action Markov decision processes



The authors would like to thank Lim Zun Yuan for valuable discussions and technical support, and the anonymous reviewers for their feedback on previous drafts of this manuscript.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Department of Computer Science, University of Oxford, Oxford, UK
