
Deep Reinforcement Learning with Temporal Logics

  • Conference paper
  • First Online:
Formal Modeling and Analysis of Timed Systems (FORMATS 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12288)

Abstract

The combination of data-driven learning methods with formal reasoning has seen a surge of interest, as each area has the potential to bolster the other. For instance, formal methods promise to extend state-of-the-art learning approaches towards certification and sample efficiency. In this work, we propose a deep Reinforcement Learning (RL) method for policy synthesis in continuous-state/action unknown environments, under requirements expressed in Linear Temporal Logic (LTL). We show that this combination lifts the applicability of deep RL to complex temporal and memory-dependent policy synthesis goals. We express an LTL specification as a Limit Deterministic Büchi Automaton (LDBA) and synchronise it on-the-fly with the agent/environment. In practice, the LDBA monitors the environment and acts as a modular reward machine for the agent: accordingly, a modular Deep Deterministic Policy Gradient (DDPG) architecture is proposed to generate a low-level control policy that maximises the probability of satisfying the given LTL formula. We evaluate our framework on a cart-pole example and on a Mars rover experiment, where we achieve near-perfect success rates, while baselines based on standard RL are shown to fail in practice.
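As an illustration of the architecture sketched above, the following minimal Python example shows an LDBA-like monitor synchronised on-the-fly with a toy one-dimensional environment and acting as a reward machine, with a separate low-level controller attached to each automaton state. This is a hedged sketch rather than the authors' implementation: the automaton, the labelling function, the Controller class and the reward constant r_p are illustrative assumptions, and the DDPG learning updates are omitted.

```python
# Minimal sketch of the loop described in the abstract: an LDBA-like monitor
# is synchronised on-the-fly with the environment and acts as a reward
# machine, while a "modular" policy keeps one low-level controller per
# automaton state. All names here are illustrative assumptions, not the
# authors' implementation; the automaton is a toy for "avoid unsafe until goal".
import random

DELTA = {  # toy LDBA transitions: (state, label) -> state
    (0, "goal"): 1, (0, "unsafe"): 2, (0, "none"): 0,
    (1, "goal"): 1, (1, "unsafe"): 1, (1, "none"): 1,   # accepting sink
    (2, "goal"): 2, (2, "unsafe"): 2, (2, "none"): 2,   # rejecting sink
}
ACCEPTING = {1}
R_P = 1.0  # positive reward issued while an accepting set is visited


class Controller:
    """Stand-in for one DDPG actor: here just a fixed gain on the state."""
    def __init__(self, gain):
        self.gain = gain

    def act(self, x):
        return max(-1.0, min(1.0, self.gain * (10.0 - x)))


# Modular policy: the current LDBA state selects which controller is active.
MODULES = {0: Controller(0.3), 1: Controller(0.0), 2: Controller(0.0)}


def label(x):
    """Toy labelling function L: S -> 2^AP, collapsed to a single atom."""
    return "goal" if x >= 10.0 else ("unsafe" if x <= -10.0 else "none")


def episode(horizon=100):
    x, q, ret = 0.0, 0, 0.0                     # env state, LDBA state, return
    for _ in range(horizon):
        a = MODULES[q].act(x)                   # module chosen by automaton state
        x += a + random.uniform(-0.2, 0.2)      # toy 1-D dynamics
        q = DELTA[(q, label(x))]                # advance the monitor on the fly
        ret += R_P if q in ACCEPTING else 0.0
    return ret


if __name__ == "__main__":
    print("undiscounted return of one episode:", episode())
```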


Notes

  1. One-shot means that there is no need to master easy tasks first and then compose them to accomplish a more complex task.

  2. On-the-fly means that the algorithm tracks (or executes) the state of an underlying structure (or a function) without explicitly constructing it.
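As a toy illustration of this on-the-fly notion (an assumed sketch, not code from the paper), the monitor below consumes a stream of observation labels and stores only the current automaton state; the product of the environment with the automaton is never constructed explicitly.

```python
# Hypothetical example of on-the-fly tracking: only the current automaton
# state is kept while labels stream in; the (possibly huge or continuous)
# product state space is never built. The transition table is a toy.
def track(delta, q0, labels):
    """Yield the automaton state reached after each observed label."""
    q = q0
    for sigma in labels:
        q = delta[(q, sigma)]
        yield q

DELTA = {("wait", "request"): "serve", ("wait", "idle"): "wait",
         ("serve", "done"): "wait", ("serve", "request"): "serve"}

trace = ["idle", "request", "request", "done"]
print(list(track(DELTA, "wait", trace)))   # ['wait', 'serve', 'serve', 'wait']
```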


Acknowledgements

The authors would like to thank Lim Zun Yuan for valuable discussions and technical support, and the anonymous reviewers for feedback on previous drafts of this manuscript.

Author information

Correspondence to Mohammadhosein Hasanbeig.

Appendix: Proof of Theorem 1

Theorem 1. Let \(\varphi \) be a given LTL formula and \(\mathfrak {M}_\mathfrak {A}\) be the product MDP constructed by synchronising the MDP \(\mathfrak {M}\) with the LDBA \(\mathfrak {A}\) associated with \(\varphi \). Then the optimal stationary Markov policy on \(\mathfrak {M}_\mathfrak {A}\) that maximises the expected return also maximises the probability of satisfying \(\varphi \), and it induces a finite-memory policy on the MDP \(\mathfrak {M}\).

Proof. Assume that the optimal Markov policy on \(\mathfrak {M}_\mathfrak {A}\) is \({\pi ^\otimes }^*\), namely at each state \(s^\otimes \) in \(\mathfrak {M}_\mathfrak {A}\) we have

$$\begin{aligned} {\pi ^\otimes }^*(s^\otimes )=\arg \!\!\!\!\sup \limits _{\pi ^\otimes \in \mathcal {D}^\otimes } {U}^{\pi ^\otimes }(s^\otimes )=\arg \!\!\!\!\sup \limits _{\pi ^\otimes \in \mathcal {D}^\otimes }\mathbb {E}^{\pi ^\otimes } [\sum _{n=0}^{\infty } \gamma ^n~ R(s^\otimes _n,a_n)|s^\otimes _0=s^\otimes ], \end{aligned}$$
(15)

where \(\mathcal {D}^\otimes \) is the set of stationary deterministic policies over the state space \(\mathcal {S}^\otimes \), \(\mathbb {E}^{\pi ^\otimes } [\cdot ]\) denotes the expectation given that the agent follows policy \(\pi ^\otimes \), and \(s_0^\otimes ,a_0,s_1^\otimes ,a_1,\ldots \) is a generic path generated by the product MDP under policy \(\pi ^\otimes \).

Recall that an infinite word \(w \in {\Sigma }^\omega ,~\Sigma =2^\mathcal {AP}\) is accepted by the LDBA \(\mathfrak {A}=(\mathcal {Q},q_0,\Sigma , \mathcal {F}, \varDelta )\) if there exists an infinite run \(\theta \in \mathcal {Q}^\omega \) starting from \(q_0\) such that \(\theta [i+1] \in \varDelta (\theta [i],w[i]),~i \ge 0\) and, for each \(F_j \in \mathcal {F}\), \( inf (\theta ) \cap F_j \ne \emptyset \), where \( inf (\theta )\) is the set of states that are visited infinitely often in the sequence \(\theta \). From Definition 8, the run \(\theta \) associated with an infinite path \(\rho = s^\otimes _0 \xrightarrow {a_0} s^\otimes _1 \xrightarrow {a_1} ...\) in the product MDP is \(\theta = L^\otimes (s^\otimes _0)L^\otimes (s^\otimes _1)...\). From Definition 9 and (10), and since for an accepting run \( inf (\theta ) \,\cap \, F_j \ne \emptyset ,~\forall F_j \in \mathcal {F}\), all accepting paths starting from \(s_0^\otimes \) accumulate an infinite number of positive rewards \(r_p\) (see Remark 2).
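To see concretely why such runs dominate in expected return, the following toy computation compares a run that keeps collecting \(r_p\) with one that is rewarded only finitely often. It is a simplifying sketch: it uses a constant discount factor rather than the tuned, state-dependent discounting referred to in (1), and reduces the reward of Definition 9 to a constant \(r_p\) on accepting visits.

```python
# Toy check (constant gamma, constant r_p): a run rewarded ever after
# approaches r_p / (1 - gamma), while a run rewarded only finitely often
# stays strictly below that bound.
def discounted_return(rewards, gamma=0.99):
    return sum(r * gamma ** n for n, r in enumerate(rewards))

r_p, horizon = 1.0, 10_000                       # truncation of the infinite sum
accepting = [r_p] * horizon                      # r_p received ever after
rejecting = [r_p] * 5 + [0.0] * (horizon - 5)    # only finitely many rewards

print(discounted_return(accepting))   # ~ 100.0 = r_p / (1 - gamma)
print(discounted_return(rejecting))   # ~ 4.90, bounded away from 100
```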

In the following, we show by contradiction that any optimal policy \({\pi ^\otimes }^*\) satisfies the property with the maximum possible probability. Assume that there exists a stationary deterministic Markov policy \({\pi ^\otimes }^+\ne {\pi ^\otimes }^*\) over the state space \(\mathcal {S}^\otimes \) such that the probability of satisfying \(\varphi \) under \({\pi ^\otimes }^+\) is maximal, while under \({\pi ^\otimes }^*\) it is not.

This means that, in the product MDP \(\mathfrak {M}_\mathfrak {A}\), by following \({\pi ^\otimes }^+\) the expectation of reaching the point from which \( inf (\theta ) \cap F_j \ne \emptyset ,~\forall F_j \in \mathcal {F}\), and hence a positive reward is received ever after, is higher than under any other policy, including \({\pi ^\otimes }^*\). With a suitably tuned discount factor \(\gamma \), e.g. as in (1),

$$\begin{aligned} \mathbb {E}^{{\pi ^\otimes }^+} [\sum _{n=0}^{\infty } \gamma ^n~ R(s^\otimes _n,a_n)|s^\otimes _0=s^\otimes ] > \mathbb {E}^{{\pi ^\otimes }^*} [\sum _{n=0}^{\infty } \gamma ^n~ R(s^\otimes _n,a_n)|s^\otimes _0=s^\otimes ] \end{aligned}$$
(16)

This contradicts the optimality of \({\pi ^\otimes }^*\) in (15), and we conclude that \({\pi ^\otimes }^*={\pi ^\otimes }^+\). Namely, an optimal policy that maximises the expected return also maximises the probability of satisfying the LTL property \(\varphi \). It is easy to see that the projection of the policy \({\pi ^\otimes }^*\) onto the MDP \(\mathfrak {M}\) is a finite-memory policy \(\pi ^*\).    \(\Box \)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Hasanbeig, M., Kroening, D., Abate, A. (2020). Deep Reinforcement Learning with Temporal Logics. In: Bertrand, N., Jansen, N. (eds) Formal Modeling and Analysis of Timed Systems. FORMATS 2020. Lecture Notes in Computer Science, vol 12288. Springer, Cham. https://doi.org/10.1007/978-3-030-57628-8_1


  • DOI: https://doi.org/10.1007/978-3-030-57628-8_1

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57627-1

  • Online ISBN: 978-3-030-57628-8

  • eBook Packages: Computer Science, Computer Science (R0)
