Abstract
The combination of data-driven learning methods with formal reasoning has seen a surge of interest, as each area has the potential to bolster the other. For instance, formal methods promise to expand the use of state-of-the-art learning approaches towards certification and sample efficiency. In this work, we propose a deep Reinforcement Learning (RL) method for policy synthesis in continuous-state/action unknown environments, under requirements expressed in Linear Temporal Logic (LTL). We show that this combination lifts the applicability of deep RL to complex temporal and memory-dependent policy synthesis goals. We express an LTL specification as a Limit Deterministic Büchi Automaton (LDBA) and synchronise it on-the-fly with the agent/environment. The LDBA in practice monitors the environment, acting as a modular reward machine for the agent: accordingly, a modular Deep Deterministic Policy Gradient (DDPG) architecture is proposed to generate a low-level control policy that maximises the probability of satisfying the given LTL formula. We evaluate our framework on a cart-pole example and on a Mars rover experiment, where we achieve near-perfect success rates, while baselines based on standard RL are shown to fail in practice.
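The reward-machine role of the LDBA described above can be sketched with a toy example. This is an illustrative sketch, not the authors' implementation: the class `Ldba`, the helper `synchronous_reward`, and the reward constant `R_P` are hypothetical names, and the automaton shown is a hand-written toy rather than one translated from an LTL formula.

```python
# Illustrative sketch: an LDBA stepped alongside the environment acts as a
# reward machine, granting a positive reward whenever an accepting state is
# visited. All names here (Ldba, synchronous_reward, R_P) are hypothetical.

class Ldba:
    """A tiny deterministic automaton: transition map, current state, accepting set."""
    def __init__(self, delta, q0, accepting):
        self.delta, self.q, self.accepting = delta, q0, accepting

    def step(self, symbol):
        """Advance on one label read from the environment; report acceptance."""
        self.q = self.delta[(self.q, symbol)]
        return self.q in self.accepting

R_P = 1.0  # positive reward granted on visits to accepting states

def synchronous_reward(ldba, symbol):
    """Reward the agent only when the monitoring LDBA hits an accepting state."""
    return R_P if ldba.step(symbol) else 0.0

# Toy automaton rewarding visits to q1, reached whenever "a" is observed:
delta = {("q0", "a"): "q1", ("q0", "b"): "q0",
         ("q1", "a"): "q1", ("q1", "b"): "q0"}
ldba = Ldba(delta, "q0", accepting={"q1"})
trace = ["b", "a", "a", "b", "a"]          # labels observed along an episode
rewards = [synchronous_reward(ldba, s) for s in trace]
```

In the full framework this sparse, automaton-driven reward replaces hand-crafted reward shaping, and a separate DDPG module is associated with the task decomposition induced by the automaton.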
Notes
1. One-shot means that there is no need to master easy tasks first and then compose them together to accomplish a more complex task.
2. On-the-fly means that the algorithm tracks (or executes) the state of an underlying structure (or a function) without explicitly constructing it.
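The on-the-fly idea can be made concrete with a small sketch: the product of the environment and the automaton is never enumerated; instead the pair (environment state, automaton state) is advanced incrementally at each step. The function and variable names below (`product_step`, `env_step`, `label`) are hypothetical, and the integer-line environment is a toy stand-in for an unknown MDP.

```python
# Hypothetical sketch of on-the-fly product tracking: the product state (s, q)
# is maintained incrementally, without ever constructing S x Q explicitly.

def product_step(env_step, delta, label, s, q, a):
    """Advance the environment and the monitoring automaton together."""
    s_next = env_step(s, a)              # unknown MDP: one sampled transition
    q_next = delta[(q, label(s_next))]   # automaton reads the new state's label
    return s_next, q_next

# Toy environment: integer line, with the label "goal" holding at state 3.
env_step = lambda s, a: s + a
label = lambda s: "goal" if s == 3 else "none"
delta = {("q0", "none"): "q0", ("q0", "goal"): "q1",
         ("q1", "none"): "q1", ("q1", "goal"): "q1"}

s, q = 0, "q0"
for a in [1, 1, 1]:                      # three unit moves to the right
    s, q = product_step(env_step, delta, label, s, q, a)
```

Only the reachable product states ever materialise, which is what makes the construction viable over continuous state spaces, where the product could not be built upfront.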
Acknowledgements
The authors would like to thank Lim Zun Yuan for valuable discussions and technical support, and the anonymous reviewers for feedback on previous drafts of this manuscript.
Appendix: Proof of Theorem 1
Theorem 1. Let \(\varphi \) be a given LTL formula and \(\mathfrak {M}_\mathfrak {A}\) be the product MDP constructed by synchronising the MDP \(\mathfrak {M}\) with the LDBA \(\mathfrak {A}\) associated with \(\varphi \). Then the optimal stationary Markov policy on \(\mathfrak {M}_\mathfrak {A}\) that maximises the expected return also maximises the probability of satisfying \(\varphi \), and induces a finite-memory policy on the MDP \(\mathfrak {M}\).
Proof. Assume that the optimal Markov policy on \(\mathfrak {M}_\mathfrak {A}\) is \({\pi ^\otimes }^*\), namely at each state \(s^\otimes \) in \(\mathfrak {M}_\mathfrak {A}\) we have

\({\pi ^\otimes }^* = \arg \max _{\pi ^\otimes \in \mathcal {D}^\otimes } \mathbb {E}^{\pi ^\otimes }\Big [\sum _{n=0}^{\infty } \gamma ^n\, R(s_n^\otimes , a_n)\Big ], \qquad (15)\)
where \(\mathcal {D}^\otimes \) is the set of stationary deterministic policies over the state space \(\mathcal {S}^\otimes \), \(\mathbb {E}^{\pi ^\otimes } [\cdot ]\) denotes the expectation given that the agent follows policy \(\pi ^\otimes \), and \(s_0^\otimes ,a_0,s_1^\otimes ,a_1,\ldots \) is a generic path generated by the product MDP under policy \(\pi ^\otimes \).
Recall that an infinite word \(w \in {\Sigma }^\omega ,~\Sigma =2^\mathcal {AP}\) is accepted by the LDBA \(\mathfrak {A}=(\mathcal {Q},q_0,\Sigma , \mathcal {F}, \varDelta )\) if there exists an infinite run \(\theta \in \mathcal {Q}^\omega \) starting from \(q_0\) where \(\theta [i+1] \in \varDelta (\theta [i],w[i]),~i \ge 0\) and, for each \(F_j \in \mathcal {F}\), \( inf (\theta ) \cap F_j \ne \emptyset ,\) where \( inf (\theta )\) is the set of states that are visited infinitely often in the sequence \(\theta \). From Definition 8, the associated run \(\theta \) of an infinite path in the product MDP \(\rho = s^\otimes _0 \xrightarrow {a_0} s^\otimes _1 \xrightarrow {a_1} ...\) is \(\theta = L^\otimes (s^\otimes _0)L^\otimes (s^\otimes _1)...\). From Definition 9 and (10), and since for an accepting run \( inf (\theta ) \,\cap \, F_j \ne \emptyset ,~\forall F_j \in \mathcal {F}\), all accepting paths starting from \(s_0^\otimes \) accumulate an infinite number of positive rewards \(r_p\) (see Remark 2).
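For a run of lasso shape, i.e. a finite prefix followed by a cycle repeated forever, \( inf (\theta )\) is exactly the set of states on the cycle, so the generalized Büchi condition can be checked directly. The following is an illustrative helper (hypothetical, not from the paper) making that check explicit:

```python
# Illustrative check of the generalized Buchi condition on a lasso-shaped run
# theta = prefix . cycle^omega. Since only the cycle repeats forever,
# inf(theta) = set(cycle); the prefix is visited finitely often.
# The helper name accepts_lasso is hypothetical.

def accepts_lasso(prefix, cycle, accepting_sets):
    """True iff every accepting set F_j intersects inf(theta) = set(cycle)."""
    inf_theta = set(cycle)  # states visited infinitely often
    return all(inf_theta & F for F in accepting_sets)

# A run cycling through q1 and q2 meets both accepting sets below:
run_ok = accepts_lasso(prefix=["q0"], cycle=["q1", "q2"],
                       accepting_sets=[{"q1"}, {"q2", "q3"}])
# A run cycling only through q1 misses the second accepting set:
run_bad = accepts_lasso(prefix=["q0"], cycle=["q1"],
                        accepting_sets=[{"q1"}, {"q2"}])
```

This mirrors why accepting paths accumulate infinitely many positive rewards: every accepting set is revisited on each traversal of the cycle.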
In the following we show, by contradiction, that any optimal policy \({\pi ^\otimes }^*\) satisfies the property with the maximum possible probability. Assume that there exists a stationary deterministic Markov policy \({\pi ^\otimes }^+\ne {\pi ^\otimes }^*\) over the state space \(\mathcal {S}^\otimes \) such that the probability of satisfying \(\varphi \) under \({\pi ^\otimes }^+\) is maximum.
This means that, in the product MDP \(\mathfrak {M}_\mathfrak {A}\), by following \({\pi ^\otimes }^+\) the expectation of reaching the point where \( inf (\theta ) \cap F_j \ne \emptyset ,~\forall F_j \in \mathcal {F}\) holds, and a positive reward is received ever after, is higher than under any other policy, including \({\pi ^\otimes }^*\). With a tuned discount factor \(\gamma \), e.g. (1),

\(\mathbb {E}^{{\pi ^\otimes }^+}\Big [\sum _{n=0}^{\infty } \gamma ^n\, R(s_n^\otimes , a_n)\Big ] > \mathbb {E}^{{\pi ^\otimes }^*}\Big [\sum _{n=0}^{\infty } \gamma ^n\, R(s_n^\otimes , a_n)\Big ].\)

This contradicts the optimality of \({\pi ^\otimes }^*\) (15), and we conclude that \({\pi ^\otimes }^*={\pi ^\otimes }^+\). Namely, an optimal policy that maximises the expected return also maximises the probability of satisfying the LTL property \(\varphi \). It is easy to see that the projection of the policy \({\pi ^\otimes }^*\) onto the MDP \(\mathfrak {M}\) is a finite-memory policy \(\pi ^*\). \(\Box \)
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Hasanbeig, M., Kroening, D., Abate, A. (2020). Deep Reinforcement Learning with Temporal Logics. In: Bertrand, N., Jansen, N. (eds) Formal Modeling and Analysis of Timed Systems. FORMATS 2020. Lecture Notes in Computer Science(), vol 12288. Springer, Cham. https://doi.org/10.1007/978-3-030-57628-8_1
Print ISBN: 978-3-030-57627-1
Online ISBN: 978-3-030-57628-8